Course on CUDA Programming

CUDA Programming on NVIDIA GPUs, March 23-25, 2026, at UT Austin

This will be a 3-day hands-on course for students, postdocs, academics and others who want to learn how to develop applications to run on NVIDIA GPUs using the CUDA programming environment. The only prerequisites are some proficiency with C and basic C++ programming; no prior experience with parallel computing is assumed.

The course consists of approximately 3 hours of lectures and 3 hours of practicals for each of the first two days, plus 6 hours of lectures on the third day. Additional advanced practicals can be completed afterwards.

The aim is that by the end of the course you will be able to write relatively simple CUDA programs, and will have the confidence to continue learning by studying the examples provided by NVIDIA on GitHub.

There will be time during March 26-27 for follow-on discussions on the use of CUDA for specific research projects.

Venue

The lectures and practicals will all take place in POB Seminar Room 6.304 in the Oden Institute. Attendees should bring fully charged laptops for carrying out the practicals on TACC.

Timetable

For the first two days we will follow this approximate timetable:

08:00 - 09:30 lecture
09:30 - 10:00 break
10:00 - 11:30 practical
11:30 - 12:30 lunch break
12:30 - 14:00 lecture
14:00 - 14:30 break
14:30 - 16:00 practical

On the third day we will switch to having two lectures in the morning, and two in the afternoon.

Preliminary Reading

Please read sections 1.1 and 1.2 of the new NVIDIA CUDA Programming Guide, which is available both as a PDF and as online HTML.

CUDA is an extension of C/C++, so if you are a little rusty with C/C++ you should refresh your memory of it. Here are links to a couple of introductory lectures on C and an online resource.

Additional References

online CUDA documentation

CUDA Compiler Driver NVCC
CUDA-gdb debugger

CUDA Fortran
CUDA Fortran Programming Guide

PTX ISA (low-level instructions)

Nsight Visual Studio
Nsight Eclipse
Nsight Kernel Profiling Guide
Nsight Compute Command Line Interface
Nsight Compute User Interface
Compute Sanitizer (including memchk and racecheck tools)
other Nsight tools

CUDA code samples on GitHub (an old version also available on Frontera at /opt/apps/cuda/11.3/samples/)

NVIDIA webpage listing the Compute Capability of all GPUs
Wikipedia pages on NVIDIA HPC cards, and GeForce 40 and GeForce 50 graphics cards

Hopper Tuning Guide
Hopper H100 White Paper

Blackwell Tuning Guide
Blackwell White Paper

Quadro RTX 5000 datasheet

Lectures

lecture 1: An introduction to CUDA
lecture 2: Different memory and variable types
lecture 3: Control flow and synchronisation
lecture 4: Warp shuffles, and reduction / scan operations
lecture 5: Tensor cores, libraries and tools
lecture 6: Streams, and odds and ends
lecture 7: Tackling a new CUDA application
lecture 8: Looking to the future

extra research talks (not presented):
FlashAttention -- an interesting CUDA application
Automated CUDA code generation
Sparse matrix-vector multiplication
Use of GPUs for Explicit and Implicit Finite Difference Methods
OP2 "Library" for Unstructured Grids

Practicals

We will be working under Linux on GPU nodes which are part of TACC's Frontera system. Before starting the practicals, please read these notes on using the Frontera system, and have a look at the online Frontera User Guide.

Datasheet for Quadro RTX 5000 GPU which we will be using in our practicals.

The practicals all use these header files (helper_cuda.h, helper_string.h) which came originally from the CUDA SDK. They provide routines for error-checking and initialisation.

Tar files for all practicals

practicals.tar.gz contains the Makefile version of the practicals

Practical 1

Application: a trivial "hello world" example

CUDA aspects: launching a kernel, copying data to/from the graphics card, error checking and printing from kernel code

Note: the Frontera notes explain how the files for all of the practicals can be obtained from my master tar file, so there is no need to download individual files from here.

Practical 2

Application: Monte Carlo simulation using NVIDIA's CURAND library for random number generation

CUDA aspects: constant memory, random number generation, kernel timing, minimising device memory bandwidth requirements

Practical 3

Application: 3D Laplace finite difference solver

CUDA aspects: thread block size optimisation, multi-dimensional memory layout, performance profiling
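
As an illustration of what "multi-dimensional memory layout" can mean here, the following is a minimal host-side C sketch, not the practical's own code. It assumes the common convention of storing a 3D grid in a 1D array with the i index varying fastest, and performs one Jacobi sweep for the Laplace equation on the CPU. The names idx3d and jacobi_sweep are hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Map grid coordinates (i,j,k) to a 1D array offset, with i varying fastest.
   Assumed layout: offset = i + nx*(j + ny*k). */
size_t idx3d(int i, int j, int k, int nx, int ny)
{
    return (size_t)i + (size_t)nx * ((size_t)j + (size_t)ny * (size_t)k);
}

/* One Jacobi sweep for the 3D Laplace equation on an nx*ny*nz grid:
   boundary values are copied unchanged, and each interior point becomes
   the average of its six nearest neighbours. */
void jacobi_sweep(int nx, int ny, int nz, const float *u_old, float *u_new)
{
    assert(nx >= 1 && ny >= 1 && nz >= 1);

    size_t sx  = (size_t)nx;               /* offset step in the j direction */
    size_t sxy = (size_t)nx * (size_t)ny;  /* offset step in the k direction */

    for (int k = 0; k < nz; k++) {
        for (int j = 0; j < ny; j++) {
            for (int i = 0; i < nx; i++) {
                size_t c = idx3d(i, j, k, nx, ny);

                if (i == 0 || i == nx - 1 ||
                    j == 0 || j == ny - 1 ||
                    k == 0 || k == nz - 1) {
                    u_new[c] = u_old[c];   /* boundary point: keep fixed */
                } else {
                    float sum = u_old[c - 1]   + u_old[c + 1]
                              + u_old[c - sx]  + u_old[c + sx]
                              + u_old[c - sxy] + u_old[c + sxy];
                    u_new[c] = sum / 6.0f;
                }
            }
        }
    }
}
```

With this layout, neighbouring values of i are adjacent in memory. In a CUDA version one would typically map consecutive threads to consecutive i so that a warp reads contiguous addresses, which is one reason the choice of thread block dimensions affects performance.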

Practical 4

Application: reduction

CUDA aspects: dynamic shared memory, thread synchronisation, shuffles, atomics

instructions (PDF)
reduction.cu
Makefile
round_up_test.c code to round an integer up to nearest power of 2
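
For readers who want to see the idea before opening the files, here is a minimal C sketch of rounding an integer up to the nearest power of 2, together with a serial emulation of a tree reduction that uses it. This is not the contents of round_up_test.c or reduction.cu; the function names round_up_pow2 and tree_sum are hypothetical, and the code assumes 32-bit unsigned inputs in the range 1 to 2^31. One common reason for the rounding is that a halving-stride reduction is simplest to write when the starting stride is a power of 2.

```c
#include <assert.h>
#include <stdint.h>

/* Round n up to the nearest power of 2 (n itself if it already is one).
   Assumes 1 <= n <= 2^31 so that the result fits in 32 bits. */
uint32_t round_up_pow2(uint32_t n)
{
    assert(n >= 1 && n <= 0x80000000u);
    n--;            /* ensures exact powers of 2 map to themselves */
    n |= n >> 1;    /* copy the highest set bit into every lower bit */
    n |= n >> 2;
    n |= n >> 4;
    n |= n >> 8;
    n |= n >> 16;
    return n + 1;   /* all-ones below the top bit, plus 1, is a power of 2 */
}

/* Serial emulation of a tree reduction over data[0..n-1].
   The stride starts at half the rounded-up size and halves each pass;
   the bounds check skips partners that lie beyond the end of the array.
   The array is overwritten and the sum is returned (and left in data[0]). */
float tree_sum(float *data, uint32_t n)
{
    for (uint32_t stride = round_up_pow2(n) / 2; stride > 0; stride /= 2) {
        for (uint32_t i = 0; i < stride; i++) {
            if (i + stride < n) {
                data[i] += data[i + stride];
            }
        }
    }
    return data[0];
}
```

In a CUDA kernel the inner loop over i would be carried out by the threads of a block, with a synchronisation between passes; the serial version above only illustrates the access pattern.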

The following practicals provide scope for additional practice after the course is over.

Practical 5

Application: using Tensor Cores and cuBLAS and other libraries

Practical 6

Application: revisiting the simple "hello world" example

CUDA aspects: using g++ for the main code, building libraries, using templates

Practical 7

Application: tri-diagonal equations

Practical 8

Application: scan operation and recurrence equations

Practical 9

Application: pattern matching

Practical 10

Application: auto-tuning

instructions (PDF)
Flamingo auto-tuning software

Practical 11

Application: streams and OpenMP multithreading

Practical 12

Application: more on streams and overlapping computation and communication

Acknowledgements

Many thanks to:

TACC for the GPU resources
