Course on CUDA Programming

Course on CUDA Programming on NVIDIA GPUs, Feb 3-21, 2025

This is a 3-week course on developing parallel applications to run on NVIDIA GPUs. It assumes only some proficiency with C and basic C++ programming; no prior experience with parallel computing is required.

The aim is that by the end of the course you will be able to write relatively simple parallel programs, and will feel confident continuing to learn CUDA by studying the code samples provided by NVIDIA on GitHub.

CUDA Programming references

As preliminary reading, please read chapters 1 and 2 of the NVIDIA CUDA C Programming Guide, which is available both as PDF and online HTML.

CUDA is an extension of C/C++, so if you are a little rusty with C/C++ you should refresh your memory of it. Here are links to a couple of introductory lectures on C and an online resource.

There is a lot of other information available online. You may find some of it useful, but you definitely don't need to read most of it.

online CUDA documentation

CUDA code samples on GitHub

NVIDIA webpage listing the Compute Capability of all GPUs
Wikipedia pages on NVIDIA HPC cards, and GeForce 40 and GeForce 50 graphics cards

Volta Tuning Guide
Volta V100 White Paper

Ampere Tuning Guide
Ampere A100 White Paper

Hopper Tuning Guide
Hopper H100 White Paper

Blackwell White Paper

NVIDIA DIGITS system

NVIDIA T4 datasheet

Lectures

lecture 1: (4 slides per page) An introduction to CUDA
lecture 2: (4 slides per page) Different memory and variable types
lecture 3: (4 slides per page) Control flow and synchronisation
lecture 4: (4 slides per page) Warp shuffles, and reduction / scan operations
lecture 5: (4 slides per page) Tensor cores, libraries and tools
lecture 6: (4 slides per page) Streams, and odds and ends
lecture 7: (4 slides per page) Tackling a new CUDA application
lecture 8: (4 slides per page) Looking to the future

extra research talks -- I won't present them all:
FlashAttention -- an interesting CUDA application
Automated CUDA code generation
Sparse matrix-vector multiplication
Use of GPUs for Explicit and Implicit Finite Difference Methods
OP2 "Library" for Unstructured Grids

Practicals

These will be carried out on Google Colab, with your modified notebooks automatically stored on your Google Drive.

Practical 1 is mandatory but not assessed. Practicals 2-4 are to be completed for assessment. Practicals 7-8 are optional, and are intended particularly for those who may want to give a presentation on one of these topics.

Practical 1

Application: a trivial "hello world" example

CUDA aspects: launching a kernel, copying data to/from the graphics card, error checking and printing from kernel code

instructions (PDF)
Google Colab notebook

Practical 2

Application: Monte Carlo simulation using NVIDIA's CURAND library for random number generation

CUDA aspects: constant memory, random number generation, kernel timing, minimising device memory bandwidth requirements

Practical 3

Application: 3D Laplace finite difference solver

CUDA aspects: thread block size optimisation, multi-dimensional memory layout, performance profiling

Practical 4

Application: reduction

CUDA aspects: dynamic shared memory, thread synchronisation, shuffles, atomics

instructions (PDF)
round_up_test.c: code to round an integer up to the nearest power of 2
Google Colab notebook
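
For orientation, here is a minimal sketch of one common way to round an integer up to the nearest power of 2. It is not the course's round_up_test.c; the function name round_up_pow2 and the 32-bit unsigned type are assumptions made for this illustration. A tree reduction that halves the number of active elements at each step is simplest when the element count is a power of 2, which is presumably why such a helper is provided with this practical.

```c
#include <stdint.h>

/* Hypothetical helper (not taken from round_up_test.c).
 * Rounds n up to the nearest power of 2; a value that is already
 * a power of 2 is returned unchanged.
 * Valid for 1 <= n <= 2^31. For n == 0 or n > 2^31 the unsigned
 * arithmetic wraps around and the function returns 0. */
uint32_t round_up_pow2(uint32_t n)
{
    n--;            /* so that exact powers of 2 map to themselves */
    n |= n >> 1;    /* copy the highest set bit into every lower bit */
    n |= n >> 2;
    n |= n >> 4;
    n |= n >> 8;
    n |= n >> 16;
    n++;            /* all-ones below the top bit, plus one, is a power of 2 */
    return n;
}
```

For example, round_up_pow2(5) gives 8 and round_up_pow2(8) gives 8.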

Practical 7

Application: tri-diagonal equations -- see Lecture 7, slide 8, and also this research talk

instructions (PDF)
Google Colab notebook

Practical 8

Application: scan operation and recurrence equations -- see Lecture 4

instructions (PDF)
Google Colab notebook
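
As background on how a scan connects to recurrence equations, the sketch below shows one standard formulation; it is an assumption about the kind of technique involved, not an excerpt from the practical's instructions. A linear recurrence x[n] = a[n]*x[n-1] + b[n] (starting from zero) can be written as an inclusive scan over pairs (a, b), because composing two such affine maps is associative. The names affine_t, compose, scan_affine and recurrence_serial are hypothetical, and the loop over i stands in for what would be parallel threads on a GPU.

```c
#include <stddef.h>
#include <string.h>

/* A pair (a, b) represents the affine map x -> a*x + b. */
typedef struct { long a, b; } affine_t;

/* Apply `first`, then `second`:
 * second(first(x)) = second.a*(first.a*x + first.b) + second.b */
affine_t compose(affine_t first, affine_t second)
{
    affine_t r = { second.a * first.a, second.a * first.b + second.b };
    return r;
}

/* Inclusive scan with doubling offsets (Hillis-Steele style), run
 * sequentially here. v is overwritten with the scanned pairs; tmp is
 * scratch space of length n. Every element within one pass over i is
 * independent of the others, which is what makes the pass parallel. */
void scan_affine(affine_t *v, affine_t *tmp, size_t n)
{
    for (size_t d = 1; d < n; d *= 2) {
        for (size_t i = 0; i < n; i++)
            tmp[i] = (i >= d) ? compose(v[i - d], v[i]) : v[i];
        memcpy(v, tmp, n * sizeof *v);
    }
}

/* Straightforward serial reference: x[i] = a[i]*x[i-1] + b[i], x[-1] = 0. */
void recurrence_serial(const long *a, const long *b, long *x, size_t n)
{
    long prev = 0;
    for (size_t i = 0; i < n; i++) {
        prev = a[i] * prev + b[i];
        x[i] = prev;
    }
}
```

After scan_affine has run on v[i] = (a[i], b[i]), the field v[i].b holds x[i], matching what recurrence_serial computes.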

Ideas for presentation topics

Parallel scan for radix sort of integers
Parallel scan for recurrence equations
Solution of tri-diagonal equations
Use of tensor cores for matrix-matrix multiplication

Acknowledgements

Many thanks to:

Google for the Google Colab system
