This is a one-week hands-on course on how to develop applications to run on NVIDIA GPUs using the CUDA programming environment. The only prerequisites are some proficiency with C and basic C++ programming; no prior experience with parallel computing is assumed.

- 10:00 - 11:30 lecture
- 11:30 - 12:00 break
- 12:00 - 13:30 practical
- 13:30 - 14:30 lunch break
- 14:30 - 16:00 lecture
- 16:00 - 16:30 break
- 16:30 - 18:00 practical

- online CUDA documentation
- CUDA homepage
- CUDA Runtime API
- CUDA C++ Best Practices Guide
- CUDA maths library
- cuBLAS library
- cuFFT library
- cuSPARSE library
- cuRAND library
- NCCL multi-GPU communications library
- Nsight Visual Studio
- Nsight Eclipse
- Nsight Kernel Profiling Guide
- Nsight Compute Command Line Interface
- Nsight Compute User Interface
- Compute Sanitizer (including memcheck and racecheck tools)
- other Nsight tools
- CUDA code samples on GitHub
- NVIDIA webpage listing the Compute Capability of all GPUs
- Wikipedia pages on NVIDIA HPC cards, and GeForce 30 and GeForce 40 graphics cards
- Volta Tuning Guide
- Volta V100 White Paper
- Ampere Tuning Guide
- Ampere A100 White Paper

- lecture 1: (4 slides per page) An introduction to CUDA
- lecture 2: (4 slides per page) Different memory and variable types
- lecture 3: (4 slides per page) Control flow and synchronisation
- lecture 4: (4 slides per page) Warp shuffles, and reduction / scan operations
- lecture 5: (4 slides per page) Libraries and tools
- lecture 6: (4 slides per page) Multiple GPUs, and odds and ends
- lecture 7: (4 slides per page) Tackling a new CUDA application (on Friday)
- lecture 8: (4 slides per page) AstroAccelerate research talk
- lecture 9: (4 slides per page) Future Directions
- lecture 10: guest lecture from NVIDIA
- extra research talks (not part of the lectures):

Use of GPUs for Explicit and Implicit Finite Difference Methods

OP2 "Library" for Unstructured Grids

Practical 1 is mandatory but not assessed. Practicals 2-4 are to be completed for assessment. Practicals 7-8 are optional, aimed particularly at those who may want to give a presentation on one of these topics.

CUDA aspects: launching a kernel, copying data to/from the graphics card, error checking and printing from kernel code
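These ingredients can be sketched as follows. This is not the actual practical code: the kernel, macro, and variable names here are our own, and only standard CUDA Runtime API calls are used.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// error-checking macro: a common pattern, not a CUDA built-in
#define CUDA_CHECK(call) do {                                   \
    cudaError_t err = (call);                                   \
    if (err != cudaSuccess) {                                   \
        printf("CUDA error: %s at %s:%d\n",                     \
               cudaGetErrorString(err), __FILE__, __LINE__);    \
        exit(1);                                                \
    }                                                           \
} while (0)

__global__ void add_one(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
    if (i == 0) printf("hello from thread 0\n");   // printing from kernel code
}

int main() {
    const int n = 256;
    float h_x[n];
    for (int i = 0; i < n; i++) h_x[i] = (float) i;

    float *d_x;
    CUDA_CHECK(cudaMalloc(&d_x, n * sizeof(float)));
    // copy data to the graphics card
    CUDA_CHECK(cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice));

    add_one<<<(n + 127) / 128, 128>>>(d_x, n);   // launch: 2 blocks of 128 threads
    CUDA_CHECK(cudaGetLastError());              // catches launch errors
    CUDA_CHECK(cudaDeviceSynchronize());         // catches errors inside the kernel

    // copy the result back from the graphics card
    CUDA_CHECK(cudaMemcpy(h_x, d_x, n * sizeof(float), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_x));
    printf("h_x[10] = %f\n", h_x[10]);
    return 0;
}
```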

CUDA aspects: constant memory, random number generation, kernel timing, minimising device memory bandwidth requirements
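A minimal sketch of constant memory and event-based kernel timing (two of the aspects listed above), with names of our own choosing; the random-number generation part, handled by cuRAND in the practical, is omitted here:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__constant__ float coeff[4];   // constant memory: broadcast to all threads

__global__ void poly(const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float t = x[i];
        // Horner evaluation using the constant-memory coefficients
        y[i] = ((coeff[3]*t + coeff[2])*t + coeff[1])*t + coeff[0];
    }
}

int main() {
    const int n = 1 << 20;
    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));       // give the input defined values

    float h_coeff[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));  // host -> constant memory

    // kernel timing with CUDA events
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    poly<<<(n + 255) / 256, 256>>>(d_x, d_y, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel took %.3f ms\n", ms);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d_x); cudaFree(d_y);
    return 0;
}
```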

- instructions (PDF)
- some mathematical notes (PDF)
- Google Colab notebook

CUDA aspects: thread block size optimisation, multi-dimensional memory layout, performance profiling
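A sketch of 2D thread indexing over a row-major array, illustrating the layout and block-size considerations above (the kernel name and the 32x8 block shape are our own choices; the block shape is exactly the sort of parameter this practical asks you to tune with a profiler):

```cuda
#include <cuda_runtime.h>

__global__ void scale2d(float *a, int nx, int ny) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // column index
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // row index
    // row-major layout: consecutive threadIdx.x values touch consecutive
    // addresses, giving coalesced loads and stores along a row
    if (i < nx && j < ny) a[j * nx + i] *= 2.0f;
}

int main() {
    const int nx = 512, ny = 512;
    float *d_a;
    cudaMalloc(&d_a, nx * ny * sizeof(float));
    cudaMemset(d_a, 0, nx * ny * sizeof(float));

    // 32x8 = 256 threads per block is only a starting point for tuning
    dim3 block(32, 8);
    dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
    scale2d<<<grid, block>>>(d_a, nx, ny);
    cudaDeviceSynchronize();

    cudaFree(d_a);
    return 0;
}
```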

CUDA aspects: dynamic shared memory, thread synchronisation
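The two aspects above can be sketched together in a per-block tree reduction; this is our own illustrative kernel, not the practical's code:

```cuda
__global__ void block_sum(const float *x, float *partial, int n) {
    extern __shared__ float s[];        // dynamic shared memory, sized at launch
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < n) ? x[i] : 0.0f;
    __syncthreads();                    // all loads done before the reduction

    // tree reduction within the block (blockDim.x must be a power of 2)
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();                // must be reached by every thread
    }
    if (tid == 0) partial[blockIdx.x] = s[0];
}

// launch: the third argument sets the dynamic shared-memory size in bytes
// block_sum<<<nblocks, nthreads, nthreads * sizeof(float)>>>(d_x, d_partial, n);
```

The power-of-2 requirement on the block size is why rounding an integer up to the nearest power of 2 (as in Practical 4's round_up_test.c) is useful here.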

- instructions (PDF)
- round_up_test.c code to round an integer up to the nearest power of 2
- Google Colab notebook

- Parallel scan for radix sort of integers
- Parallel scan for recurrence equations
- Solution of tri-diagonal equations
- Use of tensor cores for matrix-matrix multiplication
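A common building block for the first two projects is a parallel scan. A minimal single-block, shared-memory sketch of the Hillis-Steele inclusive scan (our own illustrative kernel; a full solution must also combine results across blocks):

```cuda
__global__ void inclusive_scan(int *data, int n) {
    extern __shared__ int s[];          // sized to blockDim.x ints at launch
    int tid = threadIdx.x;
    s[tid] = (tid < n) ? data[tid] : 0;
    __syncthreads();

    // Hillis-Steele scan: O(n log n) additions but only log2(n) steps
    for (int d = 1; d < blockDim.x; d *= 2) {
        int v = (tid >= d) ? s[tid - d] : 0;   // read before anyone writes
        __syncthreads();
        s[tid] += v;
        __syncthreads();
    }
    if (tid < n) data[tid] = s[tid];
}
```

For radix sort, an exclusive scan of the per-key bit counts gives each element its output position; for recurrences, the scan is run with the recurrence's own associative operator instead of addition.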

- the Mathematical Institute for hosting the lectures
- the Maths Events team for the livestreaming of the lectures
- Emmanuel Ahenkan and Tlotlo Oepeng for helping with the practicals
- Google for the Google Colab system
