Course on CUDA Programming, November 30 -- December 15, 2023, at AIMS South Africa

This 2.5-week course teaches how to develop parallel applications that run on NVIDIA GPUs. It assumes only some proficiency with C and basic C++ programming; no prior experience with parallel computing is required.

By the end of the course you should be able to write relatively simple parallel programs, and feel confident continuing to learn CUDA by studying the code samples provided by NVIDIA on GitHub.


CUDA Programming references

As preliminary reading, please read chapters 1 and 2 of the NVIDIA CUDA C++ Programming Guide, which is available both as a PDF and as online HTML.

There is a lot of other information available online. Some of it may be useful, but you do not need to read most of it.

Lectures

Week 1: Week 2: Week 3: Week 4?


Practicals

All the practicals use the header files helper_cuda.h and helper_string.h, which originally came from the CUDA SDK. They provide routines for error checking and initialisation.

Practical 1

Application: a trivial "hello world" example

CUDA aspects: launching a kernel, copying data to/from the graphics card, error checking and printing from kernel code
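
The CUDA aspects listed above can be seen together in a very small program. What follows is a minimal sketch, not the practical's actual code: the names CHECK, hello_kernel and run_hello are hypothetical, and the CHECK macro is a hand-written stand-in rather than anything taken from helper_cuda.h. It assumes a CUDA-capable device and compilation with nvcc.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not from helper_cuda.h): abort on any CUDA error.
#define CHECK(call)                                                    \
  do {                                                                 \
    cudaError_t e_ = (call);                                           \
    if (e_ != cudaSuccess) {                                           \
      fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,               \
              cudaGetErrorString(e_));                                 \
      exit(EXIT_FAILURE);                                              \
    }                                                                  \
  } while (0)

// Each thread prints a greeting and stores its global index.
__global__ void hello_kernel(int *out, int n)
{
  int tid = threadIdx.x + blockIdx.x * blockDim.x;
  if (tid < n) {
    printf("hello from thread %d in block %d\n", threadIdx.x, blockIdx.x);
    out[tid] = tid;
  }
}

// Launch the kernel, copy the results back, and return their sum.
int run_hello(int nblocks, int nthreads)
{
  assert(nblocks > 0 && nthreads > 0);
  int n = nblocks * nthreads;
  int *h_out = (int *)malloc(n * sizeof(int));
  int *d_out = nullptr;

  CHECK(cudaMalloc(&d_out, n * sizeof(int)));
  hello_kernel<<<nblocks, nthreads>>>(d_out, n);
  CHECK(cudaGetLastError());       // reports an invalid launch configuration
  CHECK(cudaDeviceSynchronize());  // waits for the kernel and flushes printf
  CHECK(cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost));

  int sum = 0;
  for (int i = 0; i < n; i++) sum += h_out[i];

  CHECK(cudaFree(d_out));
  free(h_out);
  return sum;
}

int main()
{
  // 2 blocks of 4 threads give indices 0..7, which sum to 28.
  printf("sum of thread indices = %d\n", run_hello(2, 4));
  return 0;
}
```

The order in which the greetings appear is not guaranteed, since blocks and threads may execute in any order.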

Practical 2

Application: Monte Carlo simulation using NVIDIA's CURAND library for random number generation

CUDA aspects: constant memory, random number generation, kernel timing, minimising device memory bandwidth requirements
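
Two of these aspects, constant memory and kernel timing, can be illustrated without the Monte Carlo application itself. The sketch below is an assumption-laden stand-in rather than the practical's code: it does not use CURAND, it applies a simple affine map y = a*x + b, and the names CHECK, c_a, c_b, affine_kernel and affine_on_gpu are all hypothetical. The parameters a and b are placed in constant memory because every thread reads the same values, and the kernel is timed with CUDA events.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper (not from helper_cuda.h): abort on any CUDA error.
#define CHECK(call)                                                    \
  do {                                                                 \
    cudaError_t e_ = (call);                                           \
    if (e_ != cudaSuccess) {                                           \
      fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,               \
              cudaGetErrorString(e_));                                 \
      exit(EXIT_FAILURE);                                              \
    }                                                                  \
  } while (0)

// Parameters read by every thread are kept in constant memory.
__constant__ float c_a, c_b;

__global__ void affine_kernel(const float *x, float *y, int n)
{
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  if (i < n) y[i] = c_a * x[i] + c_b;
}

// Computes y = a*x + b on the GPU and stores the kernel time (ms) in *ms.
std::vector<float> affine_on_gpu(const std::vector<float> &x,
                                 float a, float b, float *ms)
{
  assert(!x.empty());
  int n = (int)x.size();
  size_t bytes = n * sizeof(float);
  std::vector<float> y(n);
  float *d_x = nullptr, *d_y = nullptr;

  CHECK(cudaMalloc(&d_x, bytes));
  CHECK(cudaMalloc(&d_y, bytes));
  CHECK(cudaMemcpy(d_x, x.data(), bytes, cudaMemcpyHostToDevice));
  CHECK(cudaMemcpyToSymbol(c_a, &a, sizeof(float)));
  CHECK(cudaMemcpyToSymbol(c_b, &b, sizeof(float)));

  cudaEvent_t start, stop;
  CHECK(cudaEventCreate(&start));
  CHECK(cudaEventCreate(&stop));

  int threads = 256;
  int blocks = (n + threads - 1) / threads;
  CHECK(cudaEventRecord(start));
  affine_kernel<<<blocks, threads>>>(d_x, d_y, n);
  CHECK(cudaEventRecord(stop));
  CHECK(cudaGetLastError());
  CHECK(cudaEventSynchronize(stop));             // wait until the kernel is done
  CHECK(cudaEventElapsedTime(ms, start, stop));  // elapsed time in milliseconds

  CHECK(cudaMemcpy(y.data(), d_y, bytes, cudaMemcpyDeviceToHost));
  CHECK(cudaEventDestroy(start));
  CHECK(cudaEventDestroy(stop));
  CHECK(cudaFree(d_x));
  CHECK(cudaFree(d_y));
  return y;
}

int main()
{
  float ms = 0.0f;
  std::vector<float> y = affine_on_gpu({1.0f, 2.0f, 3.0f}, 2.0f, 1.0f, &ms);
  printf("y = %g %g %g, kernel time = %g ms\n", y[0], y[1], y[2], ms);
  return 0;
}
```

For an input this small the measured time is dominated by launch overhead, so timings only become meaningful for much larger arrays.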

Practical 3

Application: 3D Laplace finite difference solver

CUDA aspects: thread block size optimisation, multi-dimensional memory layout
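
To make the memory layout concrete, here is a minimal sketch of a single Jacobi sweep for the 3D Laplace equation. It is an illustration under stated assumptions, not the practical's code: the names CHECK, laplace3d and jacobi_sweep are hypothetical, the grid is stored as a flat array with the x index varying fastest, and boundary values are held fixed. The thread block shape is passed in as a parameter, since that is the quantity one would vary when optimising the block size.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper (not from helper_cuda.h): abort on any CUDA error.
#define CHECK(call)                                                    \
  do {                                                                 \
    cudaError_t e_ = (call);                                           \
    if (e_ != cudaSuccess) {                                           \
      fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,               \
              cudaGetErrorString(e_));                                 \
      exit(EXIT_FAILURE);                                              \
    }                                                                  \
  } while (0)

// One Jacobi sweep: each interior point becomes the average of its six
// neighbours; boundary points are copied unchanged.
__global__ void laplace3d(int NX, int NY, int NZ, const float *u1, float *u2)
{
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  int j = threadIdx.y + blockIdx.y * blockDim.y;
  int k = threadIdx.z + blockIdx.z * blockDim.z;
  if (i >= NX || j >= NY || k >= NZ) return;  // thread lies outside the grid

  int idx = i + j * NX + k * NX * NY;         // x index varies fastest

  if (i == 0 || i == NX - 1 || j == 0 || j == NY - 1 || k == 0 || k == NZ - 1)
    u2[idx] = u1[idx];
  else
    u2[idx] = (u1[idx - 1] + u1[idx + 1] + u1[idx - NX] + u1[idx + NX] +
               u1[idx - NX * NY] + u1[idx + NX * NY]) * (1.0f / 6.0f);
}

// Runs one sweep on the GPU and returns the updated grid.
std::vector<float> jacobi_sweep(const std::vector<float> &u,
                                int NX, int NY, int NZ, dim3 block)
{
  assert(NX > 0 && NY > 0 && NZ > 0);
  assert(u.size() == (size_t)NX * NY * NZ);
  size_t bytes = u.size() * sizeof(float);
  std::vector<float> result(u.size());
  float *d_u1 = nullptr, *d_u2 = nullptr;

  CHECK(cudaMalloc(&d_u1, bytes));
  CHECK(cudaMalloc(&d_u2, bytes));
  CHECK(cudaMemcpy(d_u1, u.data(), bytes, cudaMemcpyHostToDevice));

  // Round up so that the grid of blocks covers every point.
  dim3 grid((NX + block.x - 1) / block.x,
            (NY + block.y - 1) / block.y,
            (NZ + block.z - 1) / block.z);
  laplace3d<<<grid, block>>>(NX, NY, NZ, d_u1, d_u2);
  CHECK(cudaGetLastError());
  CHECK(cudaMemcpy(result.data(), d_u2, bytes, cudaMemcpyDeviceToHost));

  CHECK(cudaFree(d_u1));
  CHECK(cudaFree(d_u2));
  return result;
}

int main()
{
  // 3x3x3 grid: boundary values 1, single interior point 0.
  std::vector<float> u(27, 1.0f);
  u[13] = 0.0f;  // centre point: 1 + 1*3 + 1*9
  std::vector<float> v = jacobi_sweep(u, 3, 3, 3, dim3(8, 8, 4));
  printf("centre after one sweep = %f\n", v[13]);  // expected to be close to 1
  return 0;
}
```

Because consecutive threads in the x direction touch consecutive memory locations, wide-in-x block shapes tend to give better memory access patterns, though the best shape depends on the hardware and should be measured.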

Practical 4

Application: reduction

CUDA aspects: dynamic shared memory, thread synchronisation
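
As an illustration of how dynamic shared memory and thread synchronisation work together, the sketch below sums an array with a tree reduction inside each thread block. It is one common formulation rather than the practical's own code, and the names CHECK, block_sum and gpu_sum are hypothetical. It assumes the block size is a power of two and that the array length is a multiple of the block size; the per-block partial sums are added on the host for simplicity.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper (not from helper_cuda.h): abort on any CUDA error.
#define CHECK(call)                                                    \
  do {                                                                 \
    cudaError_t e_ = (call);                                           \
    if (e_ != cudaSuccess) {                                           \
      fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,               \
              cudaGetErrorString(e_));                                 \
      exit(EXIT_FAILURE);                                              \
    }                                                                  \
  } while (0)

// Each block sums blockDim.x elements; blockDim.x must be a power of two.
__global__ void block_sum(const float *in, float *out)
{
  extern __shared__ float temp[];  // size is set by the third launch parameter
  int tid = threadIdx.x;
  temp[tid] = in[tid + blockIdx.x * blockDim.x];

  // Halve the number of active threads at each step of the tree.
  for (int d = blockDim.x / 2; d > 0; d /= 2) {
    __syncthreads();               // make the previous step's writes visible
    if (tid < d) temp[tid] += temp[tid + d];
  }
  if (tid == 0) out[blockIdx.x] = temp[0];
}

// Sums the array on the GPU; partial block sums are combined on the host.
float gpu_sum(const std::vector<float> &h_in, int threads)
{
  int n = (int)h_in.size();
  assert(threads > 0 && threads <= 1024 && (threads & (threads - 1)) == 0);
  assert(n > 0 && n % threads == 0);
  int blocks = n / threads;
  std::vector<float> h_out(blocks);
  float *d_in = nullptr, *d_out = nullptr;

  CHECK(cudaMalloc(&d_in, n * sizeof(float)));
  CHECK(cudaMalloc(&d_out, blocks * sizeof(float)));
  CHECK(cudaMemcpy(d_in, h_in.data(), n * sizeof(float),
                   cudaMemcpyHostToDevice));

  size_t shared_bytes = threads * sizeof(float);  // dynamic shared memory size
  block_sum<<<blocks, threads, shared_bytes>>>(d_in, d_out);
  CHECK(cudaGetLastError());
  CHECK(cudaMemcpy(h_out.data(), d_out, blocks * sizeof(float),
                   cudaMemcpyDeviceToHost));

  float sum = 0.0f;
  for (int b = 0; b < blocks; b++) sum += h_out[b];

  CHECK(cudaFree(d_in));
  CHECK(cudaFree(d_out));
  return sum;
}

int main()
{
  std::vector<float> ones(1024, 1.0f);
  // 1024 ones should sum to 1024.
  printf("sum = %f\n", gpu_sum(ones, 256));
  return 0;
}
```

The __syncthreads() call sits outside the if statement so that every thread in the block reaches it; placing it inside a branch taken by only some threads would be an error.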

The practicals below are optional for those interested in additional experiments. Some could be the subject of end-of-course presentations.

Practical 5

Application: using the CUBLAS and CUFFT libraries

Practical 6

Application: revisiting the simple "hello world" example

CUDA aspects: using g++ for the main code, building libraries, using templates

Practical 7

Application: tri-diagonal equations

Practical 8

Application: scan operation and recurrence equations

Practical 9

Application: pattern matching

Practical 10

Application: auto-tuning

Practical 11

Application: streams and OpenMP multithreading

Practical 12

Application: more on streams and overlapping computation and communication

Ideas for presentation topics



Acknowledgements

Many thanks to: