Course on CUDA Programming

Course on CUDA Programming on NVIDIA GPUs, Nov 28 - Dec 9, 2022, at UT Austin

This is a 2-week hands-on course for students, postdocs, academics and others who want to learn how to develop applications to run on NVIDIA GPUs using the CUDA programming environment. All that will be assumed is some proficiency with C and basic C++ programming. No prior experience with parallel computing will be assumed.

The course consists of approximately 1.5 hours of lectures and 2 hours of practicals each afternoon. The aim is that by the end of the course you will be able to write relatively simple programs and will be confident and able to continue learning through studying the examples provided by NVIDIA on GitHub.

All attendees should bring a laptop to access the GPUs servers on TACC.

Venue

The lectures and practicals will all take place in room 4.304 in the Oden Institute. Attendees should bring fully-charged laptops for carrying out the practicals.

Timetable

The course will follow the following timetable:

13:00 - 14:30 lecture
14:30 - 14:45 break
14:45 - 18:00 practical

The practical should take only about 2 hours, but the room is reserved until 18:00 in case some people may have to go off to lectures, seminars, group meetings at some point during the afternoon.

Preliminary Reading

Please read chapters 1 and 2 of the NVIDIA CUDA C Programming Guide which is available both as PDF and online HTML.

CUDA is an extension of C/C++, so if you are a little rusty with C/C++ you should refresh your memory of it.

Additional References

online CUDA documentation

CUDA homepage
CUDA Runtime API
CUDA C++ Best Practices Guide

CUDA Compiler Driver NVCC
CUDA-gdb debugger

CUDA maths library
CUBLAS library
CUFFT library
CUSPARSE library
CURAND library
NCCL multi-GPU communications library

CUDA Fortran
CUDA Fortran Programming Guide

PTX (low-level instructions)

Nsight Visual Studio
Nsight Eclipse
Nsight Kernel Profiling Guide
Nsight Compute Command Line Interface (which has superseded nvprof)
Nsight Compute User Interface
Compute Sanitizer (including memchk and racecheck tools)
other Nsight tools

CUDA code samples on GitHub (also available on Lonestar6 at /opt/apps/cuda/11.4/samples/)

OpenACC
OpenMP 5.0 for Accelerators

helper_math.h header file defining operator-overloading operations for CUDA intrinsic vector datatypes such as float4
dbldbl.h header file defining double-double arithmetic for quad-precision (originally developed by NVIDIA, but not supported)

NVIDIA webpage listing Compute Capability type of all GPUs
Wikipedia pages on NVIDIA HPC cards, and GeForce 30 and GeForce 40 graphics cards

Volta Tuning Guide
Volta V100 White Paper

Ampere Tuning Guide
Ampere A100 White Paper

Hopper Tuning Guide
Hopper H100 White Paper

arXiv paper using microbenchmarking to assess Volta memory performance (including atomics)

Lectures

lecture 1: (4 slides per page) An introduction to CUDA
lecture 2: (4 slides per page) Different memory and variable types
lecture 3: (4 slides per page) Control flow and synchronisation
lecture 4: (4 slides per page) Warp shuffles, and reduction / scan operations
lecture 5: (4 slides per page) Libraries and tools
lecture 6: (4 slides per page) Multiple GPUs, and odds and ends
lecture 7: (4 slides per page) Tackling a new CUDA application

lecture 8: possible research talks, out of which I have selected:
Use of GPUs for Explicit and Implicit Finite Difference Methods
OP2: An Active Library Framework for Unstructured Mesh Applications

lecture 9: Thoughts from TACC on Next Generation Academic Supercomputing Systems, guest lecture by Dan Stanzione (TACC)
lecture 10: guest lecture by Paul Bauman (AMD)

Practicals

We will be working under Linux on GPU nodes which are part of TACC's Lonestar6 system. Before starting the practicals, please read these notes on using the Lonestar6 system, and have a look at the online Lonestar6 User Guide.

The practicals all use these header files (helper_cuda.h, helper_string.h) which came originally from the CUDA SDK. They provide routines for error-checking and initialisation.

Tar files for all practicals

practicals.tar.gz contains the Makefile version of the practicals
practicals_nsight.tar.gz contains an old Nsight (Eclipse) version of the practicals

Practical 1

Application: a trivial "hello world" example

CUDA aspects: launching a kernel, copying data to/from the graphics card, error checking and printing from kernel code

Note: the instructions explain how files can be copied from my user account so there's no need to download from here

Practical 2

Application: Monte Carlo simulation using NVIDIA's CURAND library for random number generation

CUDA aspects: constant memory, random number generation, kernel timing, minimising device memory bandwidth requirements

Practical 3

Application: 3D Laplace finite difference solver

CUDA aspects: thread block size optimisation, multi-dimensional memory layout

Practical 4

Application: reduction

CUDA aspects: dynamic shared memory, thread synchronisation

Practical 5

Application: using the CUBLAS and CUFFT libraries

Practical 6

Application: revisiting the simple "hello world" example

CUDA aspects: using g++ for the main code, building libraries, using templates

Practical 7

Application: tri-diagonal equations

Practical 8

Application: scan operation and recurrence equations

Practical 9

Application: pattern matching

Practical 10

Application: auto-tuning

instructions (PDF)
Flamingo auto-tuning software

Practical 11

Application: streams and OpenMP multithreading

Practical 12

Application: more on streams and overlapping computation and communication

Acknowledgements

Many thanks to:

TACC for the GPU resources
Will Ruys for his help as TA
Dan Stanzione and Paul Bauman for their presentations

webpage link checker