Course on CUDA Programming

Course on CUDA Programming on NVIDIA GPUs, July 22-26, 2024

The course will be taught by Prof. Mike Giles and Prof. Wes Armour. They have both used CUDA in their research for many years, and set up and manage JADE, the first national GPU supercomputer for Machine Learning.

We are now ready for online registration here.

Note that attendance is free for those from Oxford University but online registration is required.

This is a one-week hands-on course for students, postdocs, academics and others who want to learn how to develop applications to run on NVIDIA GPUs using the CUDA programming environment. All that will be assumed is some proficiency with C and basic C++ programming. No prior experience with parallel computing will be assumed.

The course consists of approximately 3 hours of lectures and 4 hours of practicals each day. The aim is that by the end of the course you will be able to write relatively simple programs and will be confident and able to continue learning through studying the examples provided by NVIDIA on GitHub.

All attendees should bring a laptop to access the GPU servers which will be used for the practicals.

The costs for the course are:

free for everyone in Oxford (due to central funding)
£250 for those from other UK universities
£500 for those from UK government labs, UK not-for-profit organisations, and foreign universities
£2500 for those from industry and foreign government labs

Anyone with a status which does not fit into one of the categories above, including those outside the UK who are not from a university, company or government lab, should contact me (mike.giles@maths.ox.ac.uk) to discuss the appropriate fee category.

The intention is that these costs should not deter anyone from attending the course. The higher costs for certain participants reflect the fact that they will be paying more for their travel and accommodation, and/or their organisations will be paying more for their time spent attending the course. They also reflect the UK funding for the facilities being used.

We think it is much better for people to attend the course in person so that we can support everyone with the practicals. However, if there is interest in remote attendance, particularly from those in industry who may be unable to come to Oxford for a week, then we are open to the possibility. Please contact me (mike.giles@maths.ox.ac.uk) to discuss how best we could support you with the practicals.

To encourage early registration, the costs will increase by 50% after July 1st.

Venue

The lectures and practicals will all take place in the Mathematical Institute. Attendees should bring laptops for accessing the remote servers to carry out the practicals. Please bring them fully charged if possible; we will try to provide adequate charging points.

Travel to Oxford

For those coming to Oxford, especially from abroad, there is travel advice here.

Accommodation and food

Those attending the course must arrange their own accommodation. The following options are within a few minutes' walk (or bus ride) of the Mathematical Institute, and are listed roughly in order of increasing cost:

University Rooms (St. Anne's, Somerville and Keble colleges are the closest)
Premier Inn -- Westgate (15-20 minute walk)
Travelodge -- Peartree (15 minutes by bus)
easyHotel -- Oxford (10 minutes by bus)
Cotswold Lodge Hotel (10 minute walk)
Old Parsonage Hotel (5 minute walk)

Alternatively, you might consider using Airbnb.

For coffee, breakfast and lunch, there is a good cafe in the basement of the Mathematical Institute. Little Clarendon Street, which is nearby, has several restaurants for dinner (and an excellent ice cream shop), and there are two sandwich shops for lunch on either side of its junction with Woodstock Road (A4144 on Google Maps).

Timetable

For the first three days we will follow this timetable:

09:15 - 10:45 lecture
10:45 - 11:15 break
11:15 - 12:45 practical
12:45 - 14:00 lunch break
14:00 - 15:30 lecture
15:30 - 16:00 break
16:00 - 17:30 practical

On the last two days we will switch to having both lectures in the morning, and then have practicals all afternoon. This provides more time for longer practicals, and will also allow those coming to Oxford from far away to leave when they wish on Friday afternoon.

Preliminary Reading

Please read chapters 1 and 2 of the NVIDIA CUDA C++ Programming Guide, which is available both as PDF and online HTML.

CUDA is an extension of C/C++, so if you are a little rusty with C/C++ you should refresh your memory of it.

Additional References

online CUDA documentation

CUDA homepage
CUDA Runtime API
CUDA C++ Best Practices Guide

CUDA Compiler Driver NVCC
CUDA-gdb debugger

CUDA maths library
CUBLAS library
CUFFT library
CUSPARSE library
CURAND library
NCCL multi-GPU communications library

CUDA Fortran
CUDA Fortran Programming Guide

PTX ISA (low-level instructions)

Nsight Visual Studio
Nsight Eclipse
Nsight Kernel Profiling Guide
Nsight Compute Command Line Interface (which has superseded nvprof)
Nsight Compute User Interface
Compute Sanitizer (including memcheck and racecheck tools)
other Nsight tools

CUDA code samples on GitHub

OpenACC
OpenMP 5.0 for Accelerators

helper_math.h header file defining operator-overloading operations for CUDA intrinsic vector datatypes such as float4
dbldbl.h header file defining double-double arithmetic for quad-precision (originally developed by NVIDIA, but not supported)

NVIDIA webpage listing Compute Capability type of all GPUs
Wikipedia pages on NVIDIA HPC cards, and GeForce 30 and GeForce 40 graphics cards

Volta Tuning Guide
Volta V100 White Paper

Ampere Tuning Guide
Ampere A100 White Paper

Hopper Tuning Guide
Hopper H100 White Paper

arXiv paper using microbenchmarking to assess Volta memory performance (including atomics)
GTC slides on "Dissecting the Ampere GPU Architecture through Microbenchmarking"

Lectures

lecture 1: (4 slides per page) An introduction to CUDA
lecture 2: (4 slides per page) Different memory and variable types
lecture 3: (4 slides per page) Control flow and synchronisation
lecture 4: (4 slides per page) Warp shuffles, and reduction / scan operations
lecture 5: (4 slides per page) Libraries and tools
lecture 6: (4 slides per page) Multiple GPUs, and odds and ends
lecture 7: (4 slides per page) Tackling a new CUDA application
lecture 8: OP2 "Library" for Unstructured Grids research talk (MG)
lecture 9: AstroAccelerate research talk (WA)
lecture 10: (4 slides per page) Future Directions

extra research talk: Use of GPUs for Explicit and Implicit Finite Difference Methods

Practicals

Most attendees will be provided with accounts on the ARC/HTC system which has a number of NVIDIA GPU nodes. Before starting the practicals, please read these ARC notes. Some details on the Slurm batch queueing system are available here.

Those with accounts on JADE may prefer to use it for their practicals.

The practicals all use these header files (helper_cuda.h, helper_string.h) which came originally from the CUDA SDK. They provide routines for error-checking and initialisation.

Tar files for all practicals

practicals.tar.gz contains the Makefile version of the practicals

Practical 1

Application: a trivial "hello world" example

CUDA aspects: launching a kernel, copying data to/from the graphics card, error checking and printing from kernel code

Note: the instructions explain how files can be copied from my user account so there's no need to download from here

Practical 2

Application: Monte Carlo simulation using NVIDIA's CURAND library for random number generation

CUDA aspects: constant memory, random number generation, kernel timing, minimising device memory bandwidth requirements

Practical 3

Application: 3D Laplace finite difference solver

CUDA aspects: thread block size optimisation, multi-dimensional memory layout, performance profiling
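
As a rough illustration of the multi-dimensional memory layout mentioned above, the following plain C sketch performs one Jacobi sweep for Laplace's equation on a 3D grid stored in a flat array. It is a CPU reference sketch only, not the code used in the practical; the names idx3d and jacobi_sweep are hypothetical, and a uniform grid with fixed boundary values is assumed.

```c
#include <stddef.h>

/* Hypothetical helper: map grid coordinates (i,j,k) to a flat array index.
   i varies fastest, so neighbours in the i direction are adjacent in memory. */
size_t idx3d(int i, int j, int k, int nx, int ny)
{
    return (size_t)i + (size_t)nx * ((size_t)j + (size_t)ny * (size_t)k);
}

/* One Jacobi sweep: each interior point of u_new becomes the average of its
   six neighbours in u_old; boundary points are copied unchanged. */
void jacobi_sweep(int nx, int ny, int nz, const float *u_old, float *u_new)
{
    const float sixth = 1.0f / 6.0f;
    const size_t sy = (size_t)nx;               /* stride between j neighbours */
    const size_t sz = (size_t)nx * (size_t)ny;  /* stride between k neighbours */

    for (int k = 0; k < nz; k++) {
        for (int j = 0; j < ny; j++) {
            for (int i = 0; i < nx; i++) {
                size_t c = idx3d(i, j, k, nx, ny);
                int on_boundary = (i == 0 || i == nx - 1 ||
                                   j == 0 || j == ny - 1 ||
                                   k == 0 || k == nz - 1);
                if (on_boundary) {
                    u_new[c] = u_old[c];
                } else {
                    u_new[c] = sixth * (u_old[c - 1]  + u_old[c + 1] +
                                        u_old[c - sy] + u_old[c + sy] +
                                        u_old[c - sz] + u_old[c + sz]);
                }
            }
        }
    }
}
```

In a GPU version, the three nested loops would typically be replaced by one thread per grid point, and the choice of which index varies fastest then determines whether neighbouring threads read neighbouring memory locations.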

Practical 4

Application: reduction

CUDA aspects: dynamic shared memory, thread synchronisation

instructions (PDF)
reduction.cu
Makefile
round_up_test.c code to round an integer up to nearest power of 2
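
A simple tree-based reduction that halves the number of active elements at each step is easiest to write when the element count is a power of 2, which is why rounding up is useful here. The sketch below shows one common bit-manipulation way to do this in C. It is not the contents of round_up_test.c; the function name round_up_pow2 is hypothetical, and 32-bit unsigned inputs are assumed.

```c
#include <stdint.h>

/* Round n up to the nearest power of 2 (n itself if it is already one).
   Valid for 1 <= n <= 2^31; n == 0 returns 0. */
uint32_t round_up_pow2(uint32_t n)
{
    n--;            /* so that exact powers of 2 map to themselves */
    n |= n >> 1;    /* copy the highest set bit into every lower bit ... */
    n |= n >> 2;
    n |= n >> 4;
    n |= n >> 8;
    n |= n >> 16;
    n++;            /* ... then add 1 to reach the next power of 2 */
    return n;
}
```

For example, round_up_pow2(5) gives 8 and round_up_pow2(8) gives 8.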

Practical 5

Application: using the CUBLAS and CUFFT libraries

Practical 6

Application: revisiting the simple "hello world" example

CUDA aspects: using g++ for the main code, building libraries, using templates

Practical 7

Application: tri-diagonal equations
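
For readers unfamiliar with tri-diagonal systems, the sketch below shows the standard sequential Thomas algorithm in plain C. It is offered only as a CPU reference for what "solving tri-diagonal equations" means, not as the method used in the practical; the function name thomas_solve and its argument layout are assumptions, and no pivoting is done, so a well-conditioned (for example diagonally dominant) system is assumed.

```c
/* Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i] for i = 0..n-1 (n >= 1).
   a[0] and c[n-1] are not used. c and d are overwritten as workspace,
   and on return d holds the solution x. */
void thomas_solve(int n, const double *a, const double *b, double *c, double *d)
{
    /* Forward elimination: remove the sub-diagonal, normalising each row. */
    if (n > 1) {
        c[0] = c[0] / b[0];
    }
    d[0] = d[0] / b[0];
    for (int i = 1; i < n; i++) {
        double denom = b[i] - a[i] * c[i - 1];
        if (i < n - 1) {
            c[i] = c[i] / denom;
        }
        d[i] = (d[i] - a[i] * d[i - 1]) / denom;
    }

    /* Back substitution: each unknown depends on the one after it. */
    for (int i = n - 2; i >= 0; i--) {
        d[i] -= c[i] * d[i + 1];
    }
}
```

Both loops carry a dependency from one row to the next, which is what makes this algorithm awkward to parallelise directly.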

Practical 8

Application: scan operation and recurrence equations
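
To make the term "scan" concrete, the plain C sketch below computes an inclusive prefix sum in two ways: a sequential loop, and a step-doubling scheme whose inner loop has independent iterations and so resembles how a parallel version might be organised. This is an illustrative CPU sketch, not the code from the practical; the function names scan_sequential and scan_doubling are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Sequential inclusive scan: out[i] = in[0] + in[1] + ... + in[i]. */
void scan_sequential(int n, const int *in, int *out)
{
    int running = 0;
    for (int i = 0; i < n; i++) {
        running += in[i];
        out[i] = running;
    }
}

/* Step-doubling inclusive scan. In the pass with offset d, every element
   adds the value d places to its left; after about log2(n) passes each
   element holds the full prefix sum. The iterations of the inner loop are
   independent of each other. in and out must not overlap.
   Returns 0 on success, -1 if the temporary buffer cannot be allocated. */
int scan_doubling(int n, const int *in, int *out)
{
    if (n <= 0) {
        return 0;
    }
    int *tmp = malloc((size_t)n * sizeof *tmp);
    if (tmp == NULL) {
        return -1;
    }
    memcpy(out, in, (size_t)n * sizeof *out);
    for (int d = 1; d < n; d *= 2) {
        for (int i = 0; i < n; i++) {
            tmp[i] = (i >= d) ? out[i] + out[i - d] : out[i];
        }
        memcpy(out, tmp, (size_t)n * sizeof *out);
    }
    free(tmp);
    return 0;
}
```

For the input 1, 2, 3, 4, 5 both functions produce 1, 3, 6, 10, 15. The step-doubling version does more additions in total than the sequential one, which is the usual trade-off for exposing parallelism.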

Practical 9

Application: pattern matching

Practical 10

Application: auto-tuning

instructions (PDF)
Flamingo auto-tuning software

Practical 11

Application: streams and OpenMP multithreading

Practical 12

Application: more on streams and overlapping computation and communication

Acknowledgements

Many thanks to:

the Mathematical Institute for hosting the lectures and practicals
Oxford's Advanced Research Computing for the GPU servers used in the practicals
