

# Learning outcomes In this stath lecture we will look at CUDA streams and how they can be used to increase performance in GPU computing. . will learn about: . spichronicity between host and device. . Multiple streams and devices. . the we nultiple GPUs.

































# Asynchronous host code

When using asynchronous calls, things to watch out for, and things that can go wrong are:

- · Kernel timing need to make sure it's finished.
- · Could be a problem if the host uses data which is read/written directly by kernel, or transferred by cudaMemcpyAsync().
- cudaDeviceSynchronize() can be used to ensure correctness (similar to syncthreads () for kernel code).



Lecture 6

~







# Page-locked / Pinned memory

Section 3.2.6 of the cuda programming guide:

- To achieve asynchronous behaviour you must use page-locked memory with cudaMemcpyAsync();
- Host memory is usually paged, so run-time system keeps track of where each page is located.
- For higher performance, pages can be fixed (fixed address space, always in RAM), but means less memory available for everything else.
- CUDA uses this for better host <-> GPU bandwidth, and also to hold "device" arrays in host memory.
- Can provide up to 100% improvement in bandwidth
- Page-locked memory is allocated using cudaHostAlloc(), or registered by cudaHostRegister();

Lecture 6



Pinned memory is used as a staging area for transfers from the device to the host. We can avoid the cost of the transfer between pageable and pinned host arrays by directly allocating our host arrays in pinned memory.

https://devblogs.nvidia.com/how-optimize-data-transfers-cuda-cc/

25

# Example use

Use multiple streams to increase performance by overlapping memory communication with compute:

|                                                                                                                  | Stream 0 |   |   |   |   | _ |   |  |
|------------------------------------------------------------------------------------------------------------------|----------|---|---|---|---|---|---|--|
| Kernel Engine                                                                                                    |          |   |   |   | 0 |   |   |  |
| D2H Engine                                                                                                       |          |   |   |   |   |   | 0 |  |
|                                                                                                                  | 1        | 2 | 3 | 4 |   |   |   |  |
| H2D Engine                                                                                                       | 1        | 2 | 3 | 4 |   |   |   |  |
| Kernel Engine                                                                                                    |          | - |   |   |   |   |   |  |
| Kernel Engine<br>D2H Engine                                                                                      |          | 1 | 2 | 3 | 4 |   |   |  |
| and the second |          | Ĩ | 2 | 3 | 4 |   |   |  |





0



### Practical 11 An example is given in practical 11 for those interested, try with the two different flags: cudaStream\_t streams[8]; float \*data[8]; for (int i = 0; i < 8; i++) { cudaStreamCreate(&streams[i]); cudaMalloc(&data[i], N \* sizeof(float)); // launch one worker kernel per stream kernel<<<1, 64, 0, streams[i]>>>(data[i], N); $//\mbox{ do}$ a Memcpy and launch a dummy kernel on default stream cudaMemcpy(d\_data,h\_data,sizeof(float), cudaMemcpyHostToDevice); kernel<<<1, 1>>>(d\_data, 0); cudaDeviceSynchronize(); Lecture 6 30

29





Stream commands Functions useful for synchronisation and timing between streams: cudaEventCreate(event) Creates an "event". cudaEventRecord(event,stream) Puts an event into a stream (by default, stream 0). cudaEventSynchronize(event) CPU waits until event occurs. cudaStreamWaitEvent(stream, event) Stream waits until event occurs (doesn't block the host). cudaEventQuery(event) Check whether event has occurred. cudaEventElapsedTime(time,event1,event2) Times between event1 and event2. Lecture 6 34

34





Lecture 6

36

0

# Multiple devices

If a user is running on **multiple GPUs**, data can go directly between GPUs (**peer – peer**), it doesn't have to go via CPU.

This is the **premise of the NVlink interconnect**, which is much faster than PCIe (900GB/s P2P on Hopper).

 $\mathtt{cudaMemcpy}\left( \right)$  can do direct copy from one GPU's memory to another.

A kernel on one GPU can also read directly from an array in another GPU's memory, or write to it. This even includes the ability to do atomic operations with remote GPU memory.

For more information see Section 6.13, "Peer Device Memory Access" in CUDA Runtime API documentation: https:///docs.nvidia.com/cuda/cuda-runtime-api/ https://kwikibip.org/rews/122/a-lobd-at-nvidias-rvilnk-interconnect-and-the-nvwitch/





# Multi-GPU computing

Multi-GPU computing exists at all scales, from cheaper workstations using PCIe, to more expensive Quadro / Titan products using fewer NVLink, to high-end NVIDIA DGX servers.

### Single workstation / server:

- a big enclosure for good cooling!
- up to 4 high-end cards in 16x PCIe v4 slots up to 16GB/s interconnect.
   2x high-end CPUs.
- 2-3kW power consumption not one for the office!

### NVIDIA DGX H100 Deep Learning server:

- 8 NVIDIA GH100 GPUs, each with 80GB HBM2.
- 2× 56-core Intel Xeons (Platinum 8480C 2.0 GHz).
- 2 TB RAM memory, 8x 3.84TB NVMe.
- 900GB/s NVlink interconnect between the GPUs.
- ~£379,000???



38

37



Lecture 6

# <text><list-item><list-item><list-item><list-item><section-header><list-item><list-item><list-item><list-item><list-item><list-item>

Lecture 6

40













## Loose ends - Compilation

### Prac 6 Makefile:

INC := -I\$(CUDA\_HOME)/include -I. LIB := -L\$(CUDA\_HOME)/lib64 -lcudart FLAGS := --ptxas-options=-v --use\_fast\_math

main.o: main.cpp g++ -c -fPIC -o main.o main.cpp

prac6.o: prac6.cu nvcc prac6.cu -c -o prac6.o \$(INC) \$(FLAGS)

prac6: main.o prac6.o g++ -fPIC -o prac6 main.o prac6.o \$(LIB)

49

# Lecture 6 49 50 Loose ends - Compilation Launch bounds (10.36): kernels. This specifies GPU architecture (in this case sm 80 is for

-maxrregcount=n This asks the compiler to generate code using at most n registers; the compiler may ignore this if it's not possible, but it may also increase register usage up to this limit.

Other useful compiler options:

-arch=sm 80

Ampere A100).

This is less important now since threads can have up to 255 registers, but can be useful in some instances to reduce register pressure and enable more thread blocks to run.



51

Lecture 6

## Loose ends - Compilation

### Prac 6 Makefile to create a library:

INC := -I\$(CUDA)/include -I. LIB := -L\$(CUDA)/lib64 -lcudart FLAGS := --ptxas-options=-v --use\_fast\_math

main.o: main.cpp g++ -c -fPIC -o main.o main.cpp

prac6.a: prac6.cu nvcc prac6.cu -lib -o prac6.a \$(INC) \$(FLAGS)

prac6a: main.o prac6.a g++ -fPIC -o prac6a main.o prac6.a \$(LIB)



Lecture 6