# Profiling & Tuning Applications

CUDA Course

István Reguly



#### Introduction

- Why is my application running slow?
- Work it out on paper
- Instrument code
- Profile it
  - NVIDIA Visual Profiler
    - Works with CUDA, needs some tweaks to work with OpenCL
  - nvprof command line tool, can be used with MPI applications

# Identifying Performance Limiters

- CPU: Setup, data movement
- GPU: Bandwidth, compute or latency limited
- Number of instructions for every byte moved
  - ~3.6:1 on Fermi
  - ~6.4 : 1 on Kepler
- Algorithmic analysis gives a good estimate
- Actual code is likely different
  - Instructions for loop control, pointer math, etc.
  - Memory access patterns
  - How to find out?
    - Use the profiler (quick, but approximate)
    - Use source code modification (takes more work)

## Analysis with Source Code Modification

- Time memory-only and math-only versions
  - Not so easy for kernels with data-dependent control flow
  - Good to estimate time spent on accessing memory or executing instructions
- Shows whether kernel is memory or compute bound
- Put an "if" statement depending on kernel argument around math/mem instructions
  - Use dynamic shared memory to get the same occupancy

## Analysis with Source Code Modification

```
__global__ void kernel(float *a) {
int idx = threadIdx.x + blockDim.x+blockIdx.x;
float my_a;
my_a = a[idx];
for (int i =0; i < 100; i++) my_a = sinf(my_a+i*3.14f);
a[idx] = my_a;
}</pre>
```

# Example scenarios



## **NVIDIA Visual Profiler**

- Launch with "nvvp"
- Collects metrics and events during execution
  - Calls to the CUDA API
  - Overall application:
    - Memory transfers
    - Kernel launches
  - Kernels
    - Occupancy
    - Computation efficiency
    - Memory bandwidth efficiency
- Requires deterministic execution!

# Meet the test setup

- 2D gaussian blur with a 5x5 stencil  $\frac{1}{273}$
- 4096^2 grid

| 1 | 4  | 7  | 4  | 1 |
|---|----|----|----|---|
| 4 | 16 | 26 | 16 | 4 |
| 7 | 26 | 41 | 26 | 7 |
| 4 | 16 | 26 | 16 | 4 |
| 1 | 4  | 7  | 4  | 1 |

# Meet the test setup

- NVIDIA K40
  - GK110B
  - SM 3.5
  - ECC on
  - Graphics clocks at 745MHz, Memory clocks at 3004MHz

#### • CUDA 7.0

nvcc profiling\_lecture.cu -02 -arch=sm\_35 -I. -lineinfo -DIT=0

# Interactive demo of tuning process

# Launch a profiling session



## First look



## The Timeline



# Analysis

Guided



#### 1. CUDA Application Analysis

The guided analysis system walks you through the various analysis stages to help you understand the optimization opportunities in your application. Once you become familiar with the optimization process, you can explore the individual analysis stages in an unguided mode. When optimizing your application it is important to fully utilize the compute and data movement capabilities of the GPU. To do this you should look at your application's overall GPU usage as well as the performance of individual kernels.

#### 🖳 Examine GPU Usage

Determine your application's overall GPU usage. This analysis requires an application timeline, so your application will be run once to collect it if it is not already available.

#### 🖳 Examine Individual Kernels

Determine which kernels are the most performance critical and that have the most opportunity for improvement. This analysis requires utilization data from every kernel, so your application will be run once to collect that data if it is not already available.

#### ■ Delete Existing Analysis Information

If the application has changed since the last analysis then the existing analysis information may be stale and should be deleted before continuing.

Switch to unquided analysis

#### **Unguided**



## Examine Individual Kernels



Lists all kernels sorted by total execution time: the higher the rank the higher the impact of optimisation on overall performance

Initial unoptimised (v0) 8.122ms

#### Utilisation

#### -Results-

#### i Kernel Performance Is Bound By Instruction And Memory Latency

This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of "Tesla I utilization levels indicate that the performance of the kernel is most likely limited by the latency of arithmetic or memory of Achieved compute throughput and/or memory bandwidth below 60% of peak typically indicates latency issues.







Memory throttle -> perform BW analysis

# Memory Bandwidth analysis



L1 cache not used...

# Investigate further...



6-8 transactions per access – something is wrong with how we access memory

Global memory load efficiency 53.3% L2 hit rate 96.7%

## Iteration 1 – turn on L1



Quick & easy step: Turn on L1 cache by using -Xptxas -dlcm=ca

| ✓ Line / File | profiling_lecture.cu - /home/mgiles/ireguly/cuda_course                                                               |
|---------------|-----------------------------------------------------------------------------------------------------------------------|
| 25            | Global Load L2 Transactions/Access = 20, Ideal Transactions/Access = 4 [ 10485760 L2 transactions for 524288 total ex |
| 25            | Global Load L2 Transactions/Access = 18, Ideal Transactions/Access = 4 [ 9437184 L2 transactions for 524288 total exe |
| 25            | Global Load L2 Transactions/Access = 20, Ideal Transactions/Access = 4 [ 10485760 L2 transactions for 524288 total ex |
| 25            | Global Load L2 Transactions/Access = 18, Ideal Transactions/Access = 4 [ 9437184 L2 transactions for 524288 total exe |

Memory unit is utilized, but Global Load efficiency became even worse: 20.5%

| Initial unoptimised (v0) | 8.122ms |
|--------------------------|---------|
| Enable L1                | 6.57ms  |

## Cache line utilization



## Cache line utilization



128 bytes (32 floats) Unit of transaction

Each time a transaction requires more than 1 128B cache line: re-issue

L1 cache enabled:

- -> 128B transactions
- -> 4\*32B to L2

Min 16, Max 32 transactions

# Cache line utilization





128 bytes (32 floats) Unit of transaction

L1 cache enabled:

- -> 128B transactions
- -> 4\*32B to L2

Min 4, Max 8 transactions

## Iteration 2 – 32x2 blocks



Memory utilization decreased 10% Performance almost doubles Global Load Efficiency 50.8%

| $\nabla$ | Line / File |  |
|----------|-------------|--|
|          | 25          |  |
|          | 25          |  |
|          | 25          |  |

profiling\_lecture.cu - /home/mgiles/ireguly/cuda\_course

Global Load L2 Transactions/Access = 8, Ideal Transactions/Access = 4 [ 4194304 L2 transactions for 524288 total exec Global Load L2 Transactions/Access = 7.5, Ideal Transactions/Access = 4 [ 3932160 L2 transactions for 524288 total exec Global Load L2 Transactions/Access = 8, Ideal Transactions/Access = 4 [ 4194304 L2 transactions for 524288 total exec

| Initial unoptimised (v0) | 8.122ms |
|--------------------------|---------|
| Enable L1                | 6.57ms  |
| Blocksize                | 3.4ms   |

# Key takeaway

- Latency/Bandwidth bound
- Inefficient use of memory system and bandwidth
- Symptoms:
  - Lots of transactions per request (low load efficiency)
- Goal:
  - Use the whole cache line
  - Improve memory access patterns (coalescing)
- What to do:
  - Align data, change block size, change data layout
  - Use shared memory/shuffles to load efficiently

#### Stall Reasons



Optimization: Increase the number of threads in each block to increase the number of warps that can execute on each SM. More...

|                  |          | l _ı ı      | l            | l         |          |        |        |       |        |         |            |
|------------------|----------|-------------|--------------|-----------|----------|--------|--------|-------|--------|---------|------------|
| Variable         | Achieved | Theoretical | Device Limit | Grid Size | e:[ 128, | 2048,1 | ](2621 | 44 bl | ocks)E | Block S | Size: [ 3: |
| Occupancy Per SM |          |             |              |           |          |        |        |       |        |         |            |
| Active Blocks    |          | 16          | 16           | 0         | 2 4      | 6      | 8      | 10    | 12     | 14      | 16         |
| Active Warps     | 26.67    | 32          | 64           | 0         | 9 :      | 18 2   | 27 36  | 6     | 45     | 54      | 6534       |
| Active Threads   |          | 1024        | 2048         | 0         | 51       | 2      | 1024   |       | 1536   |         | 2048       |
| Occupancy        | 41.7%    | 50%         | 100%         | 0%        | 25       | 5%     | 50%    |       | 75%    | ó       | 100%       |
| Warps            |          |             |              |           |          |        |        |       |        |         |            |
| Threads/Block    |          | 64          | 1024         | 0         | 25       | 6      | 512    |       | 768    |         | 1024       |
| Warps/Block      |          | 2           | 32           | Ö         | 4 8      | 12     | 16     | 20    | 24     | 28      | 32         |
| Block Limit      |          | 32          | 16           | 0         | 2 4      | 6      | 8      | 10    | 12     | 14      | 16         |



Increase the block size so more warps can be active at the same time.

Kepler: Max 16 blocks per SM Max 2048 threads per SM

# Occupancy – using all "slots"





Increase block size to 32x4

## Iteration 3 – 32x4 blocks



| Initial unoptimised (v0) | 8.122ms |
|--------------------------|---------|
| Enable L1                | 6.57ms  |
| Blocksize                | 3.4ms   |
| Blocksize 2              | 2.36ms  |

# Key takeaway

- Latency bound low occupancy
- Unused cycles, exposed latency
- Symptoms:
  - High execution/memory dependency, low occupancy
- Goal:
  - Better utilise cycles by: having more warps
- What to do:
  - Determine occupancy limiter (registers, block size, shared memory) and vary it

# Improving memory bandwidth

- L1 is fast, but a bit wasteful (128B loads)
  - 8 transactions on average (minimum would be 4)
- Load/Store pipe stressed
  - Any way to reduce the load?
- Texture cache
  - Dedicated pipeline
  - 32 byte loads
  - const \_\_restrict\_\_ \*
  - \_\_ldg()



## Iteration 4 – texture cache



#### Texture Cache

| Reads 65536000 1,382.851 GB/s | Idle Low Medium High Max |
|-------------------------------|--------------------------|
|-------------------------------|--------------------------|

Davica Mamana

| Initial unoptimised (v0) | 8.122ms |
|--------------------------|---------|
| Blocksize 2              | 2.36ms  |
| Texture cache            | 1.53ms  |

# Key takeaway

- Bandwidth bound Load/Store Unit
- LSU overutilised
- Symptoms:
  - LSU pipe utilisation high, others low
- Goal:
  - Better spread the load between other pipes: use TEX
- What to do:
  - Read read-only data through the texture cache
  - const \_\_restrict\_\_ or \_\_ldg()

# Compute analysis





Compute utilization could be higher (~78%)
Lots of Integer & memory instructions, fewer FP
Integer ops have lower throughput than FP
Try to amortize the cost: increase compute per byte

## Instruction Level Parallelism



• Remember, GPU is in-order:

$$a=b+c$$
  $a=b+c$   $d=a+e$   $d=e+f$ 

- Second instruction cannot be issued before first
  - But it can be issued before the first finishes if there is no dependency
- Applies to memory instructions too latency much higher (counts towards stall reasons)

## Instruction Level Parallelism

```
for (j=0; j<2; j++)
 acc+=filter[j]*input[x+j];
tmp=input[x+0]
acc += filter[0]*tmp
tmp=input[x+1]
acc += filter[1]*tmp
```

#pragma unroll can help ILP Create two accumulators Or...

```
for (j=0; j<2; j++) {
  acc0+=filter[j]*input[x+j];
  acc1+=filter[j]*input[x+j+1];
tmp=input[x+0]
           tmp=input[x+0+1]
acc0 += filter[0]*tmp
           aec1 += filter[0]*tmp
tmp=input[x+1]
           tmp=input[x+1+1]
acc0 += filter[1]*tmp
           acc1 += filter[1]*tmp
```

Process 2 points per thread Bonus data re-use (register caching)

# Iteration 5 – 2 points per thread





| Initial unoptimised (v0) | 8.122ms |
|--------------------------|---------|
| Texture cache            | 1.53ms  |
| 2 points                 | 1.07ms  |

# Key takeaway

- Latency bound low instruction level parallelism
- Unused cycles, exposed latency
- Symptoms:
  - High execution dependency, one "pipe" saturated
- Goal:
  - Better utilise cycles by: increasing parallel work per thread
- What to do:
  - Increase ILP by having more independent work, e.g. more than 1 output value per thread
  - #pragma unroll

# Iteration 6 – 4 points per thread



168 GB/s device BW

| Initial unoptimised (v0) | 8.122ms |
|--------------------------|---------|
| 2 points                 | 1.07ms  |
| 4 points                 | 0.95ms  |

## Conclusions

- Iterative approach to improving a code's performance
  - Identify hotspot
  - Find performance limiter, understand why it's an issue
  - Improve your code
  - Repeat
- Managed to achieve a 8.5x speedup
- Shown how NVVP guides us and helps understand what the code does
- There is more it can show...

# Metrics & Events

| evice: Tesla K20c 🔻                            | Device: Tesla K20c  ▼                   |
|------------------------------------------------|-----------------------------------------|
| Metrics Events                                 | Metrics Events                          |
| ✓   Memory                                     | ▽ □ Instruction                         |
| ☐ Requested Global Load Throughput             | ☐ elapsed_cycles_sm                     |
| ☐ Requested Global Store Throughput            | ☐ warps launched                        |
| ☐ Device Memory Read Throughput                | ☐ threads launched                      |
| ☐ Device Memory Write Throughput               | ☐ Instructions executed                 |
| ☐ Global Store Throughput                      | ☐ Instructions issued 1                 |
| ☐ Global Load Throughput                       | ☐ Instructions issued 2                 |
| ☐ Shared Memory Efficiency                     | ☐ thread inst executed                  |
| ☐ Global Memory Load Efficiency                |                                         |
| ☐ Global Memory Store Efficiency               | active cycles                           |
| ☐ Local Memory Overhead                        | active warps                            |
| Requested Non-Coherent Global Load Throughput  | sm cta launched                         |
| Local Memory Load Transactions Per Request     | not_predicated_off_thread_inst_executed |
| Local Memory Store Transactions Per Request    | ▼                                       |
| ☐ Shared Memory Load Transactions Per Request  | ☐ fb subp0 read sectors                 |
| ☐ Shared Memory Store Transactions Per Request | ☐ fb subp1 read sectors                 |
| ☐ Global Load Transactions Per Request         | ☐ fb subp0 write sectors                |