

# Profiling & Tuning Applications

CUDA Course István Reguly

### Introduction

- Why is my application running slow?
- Work it out on paper
- Instrument code
- Profile it
  - NVIDIA Visual Profiler
    - Works with CUDA, needs some tweaks to work with OpenCL
  - nvprof command line tool, can be used with MPI applications

### Identifying Performance Limiters

- CPU: Setup, data movement
- GPU: Bandwidth, compute or latency limited
- Number of instructions for every byte moved
- Algorithmic analysis gives a good estimate
- Actual code is likely different
  - Instructions for loop control, pointer math, etc.
  - Memory access patterns
  - How to find out?
    - Use the profiler (quick, but approximate)
    - Use source code modification (takes more work)

Analysis with Source Code Modification

- Time memory-only and math-only versions
  - Not so easy for kernels with data-dependent control flow
  - Good to estimate time spent on accessing memory or executing instructions
- Shows whether kernel is memory or compute bound
- Put an "if" statement depending on kernel argument around math/mem instructions
  - Use dynamic shared memory to get the same occupancy

#### Analysis with Source Code Modification

\_\_global\_\_ void kernel(float \*a) { int idx = threadIdx.x + blockDim.x+blockIdx.x; float my\_a; my\_a = a[idx]; for (int i =0; i < 100; i++) my\_a = sinf(my\_a+i\*3.14f); a[idx] = my\_a;

> \_\_global\_\_ void kernel(float \*a, int prof) { int idx = threadIdx.x + blockDim.x+blockIdx.x; float my\_a; if (prof & 1) my\_a = a[idx]; if (prof & 2) for (int i =0; i < 100; i++) my\_a = sinf(my\_a+i\*3.14f); if (prof & 1) a[idx] = my\_a;

#### Example scenarios



### **NVIDIA Visual Profiler**

- Collects metrics and events during execution
  - Calls to the CUDA API
  - Overall application:
    - Memory transfers
    - Kernel launches
  - Kernels
    - Occupancy
    - Computation efficiency
    - Memory bandwidth efficiency
  - Source-level profiling
- Requires deterministic execution!

| Meet the test setup                                                                                                                                                                                                                                                                                                                                                      | 1   | 4  | 7  | 4  | 1 |  |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----|----|----|----|---|--|
|                                                                                                                                                                                                                                                                                                                                                                          | 4   | 16 | 26 | 16 | 4 |  |
| • 2D gaussian blur with a 5x5 stencil $\frac{1}{273}$                                                                                                                                                                                                                                                                                                                    | 7   | 26 | 41 | 26 | 7 |  |
| • 4096^2 grid                                                                                                                                                                                                                                                                                                                                                            | 4   | 16 | 26 | 16 | 4 |  |
| 4090° 2 gnu                                                                                                                                                                                                                                                                                                                                                              | 1   | 4  | 7  | 4  | 1 |  |
| global void stencil_v0(float *input, float *output,<br>int sizex, int sizey) {                                                                                                                                                                                                                                                                                           |     |    |    |    |   |  |
| <pre>const int x = blockIdx.x*blockDim.x + threadIdx.x const int y = blockIdx.y*blockDim.y + threadIdx.y if ((x &gt;= sizex-2)    (y &gt;= sizey-2)) return; float accum = 0.0f; for (int i = -2; i &lt; 2; i++) {     for (int j = -2; j &lt; 2; j++) {         accum += filter[i+2][j+2]*input[sizey*(y+j)     (x+i)];     } } output[sizey*y+x] = accum/273.0f;</pre> | + 2 |    |    |    |   |  |
| }                                                                                                                                                                                                                                                                                                                                                                        |     |    |    |    |   |  |

### Meet the test setup

#### • NVIDIA K40

- GK110B
- SM 3.5
- ECC on
- Graphics clocks at 745MHz, Memory clocks at 3004MHz

#### • CUDA 9.0

nvcc profiling\_lecture.cu -O2 -arch=sm\_35 -I. -lineinfo -DIT=0

## Launch a profiling session

| Connection:        | Local                                                                                             | nastions |
|--------------------|---------------------------------------------------------------------------------------------------|----------|
| Toolkit/Script:    | Local Manage con<br>CUDA Toolkit 9.0 (/panfs/pan01/system/software/arcus-b/qpu/cuda/9.0.176/bin/) | Manage   |
| File:              | ./a.out                                                                                           | Browse   |
| Working directory: | Enter working directory [optional]                                                                | Browse   |
| Arguments:         | Enter command-line arguments                                                                      |          |
|                    | Profile child processes                                                                           | [▼]      |
| Environment:       | Name Value                                                                                        | Add      |
|                    |                                                                                                   | Delete   |
|                    |                                                                                                   |          |
|                    |                                                                                                   |          |
|                    |                                                                                                   |          |
|                    |                                                                                                   |          |

### Interactive demo of tuning process



## First look



### The Timeline



### Examine Individual Kernels

| 1 | The fo | rnel Optimization Priorities<br>llowing kernels are ordered by optimization importance based on execution time and achieved occupancy. Optimization of higher ranked kernel:<br>) is more likely to improve performance compared to lower ranked kernels. |
|---|--------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|   | Rank   | Description                                                                                                                                                                                                                                               |
|   | 10     | [ 1 kernel instances ] stencil_v0(float*, float*, int, int)                                                                                                                                                                                               |
|   |        |                                                                                                                                                                                                                                                           |
|   |        |                                                                                                                                                                                                                                                           |
|   |        | rnels sorted by total execution time: the higher the rank the higher the optimisation on overall performance                                                                                                                                              |

Initial unoptimised (v0) 8.25ms

Analysis

|                                                                                                                                                                                                                                                                                                                                                                                                                                                  | Guided                                                                                           |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------|
| 🖬 Analysis 🕱 📮 Console 🔚 Settings 🗔                                                                                                                                                                                                                                                                                                                                                                                                              | Details                                                                                          |
| 🔚 🗄 🔛 🛄 Export PDF Report                                                                                                                                                                                                                                                                                                                                                                                                                        |                                                                                                  |
| 1. CUDA Application Analysis                                                                                                                                                                                                                                                                                                                                                                                                                     |                                                                                                  |
| The guided analysis system walks you through<br>analysis stages to help you understand the or<br>opportunities in your application. Once you be<br>with the optimization process, you can explore<br>analysis stages in an unguided mode. When o<br>application it is important to fully utilize the co<br>data movement capabilities of the GPU. To do<br>look at your application's overail GPU usage as<br>performance of Individual kernels. | otimization<br>come familiar<br>the individual<br>ptimizing your<br>mpute and<br>this you should |
| 👪 Examine GPU Usage                                                                                                                                                                                                                                                                                                                                                                                                                              |                                                                                                  |
| Determine your application's overall GPU usage. Ti<br>requires an application timeline, so your applicatio<br>once to collect it if it is not already available.                                                                                                                                                                                                                                                                                 |                                                                                                  |
| 🖳 Examine Individual Kernels                                                                                                                                                                                                                                                                                                                                                                                                                     |                                                                                                  |
| Determine which kernels are the most performanc<br>have the most opportunity for improvement. This a<br>utilization data from every kernel, so your applicati<br>once to collect that data if it is not already available                                                                                                                                                                                                                        | analysis requires<br>on will be run                                                              |
| 🛺 Delete Existing Analysis Informa                                                                                                                                                                                                                                                                                                                                                                                                               | ation                                                                                            |
| If the application has changed since the last analys<br>existing analysis information may be stale and sho<br>before continuing.                                                                                                                                                                                                                                                                                                                 |                                                                                                  |
| 👞 Switch to unguided analysis                                                                                                                                                                                                                                                                                                                                                                                                                    | s                                                                                                |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                  |                                                                                                  |

| Unguided                                   |             |
|--------------------------------------------|-------------|
| 🖩 Analysis 🕱 🛛 📮 Console 🗔 Settings 🗔 Deta | ails        |
| 🗄 🗄 Reset All 🛄 Analyze All                |             |
| VoteAnyKernel1(unsigned int*, unsigned int | *, int)     |
| Kernel Performance Limiter                 | ⊘           |
| Kernel Latency                             | <u>II</u> 🔍 |
| Kernel Compute                             | ⊘           |
| Kernel Memory                              | <u>ii</u> 📀 |
| Global Memory Access Pattern               | <u>ii</u> 📀 |
| Shared Memory Access Pattern               | <u>ii</u> 📀 |
| Divergent Execution                        | <u>_</u>    |
| Kernel Profile                             | <u>II</u>   |
| Application                                |             |
| Data Movement And Concurrency              | <b>o</b>    |
|                                            |             |

### Utilisation – Warp Issue Efficiency & Pipe Utilisation

#### i Kernel Performance Is Bound By Instruction And Memory Latency

This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of "Tesla + utilization levels indicate that the performance of the kernel is most likely limited by the latency of arithmetic or memory o Achieved compute throughput and/or memory bandwidth below 60% of peak typically indicates latency issues.



### Latency analysis



#### Memory Bandwidth analysis

Result

|                   | Transactions | Bandwidth    |      |     | Utilization |        |     |          |
|-------------------|--------------|--------------|------|-----|-------------|--------|-----|----------|
| 1/Shared Memory   |              |              |      |     |             |        |     |          |
| _ocal Loads       | 0            | 0 B/s        |      |     |             |        |     |          |
| _ocal Stores      | 0            | 0 B/s        |      |     |             |        |     |          |
| Shared Loads      | 0            | 0 B/s        |      |     |             |        |     |          |
| Shared Stores     | 0            | 0 B/s        |      |     |             |        |     |          |
| Global Loads      | 40894464     | 248.782 GB/s | 4    |     |             |        |     | L1 cache |
| Global Stores     | 2621440      | 16.585 GB/s  |      |     |             |        |     |          |
| Atomic            | 0            | 0 B/s        |      |     |             |        |     | not used |
| 1/Shared Total    | 43515904     | 265.367 GB/s | Idle | Low | Medium      | High   | Max | not usea |
| 2 Cache           |              |              |      |     | _           |        |     |          |
| 1 Reads           | 62914560     | 248.782 GB/s |      |     |             |        |     |          |
| .1 Writes         | 4194304      | 16.585 GB/s  |      |     |             |        |     |          |
| exture Reads      | 0            | 0 B/s        |      | /   |             |        |     |          |
| Atomic            | 0            | 0 B/s        |      |     |             |        |     |          |
| Noncoherent Reads | 0            | 0 B/s        |      |     |             |        |     |          |
| otal              | 67108864     | 265.367 GB/s | Idle | Low | Medium      | High   | Max |          |
| exture Cache      |              |              |      |     |             |        |     |          |
| Reads             | 0            | 0 B/s        | Idle | Low | Medium      | High   | Мах |          |
| Device Memory     |              |              |      |     |             |        |     |          |
| Reads             | 3756909      | 14.856 GB/s  |      |     |             |        |     |          |
| Writes            | 2904475      | 11.485 GB/s  |      |     |             |        |     |          |
| Total             | 6661384      | 26.341 GB/s  | Idle | Low | Medium      | High   | Max |          |
| ECC Overhead      | 2451525      | 9.694 GB/s   |      |     |             | . ngri |     |          |

### Investigate further...

| Unguide                                    | ed            |                                                                                                                                       |  |  |  |  |  |  |  |
|--------------------------------------------|---------------|---------------------------------------------------------------------------------------------------------------------------------------|--|--|--|--|--|--|--|
| 🕞 Analysis 🕅 🔄 Details 📮 Console 🗔 Setting | IS            |                                                                                                                                       |  |  |  |  |  |  |  |
| 🗄 🖬 🚹 🖪 Reset All 📠 Analyze All            | Results       |                                                                                                                                       |  |  |  |  |  |  |  |
| stencil_v0(float*, float*, int, int)       | -             | emory Alignment and Access Pattern<br>lwidth is used most efficiently when each global memory load and store has proper alignment and |  |  |  |  |  |  |  |
| Kernel Performance Limiter 🛛 🤣             |               | Optimization: Select each entry below to open the source code to a global load or store within the kernel with                        |  |  |  |  |  |  |  |
| Kernel Latency                             | access patter | access pattern. For each load or store improve the alignment and access pattern of the memory access.                                 |  |  |  |  |  |  |  |
| Kernel Compute 🥥                           | ▼ Line / File | profiling_lecture.cu - /home/mgiles/ireguly/cuda_course                                                                               |  |  |  |  |  |  |  |
| Kernel Memory                              | 25            | Global Load L2 Transactions/Access = 8, Ideal Transactions/Access = 4 [ 4194304 L2 transacti                                          |  |  |  |  |  |  |  |
|                                            | 25            | Global Load L2 Transactions/Access = 6, Ideal Transactions/Access = 4 [ 3145728 L2 transacti                                          |  |  |  |  |  |  |  |
| Global Memory Access Pattern 🛛 📀           | 25            | Global Load L2 Transactions/Access = 6, Ideal Transactions/Access = 4 [ 3145728 L2 transacti                                          |  |  |  |  |  |  |  |
|                                            | 25            | Global Load L2 Transactions/Access = 8, Ideal Transactions/Access = 4 [ 4194304 L2 transacti                                          |  |  |  |  |  |  |  |
| Shared Memory Access Pattern 🛛 🤡           | 25            | Global Load L2 Transactions/Access = 8, Ideal Transactions/Access = 4 [ 4194304 L2 transacti                                          |  |  |  |  |  |  |  |
| Divergent Execution                        | 25            | Global Load L2 Transactions/Access = 8, Ideal Transactions/Access = 4 [ 4194304 L2 transacti                                          |  |  |  |  |  |  |  |
| Divergent Execution                        | 25            | Global Load L2 Transactions/Access = 8, Ideal Transactions/Access = 4 [ 4194304 L2 transacti                                          |  |  |  |  |  |  |  |

6-8 transactions per access – something is wrong with how we access memory

Global memory load efficiency 53.3% L2 hit rate 96.7%

#### Iteration 1 – turn on L1



25 Global Load L2 Transactions/Access = 20, Ideal Transactions/Access = 4 [ 10485760 L2 transactions for 524288 total ex

25 Global Load L2 Transactions/Access = 18, Ideal Transactions/Access = 4 [ 9437184 L2 transactions for 524288 total exe

Memory unit is utilized, but Global Load efficiency became even worse: 20.5%

| Enable L1 6.57ms |
|------------------|

### Cache line utilization



### Cache line utilization



128 bytes (32 floats) Unit of transaction

Each time a transaction requires more than 1 128B cache line: re-issue

L1 cache enabled:

-> 128B transactions

-> 4\*32B to L2

Min 16, Max 32 transactions



#### Iteration 2 – 32x2 blocks



Memory utilization decreased 10% Performance almost doubles Global Load Efficiency 50.8%

▼ Line / File profiling\_lecture.cu - /home/mgiles/ireguly/cuda\_course

25

- 25 Global Load L2 Transactions/Access = 8, Ideal Transactions/Access = 4 [ 4194304 L2 transactions for 524288 total exec 25 Global Load L2 Transactions/Access = 7.5, Ideal Transactions/Access = 4 [ 3932160 L2 transactions for 524288 total exec
  - Global Load LZ transactions/Access = 7.5, Ideal transactions/Access = 4 [ 3932100 L2 transactions for 524288 total ex

| Global Load L2 Transactions/Access = 8, Ideal Transactions/Access = 4 [ 4194304 L2 | transactions for 524288 total exec |
|------------------------------------------------------------------------------------|------------------------------------|
|------------------------------------------------------------------------------------|------------------------------------|

| 8.25ms |        |
|--------|--------|
| 6.57ms |        |
| 3.4ms  |        |
|        | 6.57ms |

### Key takeaway

- Latency/Bandwidth bound
- Inefficient use of memory system and bandwidth
- Symptoms:
  - Lots of transactions per request (low load efficiency)
- Goal:
  - Use the whole cache line
  - Improve memory access patterns (coalescing)
- What to do:
  - Align data, change block size, change data layout
  - Use shared memory/shuffles to load efficiently

Ontimization: Increase the number of threads in each block to increase the number of warps that can execute on each SM

#### Latency analysis



#### Latency analysis

| Variable         | Achieved | Theoretical | Device Limit | Grid Si  | ze:[ | 128,20  | 48,1]  | (2621 | 44 bl | ocks)E | Block S | Size:[3 |
|------------------|----------|-------------|--------------|----------|------|---------|--------|-------|-------|--------|---------|---------|
| Occupancy Per SM |          |             |              |          |      |         |        |       |       |        |         |         |
| Active Blocks    |          | 16          | 16           | 0        | 2    | 4       | 6      | 8     | 10    | 12     | 14      | 16      |
| Active Warps     | 26.67    | 32          | 64           | 0        | 9    | 18      | 27     | 7 30  | 6     | 45     | 54      | 6534    |
| Active Threads   |          | 1024        | 2048         | 0 512 10 |      | 1024    | 4 1536 |       | 6     | 2048   |         |         |
| Occupancy        | 41.7%    | 50%         | 100%         | 0% 25% 5 |      | 50% 75% |        | 6     | 100%  |        |         |         |
| Warps            |          |             |              |          |      |         |        |       |       |        |         |         |
| Threads/Block    |          | 64          | 1024         | 0        |      | 256     |        | 512   |       | 768    |         | 1024    |
| Warps/Block      |          | 2           | 32           | 0        | 4    | 8       | 12     | 16    | 20    | 24     | 28      | 32      |
| Block Limit      |          | 32          | 16           |          | 2    | 4       | 6      | 8     | 10    | 12     | 14      | 16      |

## Latency analysis



Increase the block size so more warps can be active at the same time.

Kepler: Max 16 blocks per SM Max 2048 threads per SM



### Occupancy – using all "slots"



### Key takeaway

- Latency bound low occupancy
- Unused cycles, exposed latency
- Symptoms:
  - High execution/memory dependency, low occupancy
- Goal:
  - Better utilise cycles by: having more warps
- What to do:
  - Determine occupancy limiter (registers, block size, shared memory) and vary it

### Iteration 3 – 32x4 blocks



### Improving memory bandwidth

- L1 is fast, but a bit wasteful (128B loads)
  - 8 transactions on average (minimum would be 4)

High

Med

Load/Store

Arithmetic

Control-Floy

Textur

Utilization Level

- Load/Store pipe stressed
  - Any way to reduce the load?
- Texture cache
  - Dedicated pipeline
  - 32 byte loads
  - const \_\_restrict\_\_ \*
  - \_\_ldg()

#### Iteration 4 – texture cache





Compute utilization could be higher (~78%) Lots of Integer & memory instructions, fewer FP Integer ops have lower throughput than FP Try to amortize the cost: increase compute per byte

### Key takeaway

- Bandwidth bound Load/Store Unit
- LSU overutilised
- Symptoms:
  - LSU pipe utilisation high, others low
- Goal:
  - Better spread the load between other pipes: use TEX
- What to do:
  - Read read-only data through the texture cache
  - const \_\_restrict\_\_ or \_\_ldg()

#### Instruction Level Parallelism



• Remember, GPU is in-order:



- Second instruction cannot be issued before first
  - But it can be issued before the first finishes if there is no dependency
- Applies to memory instructions too latency much higher (counts towards stall reasons)

### Instruction Level Parallelism

| <pre>for (j=0;j&lt;2;j++)   acc+=filter[j]*input[x+j];</pre> |
|--------------------------------------------------------------|
| tmp=input[x+0]                                               |
| acc += filter[0]*tmp                                         |
| tmp=input[x+1]                                               |
| acc += filter[1]*tmp                                         |

#pragma unroll can help ILP Create two accumulators Or...



Process 2 points per thread Bonus data re-use (register caching)

### Iteration 5 – 2 points per thread



| Initial unoptimised (v0) | 8.25ms |  |
|--------------------------|--------|--|
| Texture cache            | 1.53ms |  |
| 2 points                 | 1.07ms |  |

### Key takeaway

- Latency bound low instruction level parallelism
- Unused cycles, exposed latency
- Symptoms:
  - High execution dependency, one "pipe" saturated
- Goal:
  - Better utilise cycles by: increasing parallel work per thread
- What to do:
  - Increase ILP by having more independent work, e.g. more than 1 output value per thread
  - #pragma unroll

### Iteration 6 – 4 points per thread



| Initial unoptimised (v0) | 8.25ms |  |
|--------------------------|--------|--|
| 2 points                 | 1.07ms |  |
| 4 points                 | 0.95ms |  |

### Checklist

- cudaDeviceSynchronize()
  - Most API calls (e.g. kernel launch) are asynchronous
  - Overhead when launching kernels
  - Get rid of cudaDeviceSynchronize() to hide this latency
  - Timing: events or callbacks CUDA 5.0+
- Cache config 16/48 or 48/16 kB L1/shared (default is 48k shared!) on Kepler
  - cudaSetDeviceCacheConfig
  - cudaFuncSetCacheConfig
  - Check if shared memory usage is a limiting factor

### Checklist

- Occupancy
  - Max 2048 threads or 16 blocks per SM on Kepler
  - Limited amount of registers and shared memory
    - Max 255 registers/thread, rest is spilled to global memory
    - You can explicitly limit it (-maxregcount=xx)
    - 48kB/16kB shared/L1: don't forget to set it
  - Visual Profiler tells you what is the limiting factor
  - In some cases though, it is faster if you don't maximise it (see Volkov paper) -> Autotuning!

### Verbose compile

• Add –Xptxas=-v

ptxas info : Compiling entry function '\_Z10fem\_kernelPiS\_' for 'sm\_20' ptxas info : Function properties for \_Z10fem\_kernelPiS\_ 856 bytes stack frame, 980 bytes spill stores, 1040 bytes spill loads ptxas info : Used 63 registers, 96 bytes cmem[0]

Check profiler figures for best occupancy

### Checklist

- Precision mix (e.g. 1.0 vs 1.0f) cuobjdump
  - F2F.F64.F32 (6\* the cost of a multiply)
  - IEEE standard: always convert to higher precision
  - Integer multiplications are now expensive (6\*)
- cudaMemcpy
  - Introduces explicit synchronisation, high latency
  - Is it necessary?
    - May be cheaper to launch a kernel which immediately exits
  - Could it be asynchronous? (Pin the memory!)

### Auto-tuning

- Several parameters that affect performance
  - Block size
  - Amount of work per block
  - Application specific
- Which combination performs the best?
- Auto-tuning with Flamingo
  - #define/read the sizes, recompile/rerun combinations

### Auto-tuning Case Study

- Thread cooperation on sparse matrix-vector product
  - Multiple threads doing partial dot product on the row
  - Reduction in shared memory
- Auto-tune for different matrices
  - Difficult to predict caching behavior
  - Develop a heuristic for cooperation vs. average row length

### Autotuning Case Study





- Iterative approach to improving a code's performance
  - Identify hotspot
  - Find performance limiter, understand why it's an issue
  - Improve your code
  - Repeat
- Managed to achieve a 8.5x speedup
- Shown how NVVP guides us and helps understand what the code does
- There is more it can show...

References: C. Angerer, J. Demouth, "CUDA Optimization with NVIDIA Nsight Eclipse Edition", GTC 2015



## Rapid code development with Thrust

#### Thrust

- Open High-Level Parallel Algorithms Library
- Parallel Analog of the C++ Standard Template Library (STL)
  - Vector containers
  - Algorithms
- Comes with the toolkit
- Productive way to use CUDA

#### Example

#include <thrust/host\_vector.h>
#include <thrust/device\_vector.h>
#include <thrust/sort.h>
#include <cstdlib>

#### int main(void)

{
 // generate 32M random numbers on the host
 thrust::host\_vector<int> h\_vec(32 << 20);
 thrust::generate(h\_vec.begin(), h\_vec.end(), rand);</pre>

// transfer data to the device
thrust::device\_vector<int> d\_vec = h\_vec;

// sort data on the device
thrust::sort(d\_vec.begin(), d\_vec.end());

// transfer data back to host
thrust::copy(d\_vec.begin(), d\_vec.end(), h\_vec.begin());

return 0;

3

### Productivity

- Containers
  - host\_vector
  - device\_vector
- Memory management
  - Allocation, deallocation
  - Transfers
- Algorithm selection
  - Location is implicit

// allocate host vector with two elements
thrust::host\_vector<int> h\_vec(2);

// copy host data to device memory
thrust::device\_vector<int> d\_vec = h\_vec;

// write device values from the host
d\_vec[0] = 27;
d\_vec[1] = 13;

// read device values from the host
int sum = d\_vec[0] + d\_vec[1];
// invoke algorithm on device
thrust::sort(d\_vec.begin(), d\_vec.end());

### Productivity

- Large set of algorithms
  - ~100 functions
  - CPU, GPU

| • [ | =le> | kib | le |
|-----|------|-----|----|
|-----|------|-----|----|

- C++ templates
- User-defined types
- User-defined operators

| Algorithm     | Description                               |
|---------------|-------------------------------------------|
| reduce        | Sum of a sequence                         |
| find          | First position of a value in a sequence   |
| mismatch      | First position where two sequences differ |
| count         | Number of instances of a value            |
| inner_product | Dot product of two sequences              |
| merge         | Merge two sorted sequences                |

### Portability

- Implementations
  - CUDA C/C++
  - Threading Building Blocks
  - OpenMP
  - Interoperable with anything CUDA based
- Recompile
- Mix backends

nvcc -DTHRUST\_DEVICE\_SYSTEM=THRUST\_HOST\_SYSTEM\_OMP

thrust::omp::vector<float> my\_omp\_vec(100); thrust::cuda::vector<float> my\_cuda\_vec(100);

### Interoperability

- Thrust containers and raw pointers
  - Use container in CUDA kernel

• Use a device pointer in thrust algorithms (not a vector though, no begin(), end(), resize() etc.)

int \*dev\_ptr; cudaMalloc((void\*\*)&dev\_ptr, 100\*sizeof(int));

thrust::device\_ptr<int> dev\_ptr\_thrust(dev\_ptr);
thrust::fill(dev\_ptr\_thrust, dev\_ptr\_thrust+100, 0);

## Thrust

- Constantly evolving
- Reliable comes with the toolkit, tested every day with unit tests
- Performance specialised implementations for different hardware
- Extensible allocators, back-ends, etc.





# Thrust documentation

http://thrust.github.io/doc/modules.html

