Mike Giles - 3D Black-Scholes on NVIDIA GPUs

3D Black-Scholes on NVIDIA GPUs

This project was based on previous work on 1D Black-Scholes solvers, but this is now for a 3D discretisation on a sufficiently large grid so that only one option can be computed at a time. At least one kernel is needed per timestep, so that each thread block can obtain neigbouring data from other thread blocks from the previous timestep.

A paper presented at WHPCF'14 (Workshop on HPC for Finance, 2014) is available here.



A general second order discretisation of a 3D convection-diffusion equation can involve a 27-point 3x3x3 stencil, the central point plus the neigbours in each of the three coordinate directions. The PDE being solved here corresponds to 3-factor Geometric Brownian Motion with positive correlation between the three driving Brownian motions -- this leads to a discretisation with a 13-point stencil.

The big concern in implementation is memory bandwidth, primarily the bandwidth from device memory to either texture cache (for read-only accesses) or L1 cache, but also the bandwidth between texture cache and the SMX cores. On the K40, the first of these is about 280GB/s while the second is roughly 1.2TB/s. Hence, if there is a lot of data reuse within the texture cache (i.e. a data item read in from device memory is used repeatedly from the texture cache) then it is the latter which will be the bottleneck.

The compiler will optimise away a lot of the floating point operations by pre-computing terms and saving them in registers. There is also a significant number of integer additions for calculating array indices and 64-bit addresses. Nevertheless, except for double precision on the GTX670, it is not clear whether or not the total compute load is the limiting feature. A back-of-the-envelope estimate is that there are 15 FMAs, 15 FMULs/FADDs and 20-40 integer additions per element, whereas there is 1 read and 1 write from/to device memory, and either 7 or 13 reads from texture cache.


Comments on different kernels:


Results

The table below gives execution times in milliseconds for both single precision and double precision computations on a mid-range GTX670 consumer graphics card, and a high-end K20c HPC card. The timing is for the solution of 1 problem on a grid of size 256x256x256, with 500 timesteps for the explicit solvers, and 100 timesteps for the implicit solvers. It also gives two measures of performance, GFlops and GB/s, both as measured by the visual profiler, nvprof.

GTX670 (CUDA 5.5) K20c (CUDA 5.5) K40 (CUDA 6.0)
single prec. double prec. single prec. double prec. single prec. double prec.
msec GFlops GB/s msec GFlops GB/s msec GFlops GB/s msec GFlops GB/s msec GFlops GB/s msec GFlops GB/s
explicit1 2349 - - 5890 - - 1027 - - 1803 - - 747 597 100 1200 367 127
explicit2 1527 - - 5689 - - 756 - - 1396 - - 600 760 132 923 487 144
explicit3 962 - - 7130 - - 793 - - 1363 - - 552 829 151 1246 346 137
implicit2 1105 - - 4384 - - 656 - - 1503 - - 505 360 130 921 235 139

Notes:


Licensing

The codes are freely available to all under a GPL license. They are also provided to NVIDIA under a permissive BSD license -- anyone else requiring a more permissive license for commercial purposes should contact me.


EMail

Mike.Giles@maths.ox.ac.uk