This project builds on previous work on 1D Black-Scholes solvers,
but here the discretisation is 3D, on a grid large enough that only
one option is computed at a time.
At least one kernel launch is needed per timestep, so that each thread
block can obtain neighbouring data computed by other thread blocks in
the previous timestep.
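The overall structure is therefore a host-side loop with (at least) one launch per
timestep; a minimal sketch, assuming ping-ponged old/new device arrays and
illustrative launch parameters and argument lists (not the actual ones in the code):

  // one kernel launch per timestep: the implicit synchronisation between
  // launches guarantees that every thread block sees its neighbours'
  // results from step n before computing step n+1
  for (int n=0; n<N; n++) {
    explicit1<<<grid, block>>>(d_u_new, d_u_old, NX, NY, NZ);   // interior update
    bc1<<<grid_bc, block_bc>>>(d_u_new, NX, NY, NZ);            // boundary extrapolation
    float *tmp = d_u_old; d_u_old = d_u_new; d_u_new = tmp;     // ping-pong the arrays
  }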
A paper describing this work was presented at WHPCF'14 (Workshop on
High Performance Computational Finance, 2014).
A general second order discretisation of a 3D convection-diffusion
equation can involve a 27-point 3x3x3 stencil: the central point
plus all neighbours which differ by at most one grid point in each
of the three coordinate directions.
The PDE being solved here corresponds to 3-factor Geometric Brownian
Motion with positive correlation between the three driving Brownian
motions -- this leads to a discretisation with a 13-point stencil.
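For reference, a sketch of the read pattern this corresponds to, assuming that
with positive correlation only the (+,+)/(-,-) diagonal pair in each coordinate
plane is needed for the mixed derivatives (the offsets are in (i,j,k) index space
and the layout is illustrative):

  // 13-point stencil: centre, the 6 axis neighbours for the second
  // derivatives, plus one diagonal pair per coordinate plane for the
  // three mixed-derivative (correlation) terms
  const int stencil[13][3] = {
    { 0, 0, 0},                          // centre
    {-1, 0, 0}, { 1, 0, 0},              // i-direction
    { 0,-1, 0}, { 0, 1, 0},              // j-direction
    { 0, 0,-1}, { 0, 0, 1},              // k-direction
    {-1,-1, 0}, { 1, 1, 0},              // i-j mixed derivative
    {-1, 0,-1}, { 1, 0, 1},              // i-k mixed derivative
    { 0,-1,-1}, { 0, 1, 1}               // j-k mixed derivative
  };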
The main concern in the implementation is memory bandwidth, primarily
the bandwidth from device memory to either texture cache (for read-only
accesses) or L1 cache, but also the bandwidth between texture cache
and the SMX cores. On the K40, the first of these is about 280GB/s
while the second is roughly 1.2TB/s. Hence, if there is a lot of data
reuse within the texture cache (i.e. a data item read in from device
memory is used repeatedly from the texture cache) then it is the
latter which will be the bottleneck.
The compiler will optimise away a lot of the floating point operations
by pre-computing terms and saving them in registers. There is also a
significant number of integer additions for calculating array indices
and 64-bit addresses.
Nevertheless, except for double precision on the GTX670, it is not
clear whether the total compute load is the limiting factor.
A back-of-the-envelope estimate is that there are 15 FMAs,
15 FMULs/FADDs and 20-40 integer additions per grid point, compared
with 1 read and 1 write from/to device memory and either 7 or 13 reads
from the texture cache.
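As a rough illustration of why the texture cache bandwidth matters (assuming a
256^3 grid in single precision, the 13-point stencil, and the K40 figures above;
order-of-magnitude only):

  device memory traffic per timestep:  2  x 4 bytes x 256^3 ~ 0.13 GB  ->  0.13 GB / 280 GB/s  ~ 0.5 ms
  texture cache traffic per timestep:  13 x 4 bytes x 256^3 ~ 0.87 GB  ->  0.87 GB / 1200 GB/s ~ 0.7 ms

So once stencil reuse keeps device memory traffic down to one read and one write
per point, it is the texture cache side which is closer to being the bottleneck.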
BS_3D_timing.cu -- CUDA code testing
kernels for both explicit and (in the future) implicit time-marching,
with performance tests in both single and double precision
utilities.h -- header file with some utility
routines for reading / writing / incrementing device arrays using
array transposition through shared memory
trid.h -- same header file used by the
BS_1D codes
transpose.pdf -- some notes explaining
shared memory indices used for on-the-fly data transposition
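The transposition mentioned above follows the standard shared-memory tiling
pattern; a minimal sketch (the 32x32 tile, the padding and the kernel name are
illustrative, not necessarily the choices made in utilities.h):

  // launched with 32x32 thread blocks; each block stages a 32x32 tile in
  // shared memory, and the extra column of padding avoids bank conflicts
  // when the tile is read back column-wise
  __global__ void transpose32(float *out, const float *in, int NX, int NY)
  {
    __shared__ float tile[32][33];

    int x = blockIdx.x*32 + threadIdx.x;
    int y = blockIdx.y*32 + threadIdx.y;
    if (x < NX && y < NY) tile[threadIdx.y][threadIdx.x] = in[y*NX + x];

    __syncthreads();

    // write the transposed tile out with coalesced stores
    x = blockIdx.y*32 + threadIdx.x;
    y = blockIdx.x*32 + threadIdx.y;
    if (x < NY && y < NX) out[y*NY + x] = tile[threadIdx.x][threadIdx.y];
  }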
Comments on different kernels:
bc1 is a linear extrapolation boundary condition used by all of the solvers.
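That is, each boundary value is extrapolated linearly from the two nearest
interior points along the same grid line; for example (index names illustrative):

  // at the i=0 face, for each (j,k) boundary line
  u[ind0] = 2.0f*u[ind0+1] - u[ind0+2];   // ind0 = index of point (0,j,k)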
explicit1 is a relatively simple implementation in which each thread handles
a line of grid points in the k-direction, relying on the use of the texture cache
to achieve data reuse.
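A sketch of that structure, with constant coefficients standing in for the real
grid-dependent Black-Scholes terms (kernel name and argument list are
illustrative, not the actual code):

  __global__ void explicit1_sketch(float *u2, const float * __restrict__ u1,
                                   float c0, float ci, float cj, float ck,
                                   float cij, float cik, float cjk,
                                   int NX, int NY, int NZ)
  {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    int j = blockIdx.y*blockDim.y + threadIdx.y;
    if (i<1 || i>NX-2 || j<1 || j>NY-2) return;      // interior points only

    long str = (long)NX*NY;                          // plane stride (64-bit)

    for (int k=1; k<NZ-1; k++) {
      long ind = i + j*(long)NX + k*str;             // 64-bit address arithmetic

      // all 13 loads go through the read-only (texture) cache via __ldg
      u2[ind] = c0 *  __ldg(&u1[ind])
              + ci *( __ldg(&u1[ind-1])      + __ldg(&u1[ind+1])      )
              + cj *( __ldg(&u1[ind-NX])     + __ldg(&u1[ind+NX])     )
              + ck *( __ldg(&u1[ind-str])    + __ldg(&u1[ind+str])    )
              + cij*( __ldg(&u1[ind-1-NX])   + __ldg(&u1[ind+1+NX])   )
              + cik*( __ldg(&u1[ind-1-str])  + __ldg(&u1[ind+1+str])  )
              + cjk*( __ldg(&u1[ind-NX-str]) + __ldg(&u1[ind+NX+str]) );
    }
  }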
explicit2 reduces the number of loads from the texture cache by using registers
to store and then reuse some of the values from the previous plane k when moving
on to plane k+1.
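The register reuse in the k-direction looks roughly like this (again illustrative,
and showing only the axis terms of the stencil to keep the sketch short):

  __global__ void explicit2_sketch(float *u2, const float * __restrict__ u1,
                                   float c0, float ci, float cj, float ck,
                                   int NX, int NY, int NZ)
  {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    int j = blockIdx.y*blockDim.y + threadIdx.y;
    if (i<1 || i>NX-2 || j<1 || j>NY-2) return;

    long str = (long)NX*NY;
    long ind = i + j*(long)NX;             // element (i,j,0)

    // keep the k-axis values in registers so that stepping from plane k
    // to plane k+1 needs only one new load on that axis instead of three
    float u_km1 = __ldg(&u1[ind]);
    float u_k   = __ldg(&u1[ind + str]);
    float u_kp1;

    for (int k=1; k<NZ-1; k++) {
      ind  += str;                         // now points at (i,j,k)
      u_kp1 = __ldg(&u1[ind + str]);       // the only new k-axis load

      // axis terms only; the diagonal terms of the 13-point stencil are
      // handled in the same way in the real kernel
      u2[ind] = c0*u_k
              + ci*( __ldg(&u1[ind-1])  + __ldg(&u1[ind+1])  )
              + cj*( __ldg(&u1[ind-NX]) + __ldg(&u1[ind+NX]) )
              + ck*( u_km1 + u_kp1 );

      u_km1 = u_k;                         // rotate the registers for the next plane
      u_k   = u_kp1;
    }
  }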
explicit3 is based on explicit2 but eliminates a lot of the integer address
calculations by using float2/double2 variables. This gives a significant performance
improvement in single precision on the GTX670, but not on the K20 which (unlike
the GTX670) can load the data through the read-only texture cache.
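The essence of the float2 trick is that one 64-bit load and one index calculation
now serve two neighbouring grid points; a hypothetical helper for the i-direction
terms, assuming the row is accessed as float2 pairs (not code from the repository):

  // "ind" indexes float2 elements, so each call covers grid points 2i and 2i+1
  __device__ float2 i_direction_terms(const float2 * __restrict__ u1v,
                                      long ind, float ci)
  {
    float2 w = __ldg(&u1v[ind-1]);   // points (2i-2, 2i-1)
    float2 c = __ldg(&u1v[ind  ]);   // points (2i  , 2i+1)
    float2 e = __ldg(&u1v[ind+1]);   // points (2i+2, 2i+3)

    // neighbours of c.x are w.y and c.y; neighbours of c.y are c.x and e.x
    return make_float2( ci*(w.y + c.y), ci*(c.x + e.x) );
  }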
implicit1 will use one kernel to construct the tridiagonal matrices in each
coordinate direction and define the r.h.s., and will then call Jeremy Appleyard's
batched tridiagonal solver to perform the tridiagonal calculations -- this version has not yet been finished.
implicit2 reduces the bandwidth requirements by generating the tridiagonal
matrices on-the-fly. This requires more flops but a lot less communication,
and so wins overall.
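A hedged sketch of what "on-the-fly" means here, using a generic constant-coefficient
convection-diffusion term u_t = D u_xx - v u_x purely for illustration (the real
coefficients vary with the grid point through the drift, volatility and correlation
terms):

  // tridiagonal coefficients for one implicit sweep, recomputed per element
  // instead of being stored in device memory
  __device__ void tridiag_coeffs(float D, float v, float dx, float dt,
                                 float *a, float *b, float *c)
  {
    float lam = dt*D/(dx*dx);        // diffusion number
    float gam = dt*v/(2.0f*dx);      // convection number

    *a = -(lam + gam);               // coefficient of u[j-1]
    *b = 1.0f + 2.0f*lam;            // coefficient of u[j]
    *c = -(lam - gam);               // coefficient of u[j+1]
  }

Recomputing a, b and c like this inside the solver avoids streaming three extra
arrays per unknown through device memory, which is where the bandwidth saving
comes from.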
Results
The table below gives execution times in milliseconds for both single
precision and double precision computations on a mid-range GTX670
consumer graphics card, and on high-end K20c and K40 HPC cards. The timing is
for the solution of 1 problem on a grid of size 256x256x256, with
500 timesteps for the explicit solvers and 100 timesteps for the
implicit solvers. It also gives two measures of performance,
GFlops and GB/s (reported here for the K40 only), both as measured
by the profiler nvprof.
GTX670 (CUDA 5.5)
                single prec.            double prec.
                msec  GFlops  GB/s      msec  GFlops  GB/s
  explicit1     2349    -      -        5890    -      -
  explicit2     1527    -      -        5689    -      -
  explicit3      962    -      -        7130    -      -
  implicit2     1105    -      -        4384    -      -

K20c (CUDA 5.5)
                single prec.            double prec.
                msec  GFlops  GB/s      msec  GFlops  GB/s
  explicit1     1027    -      -        1803    -      -
  explicit2      756    -      -        1396    -      -
  explicit3      793    -      -        1363    -      -
  implicit2      656    -      -        1503    -      -

K40 (CUDA 6.0)
                single prec.            double prec.
                msec  GFlops  GB/s      msec  GFlops  GB/s
  explicit1      747   597    100       1200   367    127
  explicit2      600   760    132        923   487    144
  explicit3      552   829    151       1246   346    137
  implicit2      505   360    130        921   235    139
Notes:
On the K20, the best performance for explicit1/2 is obtained when each thread
is limited to 64 registers, whereas for explicit3 it is best when limited to
80 registers.
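One way to impose such a limit when compiling (an assumption about the build
setup, not necessarily how the timings above were produced) is the nvcc flag
-maxrregcount, e.g.

  nvcc -arch=sm_35 -O3 -maxrregcount=64 -o BS_3D_timing BS_3D_timing.cu

A different limit for an individual kernel such as explicit3 can instead be
steered with __launch_bounds__ on the kernel definition.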
Licensing
The codes are freely available to all under a GPL license. They are also provided to
NVIDIA under a permissive BSD license -- anyone else requiring a more permissive license
for commercial purposes should contact me.
Acknowledgements
This software was developed as part of research at the Oxford e-Research
Centre which was partially supported by the ASEArch project on Algorithms
and Software for Emerging Architectures (EP/J010553/1), and partly by an
Impact Acceleration Award, both funded by the UK Engineering and Physical
Sciences Research Council.