This page provides CUDA code for explicit and implicit discretisations of
the 1D Black-Scholes PDE on NVIDIA GPUs. The degree of parallelism
required on GPUs is achieved by solving many 1D problems at the same
time. Each problem is solved on one SM (Streaming Multiprocessor),
making it possible for a single CUDA kernel to be used for all of the
timesteps in the whole computation.
A paper presented at WHPCF'14 (the 2014 Workshop on High Performance
Computational Finance) is available
here.
BS_1D_timing.cu -- CUDA code testing kernels
for both explicit and implicit time-marching, with performance tests in
both single and double precision
explicit1 is a simple "natural" implementation of explicit time-marching with
an entire thread block working on each 1D problem, each thread working on a
single grid point, and shared memory used to exchange data between the threads.
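At each interior grid point, explicit time-marching is just a three-point
stencil update. A serial C++ sketch of one timestep is below (the coefficient
arrays a, b, c are illustrative names that fold together the Black-Scholes
coefficients and the timestep; the kernel assigns one thread per grid point,
with shared memory in place of the u array):

```cpp
#include <vector>

// One explicit timestep of a three-point stencil:
//   u_new[i] = a[i]*u[i-1] + b[i]*u[i] + c[i]*u[i+1]
// i.e. three FMAs per grid point, matching the instruction counts below.
// The boundary values are simply carried over unchanged here.
std::vector<double> explicit_step(const std::vector<double>& u,
                                  const std::vector<double>& a,
                                  const std::vector<double>& b,
                                  const std::vector<double>& c) {
    std::vector<double> u_new(u);
    for (std::size_t i = 1; i + 1 < u.size(); ++i)
        u_new[i] = a[i]*u[i-1] + b[i]*u[i] + c[i]*u[i+1];
    return u_new;
}
```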
explicit2 is a more complicated implementation with a warp for each 1D problem,
each thread working on 8 grid points, and either warp shuffles or shared memory
used to exchange data at the boundaries between "neighbouring" threads.
implicit1 uses implicit time-marching. It is similar to explicit2, with each
warp handling a separate 1D problem, and each thread working on 8 grid points.
A Thomas-type algorithm is used by each thread to reduce its number of unknowns
to 2, and then PCR (parallel cyclic reduction) is used within the warp to solve
the reduced system. The use of warp shuffles to exchange data between threads
means it requires a Kepler (or later) GPU.
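The per-thread reduction is in essence the classical Thomas algorithm; for
reference, a serial C++ sketch of a complete Thomas solve is given below
(illustrative code, not the kernel's warp-level variant, which stops once two
unknowns per thread remain and hands over to PCR):

```cpp
#include <vector>

// Thomas algorithm: solve the tridiagonal system A x = d, where
// a is the sub-diagonal (a[0] unused), b the diagonal, and c the
// super-diagonal (c[n-1] unused).  Returns the solution vector.
std::vector<double> thomas(std::vector<double> a, std::vector<double> b,
                           std::vector<double> c, std::vector<double> d) {
    const std::size_t n = b.size();
    for (std::size_t i = 1; i < n; ++i) {      // forward elimination
        double w = a[i] / b[i-1];
        b[i] -= w * c[i-1];
        d[i] -= w * d[i-1];
    }
    d[n-1] /= b[n-1];                          // back substitution
    for (std::size_t i = n - 1; i-- > 0; )
        d[i] = (d[i] - c[i] * d[i+1]) / b[i];
    return d;
}
```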
implicit2 is very similar to implicit1. The only difference is that implicit1
directly calculates the new values, whereas implicit2 computes the changes in
the solution over the timestep, then adds the changes to the old values to get
the new values. This improves the accuracy significantly, but requires more
operations and more registers to hold the old solution while computing the changes.
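A scalar toy model (not the PDE solve itself) illustrates why the delta form
helps in single precision. For implicit Euler applied to u' = lambda*u, the
direct form computes the whole new value each step, while the delta form
computes only the small change and adds it on, so the solver's rounding acts
on the change rather than on the full solution value:

```cpp
// Direct form: u_new = u / (1 - h*lambda), rounded as a whole each step.
float march_direct(float u, float h, float lambda, int steps) {
    for (int i = 0; i < steps; ++i)
        u = u / (1.0f - h*lambda);
    return u;
}

// Delta form: solve for the small change du, then add it to the old value.
// Mathematically identical, but the division's rounding error now scales
// with the small change du rather than with u itself.
float march_delta(float u, float h, float lambda, int steps) {
    for (int i = 0; i < steps; ++i) {
        float du = (h*lambda*u) / (1.0f - h*lambda);
        u = u + du;
    }
    return u;
}
```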
implicit3 is based on code developed by Julien Demouth (NVIDIA). Mathematically,
it is the same as implicit1, but it exploits the fact that the tridiagonal
matrix does not change from one timestep to the next. Hence, about half of the
computation (and all of the reciprocals) can be done just once in a setup phase,
leaving much less to be done in the main solve phase for each timestep.
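In serial form the same split can be sketched as a prefactored Thomas solve:
the setup phase computes the elimination multipliers and the reciprocals of
the modified diagonal once, and each timestep then needs only substitution
(names are illustrative, not the kernel's actual data layout):

```cpp
#include <vector>

// Factor the (time-independent) tridiagonal matrix once: a is the
// sub-diagonal (a[0] unused), b the diagonal, c the super-diagonal.
struct TriFactor {
    std::vector<double> w;      // forward-elimination multipliers
    std::vector<double> inv_b;  // reciprocals of the modified diagonal
    std::vector<double> c;      // super-diagonal (unchanged)
};

TriFactor setup(const std::vector<double>& a, std::vector<double> b,
                const std::vector<double>& c) {
    const std::size_t n = b.size();
    std::vector<double> w(n, 0.0), inv_b(n);
    for (std::size_t i = 1; i < n; ++i) {
        w[i] = a[i] / b[i-1];
        b[i] -= w[i] * c[i-1];
    }
    for (std::size_t i = 0; i < n; ++i) inv_b[i] = 1.0 / b[i];
    return {w, inv_b, c};
}

// Per-timestep solve: no divisions remain, only multiplies and FMAs.
std::vector<double> solve(const TriFactor& f, std::vector<double> d) {
    const std::size_t n = d.size();
    for (std::size_t i = 1; i < n; ++i) d[i] -= f.w[i] * d[i-1];
    d[n-1] *= f.inv_b[n-1];
    for (std::size_t i = n - 1; i-- > 0; )
        d[i] = (d[i] - f.c[i] * d[i+1]) * f.inv_b[i];
    return d;
}
```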
Results
The first table gives the number of floating point instructions
carried out per thread, per timestep. For all of the kernels except
for explicit1, this has to be divided by 8 to obtain the operations
per grid point, per timestep. This information was obtained by
using nvprof, the NVIDIA profiler, on the K20c.
            single prec.             double prec.
           FMA  mul  add  SFU      FMA  mul  add  SFU
explicit1    3    0    0    0        3    0    0    0
explicit2   24    0    0    0       24    0    0    0
implicit1   70   65    0   15      115   65    1   15
implicit2   86   73   16   15      131   73   17   15
implicit3   38   15    0    0       38   15    0    0
The four instruction types are FMA (fused multiply-add),
multiplication, addition and SFU (special function unit). The SFU count
corresponds to fast reciprocals in single precision,
and approximate reciprocals in double precision. Inspecting the
code, the operation counts are as expected, except for implicit1
and implicit2, which the compiler optimises impressively by moving
some loop-invariant operations out of the main timestep loop at
the expense of some additional registers. It does this despite the
addition of a spurious time-varying contribution (with zero value)
to the main diagonal of the tridiagonal system, intended to force it
to perform the full computation.
The table below gives execution times in milliseconds for both single
precision and double precision computations on a mid-range GTX670
consumer graphics card, and a high-end K20c HPC card. The timing is
for the solution of 2048 problems, each on a grid of size 256, with
50000 timesteps for the explicit solvers, and 2500 timesteps for the
implicit solvers. It also gives two measures of performance:
GFinsts (giga floating-point instructions per second, based on the
instruction data in the table above) and GFlops (giga floating-point
operations per second, with FMAs counting as two operations and the
others as one).
                       GTX670                                     K20c
            single prec.         double prec.         single prec.         double prec.
           msec GFinsts GFlops  msec GFinsts GFlops  msec GFinsts GFlops  msec GFinsts GFlops
explicit1   408    193    385   1354     58    116    324    243    486    381    206    413
explicit2    93    845   1691   1783     44     88     76   1038   2076    158    497    994
implicit1    35    704   1032    574     56     89     28    892   1308     80    401    637
implicit2    40    773   1124    691     56     87     33    948   1377     88    441    685
implicit3    17    503    863    364     24     41     14    643   1103     30    294    505
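As a sanity check, the explicit1 entries for the K20c in single precision can
be reproduced from the run parameters and the instruction counts in the first
table:

```cpp
// 2048 problems x 256 grid points x 50000 timesteps, with 3 FMA
// instructions per grid point per timestep, executed in 324 ms.
double gfinsts_explicit1() {
    double insts = 2048.0 * 256 * 50000 * 3;  // total FMA instructions
    return insts / 0.324 / 1e9;               // roughly 243 GFinsts
}

double gflops_explicit1() {
    return 2.0 * gfinsts_explicit1();         // an FMA counts as 2 flops
}
```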
Accuracy:
The double precision results match the MATLAB results to at least 12 decimal places.
The single precision explicit results have a relative error of approximately 1e-5.
Of the implicit solvers, implicit1 and implicit3 have a relative error of
approximately 5e-5, while implicit2 has a relative error of about 1e-6. These
observations are all consistent with standard rounding error analysis.
The discretisation error (the difference between the double precision results and the
analytic Black-Scholes solution) is about 1e-4, so the single precision error is less
than this.
Licensing
The codes are freely available to all under a GPL license. They are also provided to
NVIDIA under a permissive BSD license -- anyone else requiring a more permissive license
for commercial purposes should contact me.
Acknowledgements
This software was developed as part of research at the Oxford e-Research
Centre which was supported partly by the ASEArch project on Algorithms
and Software for Emerging Architectures (EP/J010553/1), and partly by an
Impact Acceleration Award, both funded by the UK Engineering and Physical
Sciences Research Council.