Mike Giles - 1D Black-Scholes on NVIDIA GPUs

1D Black-Scholes on NVIDIA GPUs

This page has CUDA codes for explicit and implicit discretisations of the 1D Black-Scholes PDE on NVIDIA GPUs. The degree of parallelism required on GPUs is achieved by solving multiple 1D problems at the same time. Each problem is solved on one SM (Streaming Multiprocessor) making it possible for a single CUDA kernel to be used for all of the timesteps in the whole computation.

A paper presented at WHPCF'14 (Workshop on HPC for Finance, 2014) is available here.



Comments on different kernels within BS_1D_timing.cu:


Results

The first table gives the number of floating point instructions carried out per thread, per timestep. For all of the kernels except for explicit1, this has to be divided by 8 to obtain the operations per grid point, per timestep. This information was obtained by using nvprof, the NVIDIA profiler, on the K20c.

single prec. double prec.
FMA mul add SFU FMA mul add SFU
explicit1 3 0 0 0 3 0 0 0
explicit2 24 0 0 0 24 0 0 0
implicit1 70 65 0 15 115 65 1 15
implicit2 86 73 16 15 131 73 17 15
implicit3 38 15 0 0 38 15 0 0


The four instruction types are FMA (fused multiply-add), multiplication, addition and SFU (special function unit). The last one corresponds to fast reciprocals in single precision, and approximate reciprocals in double precision. Inspecting the code, the operation counts are as expected, except for implicit1 and implicit2 which the compiler optimises impressively by moving some loop-invariant operations out of the main timestep loop, at the expense of some additional registers; this is despite the addition of a spurious time-varying contribution (with zero value) to the main diagonal of the tridiagonal system to try to force it to perform the full computation.

The table below gives execution times in milliseconds for both single precision and double precision computations on a mid-range GTX670 consumer graphics card, and a high-end K20c HPC card. The timing is for the solution of 2048 problems, each on a grid of size 256, with 50000 timesteps for the explicit solvers, and 2500 timesteps for the implicit solvers. It also gives two measures of performance, GFinsts (Giga Floating point instructions per second, based on the instruction data in the table above) and GFlops (Giga Floating point operations, with FMAs counting as 2 operations and the others as 1).

GTX670 K20c
single prec. double prec. single prec. double prec.
msec GFinsts GFlops msec GFinsts GFlops msec GFinsts GFlops msec GFinsts GFlops
explicit1 408 193 385 1354 58 116 324 243 486 381 206 413
explicit2 93 845 1691 1783 44 88 76 1038 2076 158 497 994
implicit1 35 704 1032 574 56 89 28 892 1308 80 401 637
implicit2 40 773 1124 691 56 87 33 948 1377 88 441 685
implicit3 17 503 863 364 24 41 14 643 1103 30 294 505

Accuracy:


Licensing

The codes are freely available to all under a GPL license. They are also provided to NVIDIA under a permissive BSD license -- anyone else requiring a more permissive license for commercial purposes should contact me.


Acknowledgements

This software was developed as part of research at the Oxford e-Research Centre which was partially supported by the ASEArch project on Algorithms and Software for Emerging Architectures (EP/J010553/1), and partly by an Impact Acceleration Award, both funded by the UK Engineering and Physical Sciences Research Council.


EMail

Mike.Giles@maths.ox.ac.uk