This page provides CUDA code for explicit and implicit discretisations of
the 1D Black-Scholes PDE on NVIDIA GPUs. The degree of parallelism
required on GPUs is achieved by solving many 1D problems at the same
time. Each problem is solved on one SM (Streaming Multiprocessor),
making it possible for a single CUDA kernel to be used for all of the
timesteps in the whole computation.
A paper presented at WHPCF'14 (the 2014 Workshop on High Performance
Computational Finance) is available
here.
BS_1D_timing.cu -- CUDA code testing kernels
for both explicit and implicit time-marching, with performance tests in
both single and double precision
explicit1 is a simple "natural" implementation of explicit time-marching with
an entire thread block working on each 1D problem, each thread working on a
single grid point, and shared memory used to exchange data between the threads.
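At each interior grid point, explicit time-marching is just a three-point
stencil update. A serial C++ sketch of one timestep is below (the coefficient
arrays a, b, c are illustrative names that fold together the Black-Scholes
coefficients and the timestep; the kernel assigns one thread per grid point,
with shared memory in place of the u array):

```cpp
#include <vector>

// One explicit timestep of a three-point stencil:
//   u_new[i] = a[i]*u[i-1] + b[i]*u[i] + c[i]*u[i+1]
// i.e. three FMAs per grid point, matching the instruction counts below.
// The boundary values are simply carried over unchanged here.
std::vector<double> explicit_step(const std::vector<double>& u,
                                  const std::vector<double>& a,
                                  const std::vector<double>& b,
                                  const std::vector<double>& c) {
    std::vector<double> u_new(u);
    for (std::size_t i = 1; i + 1 < u.size(); ++i)
        u_new[i] = a[i]*u[i-1] + b[i]*u[i] + c[i]*u[i+1];
    return u_new;
}
```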
explicit2 is a more complicated implementation with a warp for each 1D problem,
each thread working on 8 grid points, and either warp shuffles or shared memory
used to exchange data at the boundaries between "neighbouring" threads.
implicit1 uses implicit time-marching. It is similar to explicit2, with each
warp handling a separate 1D problem, and each thread working on 8 grid points.
A Thomas-type algorithm is used by each thread to reduce its number of unknowns
to 2, and then PCR (parallel cyclic reduction) is used within the warp to solve
the reduced system. The use of warp shuffles to exchange data between threads
means it requires a Kepler (or later) GPU.
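The per-thread reduction is in essence the classical Thomas algorithm; for
reference, a serial C++ sketch of a complete Thomas solve is given below
(illustrative code, not the kernel's warp-level variant, which stops once two
unknowns per thread remain and hands over to PCR):

```cpp
#include <vector>

// Thomas algorithm: solve the tridiagonal system A x = d, where
// a is the sub-diagonal (a[0] unused), b the diagonal, and c the
// super-diagonal (c[n-1] unused).  Returns the solution vector.
std::vector<double> thomas(std::vector<double> a, std::vector<double> b,
                           std::vector<double> c, std::vector<double> d) {
    const std::size_t n = b.size();
    for (std::size_t i = 1; i < n; ++i) {      // forward elimination
        double w = a[i] / b[i-1];
        b[i] -= w * c[i-1];
        d[i] -= w * d[i-1];
    }
    d[n-1] /= b[n-1];                          // back substitution
    for (std::size_t i = n - 1; i-- > 0; )
        d[i] = (d[i] - c[i] * d[i+1]) / b[i];
    return d;
}
```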
implicit2 is very similar to implicit1. The only difference is that implicit1
directly calculates the new values, whereas implicit2 computes the changes in
the solution over the timestep, then adds the changes to the old values to get
the new values. This improves the accuracy significantly, but requires more
operations and more registers to hold the old solution while computing the changes.
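A scalar toy model (not the PDE solve itself) illustrates why the delta form
helps in single precision. For implicit Euler applied to u' = lambda*u, the
direct form computes the whole new value each step, while the delta form
computes only the small change and adds it on, so the solver's rounding acts
on the change rather than on the full solution value:

```cpp
// Direct form: u_new = u / (1 - h*lambda), rounded as a whole each step.
float march_direct(float u, float h, float lambda, int steps) {
    for (int i = 0; i < steps; ++i)
        u = u / (1.0f - h*lambda);
    return u;
}

// Delta form: solve for the small change du, then add it to the old value.
// Mathematically identical, but the division's rounding error now scales
// with the small change du rather than with u itself.
float march_delta(float u, float h, float lambda, int steps) {
    for (int i = 0; i < steps; ++i) {
        float du = (h*lambda*u) / (1.0f - h*lambda);
        u = u + du;
    }
    return u;
}
```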
implicit3 is based on code developed by Julien Demouth (NVIDIA). Mathematically,
it is the same as implicit1, but it exploits the fact that the tridiagonal
matrix does not change from one timestep to the next. Hence, about half of the
computation (and all of the reciprocals) can be done just once in a setup phase,
leaving much less to be done in the main solve phase for each timestep.
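In serial form the same split can be sketched as a prefactored Thomas solve:
the setup phase computes the elimination multipliers and the reciprocals of
the modified diagonal once, and each timestep then needs only substitution
(names are illustrative, not the kernel's actual data layout):

```cpp
#include <vector>

// Factor the (time-independent) tridiagonal matrix once: a is the
// sub-diagonal (a[0] unused), b the diagonal, c the super-diagonal.
struct TriFactor {
    std::vector<double> w;      // forward-elimination multipliers
    std::vector<double> inv_b;  // reciprocals of the modified diagonal
    std::vector<double> c;      // super-diagonal (unchanged)
};

TriFactor setup(const std::vector<double>& a, std::vector<double> b,
                const std::vector<double>& c) {
    const std::size_t n = b.size();
    std::vector<double> w(n, 0.0), inv_b(n);
    for (std::size_t i = 1; i < n; ++i) {
        w[i] = a[i] / b[i-1];
        b[i] -= w[i] * c[i-1];
    }
    for (std::size_t i = 0; i < n; ++i) inv_b[i] = 1.0 / b[i];
    return {w, inv_b, c};
}

// Per-timestep solve: no divisions remain, only multiplies and FMAs.
std::vector<double> solve(const TriFactor& f, std::vector<double> d) {
    const std::size_t n = d.size();
    for (std::size_t i = 1; i < n; ++i) d[i] -= f.w[i] * d[i-1];
    d[n-1] *= f.inv_b[n-1];
    for (std::size_t i = n - 1; i-- > 0; )
        d[i] = (d[i] - f.c[i] * d[i+1]) * f.inv_b[i];
    return d;
}
```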
Results
The first table gives the number of floating point instructions
carried out per thread, per timestep. For all of the kernels except
for explicit1, this has to be divided by 8 to obtain the operations
per grid point, per timestep. This information was obtained by
using nvprof, the NVIDIA profiler, on the K20c.
            single prec.             double prec.
           FMA  mul  add  SFU      FMA  mul  add  SFU
explicit1    3    0    0    0        3    0    0    0
explicit2   24    0    0    0       24    0    0    0
implicit1   70   65    0   15      115   65    1   15
implicit2   86   73   16   15      131   73   17   15
implicit3   38   15    0    0       38   15    0    0
The four instruction types are FMA (fused multiply-add),
multiplication, addition and SFU (special function unit). The SFU count
corresponds to fast reciprocals in single precision,
and approximate reciprocals in double precision. Inspecting the
code, the operation counts are as expected, except for implicit1
and implicit2, which the compiler optimises impressively by moving
some loop-invariant operations out of the main timestep loop at
the expense of some additional registers. It does this despite the
addition of a spurious time-varying contribution (with zero value)
to the main diagonal of the tridiagonal system, intended to force it
to perform the full computation.
The table below gives execution times in milliseconds for both single
precision and double precision computations on a mid-range GTX670
consumer graphics card, and a high-end K20c HPC card. The timing is
for the solution of 2048 problems, each on a grid of size 256, with
50000 timesteps for the explicit solvers, and 2500 timesteps for the
implicit solvers. It also gives two measures of performance:
GFinsts (giga floating-point instructions per second, based on the
instruction data in the table above) and GFlops (giga floating-point
operations per second, with FMAs counting as two operations and the
others as one).
                       GTX670                                     K20c
            single prec.         double prec.         single prec.         double prec.
           msec GFinsts GFlops  msec GFinsts GFlops  msec GFinsts GFlops  msec GFinsts GFlops
explicit1   408    193    385   1354     58    116    324    243    486    381    206    413
explicit2    93    845   1691   1783     44     88     76   1038   2076    158    497    994
implicit1    35    704   1032    574     56     89     28    892   1308     80    401    637
implicit2    40    773   1124    691     56     87     33    948   1377     88    441    685
implicit3    17    503    863    364     24     41     14    643   1103     30    294    505
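As a sanity check, the explicit1 entries for the K20c in single precision can
be reproduced from the run parameters and the instruction counts in the first
table:

```cpp
// 2048 problems x 256 grid points x 50000 timesteps, with 3 FMA
// instructions per grid point per timestep, executed in 324 ms.
double gfinsts_explicit1() {
    double insts = 2048.0 * 256 * 50000 * 3;  // total FMA instructions
    return insts / 0.324 / 1e9;               // roughly 243 GFinsts
}

double gflops_explicit1() {
    return 2.0 * gfinsts_explicit1();         // an FMA counts as 2 flops
}
```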
Accuracy:
The double precision results match the MATLAB results to at least 12 decimal places.
The single precision explicit results have a relative error of approximately 1e-5.
Of the implicit solvers, implicit1 and implicit3 have a relative error of
approximately 5e-5, while implicit2 has a relative error of about 1e-6. These
observations are all consistent with standard rounding error analysis.
The discretisation error (the difference between the double precision results and the
analytic Black-Scholes solution) is about 1e-4, so the single precision error is less
than this.
Licensing
The codes are freely available to all under a GPL license. They are also provided to
NVIDIA under a permissive BSD license -- anyone else requiring a more permissive license
for commercial purposes should contact me.
Acknowledgements
This software was developed as part of research at the Oxford e-Research
Centre which was supported partly by the ASEArch project on Algorithms
and Software for Emerging Architectures (EP/J010553/1), and partly by an
Impact Acceleration Award, both funded by the UK Engineering and Physical
Sciences Research Council.