This page provides CUDA code for explicit and implicit discretisations of
the 1D Black-Scholes PDE on NVIDIA GPUs. The degree of parallelism
required on GPUs is achieved by solving multiple 1D problems at the same
time. Each problem is solved on one SM (Streaming Multiprocessor),
making it possible for a single CUDA kernel to be used for all of the
timesteps in the whole computation.
A paper on this work was presented at WHPCF'14 (Workshop on HPC for Finance, 2014).
BS_1D_timing.cu -- CUDA code testing kernels
for both explicit and implicit time-marching, with performance tests in
both single and double precision
explicit1 is a simple "natural" implementation of explicit time-marching with
an entire thread block working on each 1D problem, each thread working on a
single grid point, and shared memory used to exchange data between the threads.
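This layout can be sketched as follows; the kernel name, coefficients and boundary treatment are illustrative placeholders, not the actual code from BS_1D_timing.cu:

```cuda
// Sketch of the explicit1 layout: one thread block per 1D problem,
// one thread per grid point, shared memory for neighbour exchange.
__global__ void explicit1_sketch(float *u_all, int N, int ntimesteps,
                                 float a, float b, float c)
{
  extern __shared__ float u[];          // one grid of size N per block

  float *u_g = u_all + blockIdx.x * N;  // this block's 1D problem
  int    i   = threadIdx.x;

  u[i] = u_g[i];                        // load one grid point into shared memory

  for (int t = 0; t < ntimesteps; t++) {
    __syncthreads();
    // three-point explicit stencil; boundary values held fixed for simplicity
    float unew = (i > 0 && i < N-1)
               ? a*u[i-1] + b*u[i] + c*u[i+1]
               : u[i];
    __syncthreads();
    u[i] = unew;
  }

  u_g[i] = u[i];                        // write back the final solution
}
```

With 2048 problems on a grid of size 256, this would be launched as `explicit1_sketch<<<2048, 256, 256*sizeof(float)>>>(d_u, 256, ntimesteps, a, b, c);`.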
explicit2 is a more complicated implementation with a warp for each 1D problem,
each thread working on 8 grid points, and either warp shuffles or shared memory
used to exchange data at the boundaries between "neighbouring" threads.
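The warp-shuffle variant of that boundary exchange can be sketched as below (illustrative names; the 2014 code used the pre-Volta `__shfl_up`/`__shfl_down` intrinsics rather than the `_sync` variants shown here):

```cuda
// Sketch of the explicit2 layout: one warp per 1D problem of 32*8 = 256
// points, each thread holding 8 grid points in registers, with warp
// shuffles supplying the two values needed from the neighbouring lanes.
__global__ void explicit2_sketch(float *u_all, int ntimesteps,
                                 float a, float b, float c)
{
  float u[8], unew[8];
  int lane = threadIdx.x & 31;
  int warp = (blockIdx.x*blockDim.x + threadIdx.x) >> 5;
  float *u_g = u_all + warp*256 + lane*8;

  for (int j = 0; j < 8; j++) u[j] = u_g[j];

  for (int t = 0; t < ntimesteps; t++) {
    // fetch the boundary values held by the neighbouring lanes
    float u_left  = __shfl_up_sync  (0xffffffff, u[7], 1);  // lane-1's last point
    float u_right = __shfl_down_sync(0xffffffff, u[0], 1);  // lane+1's first point

    for (int j = 0; j < 8; j++) {
      float um = (j == 0) ? u_left  : u[j-1];
      float up = (j == 7) ? u_right : u[j+1];
      unew[j] = a*um + b*u[j] + c*up;
    }
    // hold the two global boundary points fixed
    if (lane == 0)  unew[0] = u[0];
    if (lane == 31) unew[7] = u[7];

    for (int j = 0; j < 8; j++) u[j] = unew[j];
  }

  for (int j = 0; j < 8; j++) u_g[j] = u[j];
}
```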
implicit1 uses implicit time-marching. It is similar to explicit2, with each
warp handling a separate 1D problem, and each thread working on 8 grid points.
A Thomas-type algorithm is used by each thread to reduce its number of unknowns
to 2, and then PCR (parallel cyclic reduction) is used within the warp to solve
the reduced system. The use of warp shuffles to exchange data between threads
means it requires a Kepler (or later) GPU.
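The warp-level PCR step can be sketched for the simpler case of one unknown per lane; the real kernel instead hands PCR the 2-unknowns-per-thread system left by the Thomas-type reduction, and the 2014 code used the pre-Volta shuffle intrinsics:

```cuda
// Warp-level parallel cyclic reduction (PCR), sketched for one unknown per
// lane: solve  a*x[i-1] + b*x[i] + c*x[i+1] = d  for x[i], i = lane index,
// with a = 0 on lane 0 and c = 0 on lane 31. Illustrative, not the real code.
__device__ float pcr_warp_sketch(float a, float b, float c, float d)
{
  int lane = threadIdx.x & 31;

  for (int s = 1; s < 32; s *= 2) {
    // neighbours' coefficients at distance s, fetched with warp shuffles
    float a_m = __shfl_up_sync  (0xffffffff, a, s);
    float b_m = __shfl_up_sync  (0xffffffff, b, s);
    float c_m = __shfl_up_sync  (0xffffffff, c, s);
    float d_m = __shfl_up_sync  (0xffffffff, d, s);
    float a_p = __shfl_down_sync(0xffffffff, a, s);
    float b_p = __shfl_down_sync(0xffffffff, b, s);
    float c_p = __shfl_down_sync(0xffffffff, c, s);
    float d_p = __shfl_down_sync(0xffffffff, d, s);

    // elimination factors; zero where no equation exists at distance s
    float rm = (lane >= s)     ? a / b_m : 0.0f;
    float rp = (lane < 32 - s) ? c / b_p : 0.0f;

    // substitute equations i-s and i+s, doubling the coupling distance
    b = b - rm*c_m - rp*a_p;
    d = d - rm*d_m - rp*d_p;
    a = -rm*a_m;
    c = -rp*c_p;
  }

  return d / b;   // after log2(32) = 5 steps the system is diagonal
}
```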
implicit2 is very similar to implicit1. The only difference is that implicit1
directly calculates the new values, whereas implicit2 computes the changes in
the solution over the timestep, then adds the changes to the old values to get
the new values. This improves the accuracy significantly, but requires more
operations and more registers to hold the old solution while computing the changes.
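The difference between the two update styles can be sketched in plain host code for a single scalar equation b*u_new = u_old (illustrative only; the real kernels solve a tridiagonal system per problem):

```c
#include <math.h>

/* implicit1-style: compute the new value directly */
float direct_form(float u_old, float b)
{
  return u_old / b;
}

/* implicit2-style: compute the change over the timestep, then add it on.
   The change du is small, so its rounding error is small relative to u_old,
   which is why this form is more accurate in single precision -- at the cost
   of the extra residual computation and of keeping u_old in registers. */
float delta_form(float u_old, float b)
{
  float resid = u_old - b*u_old;   /* residual of the old solution */
  float du    = resid / b;         /* change in the solution */
  return u_old + du;
}
```

The two forms are algebraically identical; they differ only in their rounding behaviour.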
implicit3 is based on code developed by Julien Demouth (NVIDIA). Mathematically,
it is the same as implicit1, but it exploits the fact that the tridiagonal
matrix does not change from one timestep to the next. Hence, about half of the
computation (and all of the reciprocals) can be done just once in a setup phase,
leaving much less to be done in the main solve phase for each timestep.
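The split into a one-off setup phase and a cheaper per-timestep solve phase can be sketched sequentially for a single tridiagonal system (hypothetical names; implicit3 itself does this per-thread and per-warp):

```c
/* Setup phase, done once: forward elimination on the matrix alone,
   storing the reciprocals of the modified diagonal. All of the
   reciprocal (SFU) operations happen here. */
void setup_phase(const float *a, const float *b, const float *c,
                 float *rb, int N)
{
  rb[0] = 1.0f / b[0];
  for (int i = 1; i < N; i++)
    rb[i] = 1.0f / (b[i] - a[i]*c[i-1]*rb[i-1]);
}

/* Solve phase, done every timestep: forward and backward sweeps on the
   right-hand side only, reusing the stored reciprocals -- roughly half
   the work of a full Thomas solve, with no divisions at all. */
void solve_phase(const float *a, const float *c, const float *rb,
                 float *d, int N)    /* d: RHS in, solution out */
{
  for (int i = 1; i < N; i++)
    d[i] = d[i] - a[i]*d[i-1]*rb[i-1];
  d[N-1] = d[N-1]*rb[N-1];
  for (int i = N-2; i >= 0; i--)
    d[i] = (d[i] - c[i]*d[i+1])*rb[i];
}
```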
The first table gives the number of floating point instructions
carried out per thread, per timestep. For all of the kernels except
for explicit1, this has to be divided by 8 to obtain the operations
per grid point, per timestep. This information was obtained by
using nvprof, the NVIDIA profiler, on the K20c.
The four instruction types are FMA (fused multiply-add),
multiplication, addition and SFU (special function unit).
The last one corresponds to fast reciprocals in single precision,
and approximate reciprocals in double precision. Inspecting the
code, the operation counts are as expected, except for implicit1
and implicit2, which the compiler optimises impressively by moving
some loop-invariant operations out of the main timestep loop, at
the expense of some additional registers. It does so despite the
addition of a spurious time-varying contribution (with zero value)
to the main diagonal of the tridiagonal system, intended to force it
to perform the full computation.
The table below gives execution times in milliseconds for both single
precision and double precision computations on a mid-range GTX670
consumer graphics card, and a high-end K20c HPC card. The timing is
for the solution of 2048 problems, each on a grid of size 256, with
50000 timesteps for the explicit solvers, and 2500 timesteps for the
implicit solvers. It also gives two measures of performance:
GFinsts (Giga Floating point instructions per second, based on the
instruction data in the table above) and GFlops (Giga Floating point
operations per second, with FMAs counting as 2 operations and the others as 1).
The double precision results match the MATLAB results to at least 12 decimal places.
The single precision explicit results have a relative error of approximately 1e-5.
implicit1 and implicit3 have a relative error of approximately 5e-5, but implicit2
has a relative error of about 1e-6. These observations are all consistent with
standard error analysis.
The discretisation error (the difference between the double precision results and the
analytic Black-Scholes solution) is about 1e-4, so the single precision error is smaller
than the discretisation error.
The codes are freely available to all under a GPL license. They are also provided to
NVIDIA under a permissive BSD license -- anyone else requiring a more permissive license
for commercial purposes should contact me.