This project builds on previous work on 1D Black-Scholes solvers,
but here the discretisation is 3D, on a grid large enough that only
one option is computed at a time.
At least one kernel launch is needed per timestep, so that each thread
block can obtain neighbouring data computed by other thread blocks in
the previous timestep.
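The overall structure is therefore a host-side loop with (at least) one launch per
timestep; a minimal sketch, assuming ping-ponged old/new device arrays and
illustrative launch parameters and argument lists (not the actual ones in the code):

  // one kernel launch per timestep: the implicit synchronisation between
  // launches guarantees that every thread block sees its neighbours'
  // results from step n before computing step n+1
  for (int n=0; n<N; n++) {
    explicit1<<<grid, block>>>(d_u_new, d_u_old, NX, NY, NZ);   // interior update
    bc1<<<grid_bc, block_bc>>>(d_u_new, NX, NY, NZ);            // boundary extrapolation
    float *tmp = d_u_old; d_u_old = d_u_new; d_u_new = tmp;     // ping-pong the arrays
  }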
A paper describing this work was presented at WHPCF'14 (Workshop on
High Performance Computational Finance, 2014).
A general second order discretisation of a 3D convection-diffusion
equation can involve a 27-point 3x3x3 stencil: the central point
plus all neighbours which differ by at most one grid point in each
of the three coordinate directions.
The PDE being solved here corresponds to 3-factor Geometric Brownian
Motion with positive correlation between the three driving Brownian
motions -- this leads to a discretisation with a 13-point stencil.
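For reference, a sketch of the read pattern this corresponds to, assuming that
with positive correlation only the (+,+)/(-,-) diagonal pair in each coordinate
plane is needed for the mixed derivatives (the offsets are in (i,j,k) index space
and the layout is illustrative):

  // 13-point stencil: centre, the 6 axis neighbours for the second
  // derivatives, plus one diagonal pair per coordinate plane for the
  // three mixed-derivative (correlation) terms
  const int stencil[13][3] = {
    { 0, 0, 0},                          // centre
    {-1, 0, 0}, { 1, 0, 0},              // i-direction
    { 0,-1, 0}, { 0, 1, 0},              // j-direction
    { 0, 0,-1}, { 0, 0, 1},              // k-direction
    {-1,-1, 0}, { 1, 1, 0},              // i-j mixed derivative
    {-1, 0,-1}, { 1, 0, 1},              // i-k mixed derivative
    { 0,-1,-1}, { 0, 1, 1}               // j-k mixed derivative
  };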
The main concern in the implementation is memory bandwidth, primarily
the bandwidth from device memory to either texture cache (for read-only
accesses) or L1 cache, but also the bandwidth between texture cache
and the SMX cores. On the K40, the first of these is about 280GB/s
while the second is roughly 1.2TB/s. Hence, if there is a lot of data
reuse within the texture cache (i.e. a data item read in from device
memory is used repeatedly from the texture cache) then it is the
latter which will be the bottleneck.
The compiler will optimise away a lot of the floating point operations
by pre-computing terms and saving them in registers. There is also a
significant number of integer additions for calculating array indices
and 64-bit addresses.
Nevertheless, except for double precision on the GTX670, it is not
clear whether the total compute load is the limiting factor.
A back-of-the-envelope estimate is that there are 15 FMAs,
15 FMULs/FADDs and 20-40 integer additions per grid point, compared
with 1 read and 1 write from/to device memory and either 7 or 13 reads
from the texture cache.
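As a rough illustration of why the texture cache bandwidth matters (assuming a
256^3 grid in single precision, the 13-point stencil, and the K40 figures above;
order-of-magnitude only):

  device memory traffic per timestep:  2  x 4 bytes x 256^3 ~ 0.13 GB  ->  0.13 GB / 280 GB/s  ~ 0.5 ms
  texture cache traffic per timestep:  13 x 4 bytes x 256^3 ~ 0.87 GB  ->  0.87 GB / 1200 GB/s ~ 0.7 ms

So once stencil reuse keeps device memory traffic down to one read and one write
per point, it is the texture cache side which is closer to being the bottleneck.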
BS_3D_timing.cu -- CUDA code testing
kernels for both explicit and (in the future) implicit time-marching,
with performance tests in both single and double precision
utilities.h -- header file with some utility
routines for reading / writing / incrementing device arrays using
array transposition through shared memory
trid.h -- same header file used by the
BS_1D codes
transpose.pdf -- some notes explaining
shared memory indices used for on-the-fly data transposition
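The transposition mentioned above follows the standard shared-memory tiling
pattern; a minimal sketch (the 32x32 tile, the padding and the kernel name are
illustrative, not necessarily the choices made in utilities.h):

  // launched with 32x32 thread blocks; each block stages a 32x32 tile in
  // shared memory, and the extra column of padding avoids bank conflicts
  // when the tile is read back column-wise
  __global__ void transpose32(float *out, const float *in, int NX, int NY)
  {
    __shared__ float tile[32][33];

    int x = blockIdx.x*32 + threadIdx.x;
    int y = blockIdx.y*32 + threadIdx.y;
    if (x < NX && y < NY) tile[threadIdx.y][threadIdx.x] = in[y*NX + x];

    __syncthreads();

    // write the transposed tile out with coalesced stores
    x = blockIdx.y*32 + threadIdx.x;
    y = blockIdx.x*32 + threadIdx.y;
    if (x < NY && y < NX) out[y*NY + x] = tile[threadIdx.x][threadIdx.y];
  }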
Comments on different kernels:
bc1 is a linear extrapolation boundary condition used by all of the solvers.
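That is, each boundary value is extrapolated linearly from the two nearest
interior points along the same grid line; for example (index names illustrative):

  // at the i=0 face, for each (j,k) boundary line
  u[ind0] = 2.0f*u[ind0+1] - u[ind0+2];   // ind0 = index of point (0,j,k)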
explicit1 is a relatively simple implementation in which each thread handles
a line of grid points in the k-direction, relying on the use of the texture cache
to achieve data reuse.
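A sketch of that structure, with constant coefficients standing in for the real
grid-dependent Black-Scholes terms (kernel name and argument list are
illustrative, not the actual code):

  __global__ void explicit1_sketch(float *u2, const float * __restrict__ u1,
                                   float c0, float ci, float cj, float ck,
                                   float cij, float cik, float cjk,
                                   int NX, int NY, int NZ)
  {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    int j = blockIdx.y*blockDim.y + threadIdx.y;
    if (i<1 || i>NX-2 || j<1 || j>NY-2) return;      // interior points only

    long str = (long)NX*NY;                          // plane stride (64-bit)

    for (int k=1; k<NZ-1; k++) {
      long ind = i + j*(long)NX + k*str;             // 64-bit address arithmetic

      // all 13 loads go through the read-only (texture) cache via __ldg
      u2[ind] = c0 *  __ldg(&u1[ind])
              + ci *( __ldg(&u1[ind-1])      + __ldg(&u1[ind+1])      )
              + cj *( __ldg(&u1[ind-NX])     + __ldg(&u1[ind+NX])     )
              + ck *( __ldg(&u1[ind-str])    + __ldg(&u1[ind+str])    )
              + cij*( __ldg(&u1[ind-1-NX])   + __ldg(&u1[ind+1+NX])   )
              + cik*( __ldg(&u1[ind-1-str])  + __ldg(&u1[ind+1+str])  )
              + cjk*( __ldg(&u1[ind-NX-str]) + __ldg(&u1[ind+NX+str]) );
    }
  }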
explicit2 reduces the number of loads from the texture cache by using registers
to store and then reuse some of the values from the previous plane k when moving
on to plane k+1.
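The register reuse in the k-direction looks roughly like this (again illustrative,
and showing only the axis terms of the stencil to keep the sketch short):

  __global__ void explicit2_sketch(float *u2, const float * __restrict__ u1,
                                   float c0, float ci, float cj, float ck,
                                   int NX, int NY, int NZ)
  {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    int j = blockIdx.y*blockDim.y + threadIdx.y;
    if (i<1 || i>NX-2 || j<1 || j>NY-2) return;

    long str = (long)NX*NY;
    long ind = i + j*(long)NX;             // element (i,j,0)

    // keep the k-axis values in registers so that stepping from plane k
    // to plane k+1 needs only one new load on that axis instead of three
    float u_km1 = __ldg(&u1[ind]);
    float u_k   = __ldg(&u1[ind + str]);
    float u_kp1;

    for (int k=1; k<NZ-1; k++) {
      ind  += str;                         // now points at (i,j,k)
      u_kp1 = __ldg(&u1[ind + str]);       // the only new k-axis load

      // axis terms only; the diagonal terms of the 13-point stencil are
      // handled in the same way in the real kernel
      u2[ind] = c0*u_k
              + ci*( __ldg(&u1[ind-1])  + __ldg(&u1[ind+1])  )
              + cj*( __ldg(&u1[ind-NX]) + __ldg(&u1[ind+NX]) )
              + ck*( u_km1 + u_kp1 );

      u_km1 = u_k;                         // rotate the registers for the next plane
      u_k   = u_kp1;
    }
  }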
explicit3 is based on explicit2 but eliminates a lot of the integer address
calculations by using float2/double2 variables. This gives a significant performance
improvement in single precision on the GTX670, but not on the K20 which (unlike
the GTX670) can load the data through the read-only texture cache.
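The essence of the float2 trick is that one 64-bit load and one index calculation
now serve two neighbouring grid points; a hypothetical helper for the i-direction
terms, assuming the row is accessed as float2 pairs (not code from the repository):

  // "ind" indexes float2 elements, so each call covers grid points 2i and 2i+1
  __device__ float2 i_direction_terms(const float2 * __restrict__ u1v,
                                      long ind, float ci)
  {
    float2 w = __ldg(&u1v[ind-1]);   // points (2i-2, 2i-1)
    float2 c = __ldg(&u1v[ind  ]);   // points (2i  , 2i+1)
    float2 e = __ldg(&u1v[ind+1]);   // points (2i+2, 2i+3)

    // neighbours of c.x are w.y and c.y; neighbours of c.y are c.x and e.x
    return make_float2( ci*(w.y + c.y), ci*(c.x + e.x) );
  }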
implicit1 will use one kernel to construct the tridiagonal matrices in each
coordinate direction and define the r.h.s., and will then call Jeremy Appleyard's
batched tridiagonal solver to perform the tridiagonal calculations -- this version has not yet been finished.
implicit2 reduces the bandwidth requirements by generating the tridiagonal
matrices on-the-fly. This requires more flops but a lot less communication,
and so wins overall.
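A hedged sketch of what "on-the-fly" means here, using a generic constant-coefficient
convection-diffusion term u_t = D u_xx - v u_x purely for illustration (the real
coefficients vary with the grid point through the drift, volatility and correlation
terms):

  // tridiagonal coefficients for one implicit sweep, recomputed per element
  // instead of being stored in device memory
  __device__ void tridiag_coeffs(float D, float v, float dx, float dt,
                                 float *a, float *b, float *c)
  {
    float lam = dt*D/(dx*dx);        // diffusion number
    float gam = dt*v/(2.0f*dx);      // convection number

    *a = -(lam + gam);               // coefficient of u[j-1]
    *b = 1.0f + 2.0f*lam;            // coefficient of u[j]
    *c = -(lam - gam);               // coefficient of u[j+1]
  }

Recomputing a, b and c like this inside the solver avoids streaming three extra
arrays per unknown through device memory, which is where the bandwidth saving
comes from.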
Results
The table below gives execution times in milliseconds for both single
precision and double precision computations on a mid-range GTX670
consumer graphics card, and on high-end K20c and K40 HPC cards. The timing is
for the solution of 1 problem on a grid of size 256x256x256, with
500 timesteps for the explicit solvers and 100 timesteps for the
implicit solvers. It also gives two measures of performance,
GFlops and GB/s (reported here for the K40 only), both as measured
by the profiler nvprof.
GTX670 (CUDA 5.5)
                single prec.            double prec.
                msec  GFlops  GB/s      msec  GFlops  GB/s
  explicit1     2349    -      -        5890    -      -
  explicit2     1527    -      -        5689    -      -
  explicit3      962    -      -        7130    -      -
  implicit2     1105    -      -        4384    -      -

K20c (CUDA 5.5)
                single prec.            double prec.
                msec  GFlops  GB/s      msec  GFlops  GB/s
  explicit1     1027    -      -        1803    -      -
  explicit2      756    -      -        1396    -      -
  explicit3      793    -      -        1363    -      -
  implicit2      656    -      -        1503    -      -

K40 (CUDA 6.0)
                single prec.            double prec.
                msec  GFlops  GB/s      msec  GFlops  GB/s
  explicit1      747   597    100       1200   367    127
  explicit2      600   760    132        923   487    144
  explicit3      552   829    151       1246   346    137
  implicit2      505   360    130        921   235    139
Notes:
On the K20, the best performance for explicit1/2 is obtained when each thread
is limited to 64 registers, whereas for explicit3 it is best when limited to
80 registers.
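One way to impose such a limit when compiling (an assumption about the build
setup, not necessarily how the timings above were produced) is the nvcc flag
-maxrregcount, e.g.

  nvcc -arch=sm_35 -O3 -maxrregcount=64 -o BS_3D_timing BS_3D_timing.cu

A different limit for an individual kernel such as explicit3 can instead be
steered with __launch_bounds__ on the kernel definition.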
Licensing
The codes are freely available to all under a GPL license. They are also provided to
NVIDIA under a permissive BSD license -- anyone else requiring a more permissive license
for commercial purposes should contact me.
Acknowledgements
This software was developed as part of research at the Oxford e-Research
Centre which was partially supported by the ASEArch project on Algorithms
and Software for Emerging Architectures (EP/J010553/1), and partly by an
Impact Acceleration Award, both funded by the UK Engineering and Physical
Sciences Research Council.