A paper presented at WHPCF'14 (Workshop on High Performance Computational Finance, 2014) is available here.

- BS_1D_timing.cu -- CUDA code testing kernels for both explicit and implicit time-marching, with performance tests in both single and double precision
- trid.h -- header file with tridiagonal solvers
- warp_trid.cu -- code to test some solvers in trid.h
- Makefile
- BS_1D.m -- MATLAB code to generate model answers for both explicit and implicit time-marching

Comments on different kernels within BS_1D_timing.cu:

- explicit1 is a simple "natural" implementation of explicit time-marching with an entire thread block working on each 1D problem, each thread working on a single grid point, and shared memory used to exchange data between the threads.
- explicit2 is a more complicated implementation with a warp for each 1D problem, each thread working on 8 grid points, and either warp shuffles or shared memory used to exchange data at the boundaries between "neighbouring" threads.
- implicit1 uses implicit time-marching. It is similar to explicit2, with each warp handling a separate 1D problem, and each thread working on 8 grid points. A Thomas-type algorithm is used by each thread to reduce its number of unknowns to 2, and then PCR (parallel cyclic reduction) is used within the warp to solve the reduced system. The use of warp shuffles to exchange data between threads means it will work only on Kepler GPUs.
- implicit2 is very similar to implicit1. The only difference is that implicit1 directly calculates the new values, whereas implicit2 computes the changes in the solution over the timestep, then adds the changes to the old values to get the new values. This improves the accuracy significantly, but requires more operations and more registers to hold the old solution while computing the changes.
- implicit3 is based on code developed by Julien Demouth (NVIDIA). Mathematically, it is the same as implicit1, but it exploits the fact that the tridiagonal matrix does not change from one timestep to the next. Hence, about half of the computation (and all of the reciprocals) can be done just once in a setup phase, leaving much less to be done in the main solve phase for each timestep.
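Each implicit kernel solves a tridiagonal system every timestep. As a CPU reference for what the per-warp solve computes (the kernels themselves use the Thomas/PCR hybrid described above, not this serial form), here is a minimal sketch of the classic Thomas algorithm:

```c
/* Classic Thomas algorithm for the tridiagonal system
     a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i],   i = 0 .. n-1
   (a[0] and c[n-1] are ignored). Overwrites c and d as workspace;
   the solution is returned in d. */
void thomas(int n, const double *a, const double *b, double *c, double *d)
{
    c[0] = c[0] / b[0];
    d[0] = d[0] / b[0];
    for (int i = 1; i < n; i++) {
        double r = 1.0 / (b[i] - a[i]*c[i-1]);   /* one reciprocal per row */
        c[i] = c[i] * r;
        d[i] = (d[i] - a[i]*d[i-1]) * r;
    }
    for (int i = n-2; i >= 0; i--)               /* back substitution */
        d[i] = d[i] - c[i]*d[i+1];
}
```

In implicit1 each thread runs a reduction of this type on its own 8 grid points, leaving 2 unknowns per thread, and the resulting 64-unknown warp-wide system is then solved by PCR.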

Instruction counts per thread per timestep:

| kernel | FMA (single) | mul (single) | add (single) | SFU (single) | FMA (double) | mul (double) | add (double) | SFU (double) |
|---|---|---|---|---|---|---|---|---|
| explicit1 | 3 | 0 | 0 | 0 | 3 | 0 | 0 | 0 |
| explicit2 | 24 | 0 | 0 | 0 | 24 | 0 | 0 | 0 |
| implicit1 | 70 | 65 | 0 | 15 | 115 | 65 | 1 | 15 |
| implicit2 | 86 | 73 | 16 | 15 | 131 | 73 | 17 | 15 |
| implicit3 | 38 | 15 | 0 | 0 | 38 | 15 | 0 | 0 |

The four instruction types are FMA (fused multiply-add), multiplication, addition and SFU (special function unit); the SFU count corresponds to fast reciprocals in single precision and approximate reciprocals in double precision. Inspecting the code, the operation counts are as expected, except for implicit1 and implicit2, which the compiler optimises impressively by moving some loop-invariant operations out of the main timestep loop, at the expense of some additional registers. It does this despite the addition of a spurious time-varying contribution (with zero value) to the main diagonal of the tridiagonal system, intended to force it to perform the full computation.
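The setup/solve split exploited by implicit3 can be sketched on the CPU as follows. When the tridiagonal matrix is the same every timestep, the forward-elimination coefficients (and all the reciprocals) can be computed once, so that each timestep's solve uses only multiply/add operations on the right-hand side. The function names and the serial (non-PCR) formulation here are illustrative, not the kernel's actual code:

```c
/* Setup phase, done once: factor the constant tridiagonal matrix.
   cp[] is the modified super-diagonal and r[] holds the precomputed
   reciprocals of the pivots -- all divisions happen here. */
void trid_setup(int n, const double *a, const double *b, const double *c,
                double *cp, double *r)
{
    r[0]  = 1.0 / b[0];
    cp[0] = c[0] * r[0];
    for (int i = 1; i < n; i++) {
        r[i]  = 1.0 / (b[i] - a[i]*cp[i-1]);
        cp[i] = c[i] * r[i];
    }
}

/* Solve phase, done every timestep: only multiplies and FMAs,
   no reciprocals. Overwrites the right-hand side d with the solution. */
void trid_solve(int n, const double *a, const double *cp, const double *r,
                double *d)
{
    d[0] *= r[0];
    for (int i = 1; i < n; i++)
        d[i] = (d[i] - a[i]*d[i-1]) * r[i];
    for (int i = n-2; i >= 0; i--)
        d[i] -= cp[i]*d[i+1];
}
```

This is why implicit3's per-timestep instruction counts are roughly half those of implicit1, with no SFU instructions at all.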

Below are execution times in milliseconds for both single precision and double precision computations on a mid-range GTX670 consumer graphics card, and a high-end K20c HPC card. The timing is for the solution of 2048 problems, each on a grid of size 256, with 50000 timesteps for the explicit solvers and 2500 timesteps for the implicit solvers. Two measures of performance are also given: GFinsts (giga floating-point instructions per second, based on the instruction data in the table above) and GFlops (giga floating-point operations per second, with FMAs counting as 2 operations and the others as 1).
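As a sanity check, the GFinsts figures can be reproduced from the run parameters and the per-thread instruction counts. A small helper (the function name is mine, for illustration only):

```c
/* GFinsts = total floating-point instructions issued / runtime / 1e9. */
double gfinsts(double insts_per_thread, double nthreads,
               double nsteps, double seconds)
{
    return insts_per_thread * nthreads * nsteps / (seconds * 1e9);
}
```

For explicit1 in single precision on the GTX670, with one thread per grid point, `gfinsts(3, 2048*256, 50000, 0.408)` gives approximately 193, and counting each FMA as two operations gives approximately 385 GFlops, matching the table. For implicit1 (one warp of 32 threads per problem, 70+65+15 = 150 instructions per thread), `gfinsts(150, 2048*32, 2500, 0.035)` gives approximately 702, consistent with the tabulated 704.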

GTX670:

| kernel | msec (single) | GFinsts (single) | GFlops (single) | msec (double) | GFinsts (double) | GFlops (double) |
|---|---|---|---|---|---|---|
| explicit1 | 408 | 193 | 385 | 1354 | 58 | 116 |
| explicit2 | 93 | 845 | 1691 | 1783 | 44 | 88 |
| implicit1 | 35 | 704 | 1032 | 574 | 56 | 89 |
| implicit2 | 40 | 773 | 1124 | 691 | 56 | 87 |
| implicit3 | 17 | 503 | 863 | 364 | 24 | 41 |

K20c:

| kernel | msec (single) | GFinsts (single) | GFlops (single) | msec (double) | GFinsts (double) | GFlops (double) |
|---|---|---|---|---|---|---|
| explicit1 | 324 | 243 | 486 | 381 | 206 | 413 |
| explicit2 | 76 | 1038 | 2076 | 158 | 497 | 994 |
| implicit1 | 28 | 892 | 1308 | 80 | 401 | 637 |
| implicit2 | 33 | 948 | 1377 | 88 | 441 | 685 |
| implicit3 | 14 | 643 | 1103 | 30 | 294 | 505 |

Accuracy:

- The double precision results match the MATLAB results to at least 12 decimal places.
- The single precision explicit results have a relative error of approximately 1e-5. implicit1 and implicit3 have a relative error of approximately 5e-5, but implicit2 has a relative error of about 1e-6. These observations are all consistent with standard error analysis.
- The discretisation error (the difference between the double precision results and the analytic Black-Scholes solution) is about 1e-4, so the single precision error is less than this.