Efficient parallel random number generation is a basic requirement of Monte Carlo simulation.
So far, I have implemented two generators:
I am currently working with NAG to incorporate these into a new set of numerical routines for GPUs, which will be available free to academics who sign a collaborative agreement.
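To illustrate why a generator such as mrg32k3a is convenient for parallel work, here is a minimal Python sketch of L'Ecuyer's published MRG32k3a recurrence with a skip-ahead based on powers of the 3x3 transition matrices. It is not the GPU implementation referred to on this page. The class and function names are hypothetical, and the default seed of 12345 is only a conventional choice. The idea is that each thread jumps to its own offset in the sequence and then generates its numbers independently.

```python
"""Sketch of L'Ecuyer's MRG32k3a generator with skip-ahead (illustrative only)."""

M1 = 2**32 - 209      # modulus of the first component
M2 = 2**32 - 22853    # modulus of the second component

# One-step transition matrices acting on the state (x[n-3], x[n-2], x[n-1]).
A1 = ((0, 1, 0), (0, 0, 1), (-810728 % M1, 1403580, 0))
A2 = ((0, 1, 0), (0, 0, 1), (-1370589 % M2, 0, 527612))
IDENTITY = ((1, 0, 0), (0, 1, 0), (0, 0, 1))


def mat_mul(a, b, m):
    """Product of two 3x3 matrices modulo m."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(3)) % m for j in range(3))
        for i in range(3)
    )


def mat_pow(a, n, m):
    """a**n modulo m by repeated squaring, so skipping n steps costs O(log n)."""
    result = IDENTITY
    while n > 0:
        if n & 1:
            result = mat_mul(result, a, m)
        a = mat_mul(a, a, m)
        n >>= 1
    return result


def mat_vec(a, v, m):
    """Product of a 3x3 matrix and a 3-vector modulo m."""
    return tuple(sum(a[i][k] * v[k] for k in range(3)) % m for i in range(3))


class Mrg32k3a:
    """Hypothetical wrapper holding the two 3-element component states."""

    def __init__(self, seed1=(12345, 12345, 12345), seed2=(12345, 12345, 12345)):
        self.s1 = tuple(seed1)
        self.s2 = tuple(seed2)

    def skip_ahead(self, n):
        """Advance the state by n steps without generating the outputs."""
        self.s1 = mat_vec(mat_pow(A1, n, M1), self.s1, M1)
        self.s2 = mat_vec(mat_pow(A2, n, M2), self.s2, M2)

    def next_uniform(self):
        """Advance one step and return a uniform value in the open interval (0, 1)."""
        self.s1 = mat_vec(A1, self.s1, M1)
        self.s2 = mat_vec(A2, self.s2, M2)
        z = (self.s1[2] - self.s2[2]) % M1
        if z == 0:
            z = M1
        return z / (M1 + 1)


if __name__ == "__main__":
    # Thread t of a hypothetical parallel run starts 1000*t steps into the sequence.
    for t in range(4):
        gen = Mrg32k3a()
        gen.skip_ahead(1000 * t)
        print(t, gen.next_uniform())
```

In this sketch the skip-ahead cost grows only logarithmically with the offset, so the streams can be set up cheaply even for a very large number of threads.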
In the summer of 2007, a visiting student, Su Xiaoke, worked with me to investigate the performance of NVIDIA GPUs on a LIBOR market model Monte Carlo application. Using an NVIDIA 8800 GTX graphics card with 128 cores, we achieved a speedup of over 100 relative to a single Xeon core.
An updated version of this code uses the mrg32k3a random number generator described above, and includes a comparison between the output of the CPU and GPU code to show that the results are identical to machine precision.
Current NVIDIA GPUs have double precision support, but it is 2-4 times slower than single precision. Similarly, when using SSE vectorisation on Intel CPUs, double precision is 2 times slower than single precision.
Many in the finance sector consider single precision to be inadequate, but my own view is that it is perfectly adequate for Monte Carlo applications except when computing sensitivities ("Greeks") by finite difference perturbation ("bumping").
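The difficulty with bumping can be seen in a small Python sketch. It is a toy example, not a Monte Carlo calculation: `math.exp` stands in for a price as a function of one input, the bump size of 1e-4 is an arbitrary assumption, and single precision is emulated by rounding the two function values with the `struct` module. The helper names are hypothetical.

```python
import math
import struct


def to_f32(x):
    """Round a Python float (double precision) to the nearest single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def bumped_derivative(f, x, h, single=False):
    """Central finite difference; optionally round the two function values to single precision."""
    up, down = f(x + h), f(x - h)
    if single:
        up, down = to_f32(up), to_f32(down)
    return (up - down) / (2.0 * h)


exact = math.exp(1.0)  # the derivative of exp at x = 1
err_double = abs(bumped_derivative(math.exp, 1.0, 1e-4) - exact)
err_single = abs(bumped_derivative(math.exp, 1.0, 1e-4, single=True) - exact)

if __name__ == "__main__":
    print("error with double precision values:", err_double)
    print("error with single precision values:", err_single)
```

The two bumped values agree in most of their leading digits, so their difference retains only a few significant digits of the single precision values, and dividing by the small bump amplifies the rounding error.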
It is also important to use either a double precision accumulator or some form of binary tree summation to minimise the accumulation of roundoff error when averaging the payoffs from a very large number of paths. The links below give a test implementation of a binary tree summation, and a reference on the error analysis of related methods:
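Separately from the linked test implementation, the following Python sketch compares a naive single precision running sum with a binary tree (pairwise) summation. Single precision is emulated by rounding each partial sum with the `struct` module, and the function names are hypothetical. The payoffs here are simply many copies of 0.1, chosen as an assumption to make the effect easy to see.

```python
import math
import struct


def f32(x):
    """Round a Python float (double precision) to the nearest single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def naive_sum_f32(values):
    """Running sum in which every partial sum is rounded to single precision."""
    total = 0.0
    for v in values:
        total = f32(total + v)
    return total


def tree_sum_f32(values, lo=0, hi=None):
    """Binary tree summation: add the sums of the two halves, rounding to single precision."""
    if hi is None:
        hi = len(values)
    n = hi - lo
    if n == 0:
        return 0.0
    if n == 1:
        return values[lo]
    mid = lo + n // 2
    return f32(tree_sum_f32(values, lo, mid) + tree_sum_f32(values, mid, hi))


def double_accumulator_sum(values):
    """Alternative: keep the inputs in single precision but accumulate in double precision."""
    total = 0.0
    for v in values:
        total += v
    return total


if __name__ == "__main__":
    payoffs = [f32(0.1)] * 65536
    reference = math.fsum(payoffs)  # exactly rounded double precision sum
    print("naive error :", abs(naive_sum_f32(payoffs) - reference))
    print("tree error  :", abs(tree_sum_f32(payoffs) - reference))
    print("double error:", abs(double_accumulator_sum(payoffs) - reference))
```

In the naive sum each small payoff is added to an ever larger total, so the rounding error per addition grows with the number of paths. In the tree sum the operands at each level have similar magnitude, which keeps the error growth much slower.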
As a first experiment with a finite difference application, in early 2008 I wrote a 3D Laplace solver using simple Jacobi iteration. Using a low-cost 8800GT card, Gerd Heber and I achieved a speedup of a factor of 50 relative to a single thread on a Xeon, and a factor of 10 compared to 8 threads running on two quad-core Xeons:
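For readers unfamiliar with the method, here is a minimal serial Python sketch of Jacobi iteration for the 3D Laplace equation on a small cubic grid. It is not the GPU code: the function names, the nested-list storage and the grid size are all assumptions made for illustration. Each interior point is replaced by the average of its six neighbours, using only values from the previous iterate, which is why every point can be updated independently in parallel.

```python
def make_grid(n, boundary_fn):
    """n x n x n grid: boundary points set from boundary_fn(i, j, k), interior points set to zero."""
    def on_boundary(i, j, k):
        return 0 in (i, j, k) or n - 1 in (i, j, k)
    return [[[float(boundary_fn(i, j, k)) if on_boundary(i, j, k) else 0.0
              for k in range(n)] for j in range(n)] for i in range(n)]


def jacobi_sweep(u):
    """Return a new grid after one Jacobi sweep; boundary values are left unchanged."""
    n = len(u)
    new = [[row[:] for row in plane] for plane in u]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                # Average of the six nearest neighbours, all read from the old grid.
                new[i][j][k] = (u[i - 1][j][k] + u[i + 1][j][k]
                                + u[i][j - 1][k] + u[i][j + 1][k]
                                + u[i][j][k - 1] + u[i][j][k + 1]) / 6.0
    return new


def jacobi_solve(u, sweeps):
    """Apply a fixed number of Jacobi sweeps."""
    for _ in range(sweeps):
        u = jacobi_sweep(u)
    return u


if __name__ == "__main__":
    # Boundary data taken from the harmonic function x + 2y + 3z.
    grid = make_grid(6, lambda i, j, k: i + 2 * j + 3 * k)
    solution = jacobi_solve(grid, 200)
    print(solution[2][3][4])
```

Because a sweep reads only the old grid and writes only the new one, a GPU version can assign one thread to each grid point without any synchronisation inside a sweep.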
A new version of the code, which uses texture mapping, achieves slightly poorer performance but is much simpler. This may be an excellent approach for applications in which there is more computation per grid point, so that the performance penalty is minimal.
Following on from this earlier work, I developed a generic 3D ADI solver for three-factor finance applications. As well as demonstrating the parallel solution of the sets of tridiagonal equations which arise from the ADI time-marching, this work also demonstrates my interest in developing high-level packages which can be used without a detailed understanding of CUDA programming. The user supplies a C routine which defines the drift, volatility, correlation and source functions of the 3D PDE, and the package then carries out the parallel execution.
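As background, each ADI sub-step gives one independent tridiagonal system per grid line in the direction being treated implicitly, and it is this independence that allows the systems to be solved in parallel. The Python sketch below solves such systems with the standard Thomas algorithm. This is an assumption for illustration and not necessarily the method used inside the package. The function names are hypothetical, and the loop in `solve_many` stands in for what would be one GPU thread per system.

```python
def solve_tridiagonal(a, b, c, d):
    """Solve one tridiagonal system by the Thomas algorithm.

    a: sub-diagonal (a[0] is unused), b: main diagonal,
    c: super-diagonal (c[-1] is unused), d: right-hand side.
    All four lists have the same length; no pivoting is performed,
    so the matrix is assumed to be diagonally dominant.
    """
    n = len(d)
    cp = [0.0] * n
    dp = [0.0] * n
    # Forward elimination.
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    # Back substitution.
    x = [0.0] * n
    x[n - 1] = dp[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


def solve_many(systems):
    """Solve a collection of independent systems, each given as a tuple (a, b, c, d)."""
    return [solve_tridiagonal(a, b, c, d) for (a, b, c, d) in systems]


if __name__ == "__main__":
    n = 5
    a, b, c = [-1.0] * n, [4.0] * n, [-1.0] * n
    d = [1.0] * n
    print(solve_many([(a, b, c, d), (a, b, c, [2.0 * v for v in d])]))
```

In a 3D problem the number of such systems at each sub-step equals the number of grid lines, which is what provides enough independent work to occupy the GPU.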
With help from visiting students Abinash Pati and Vignesh Sunderam in the summer of 2008, I achieved a speedup of a factor of 30 on an 8800GT relative to a single thread on a Xeon:
For other information about the use of GPUs in the finance industry and within Oxford University, please see my homepage.