

# Learning outcomes

In this lecture we will look at the current landscape of accelerated computing: hardware and software trends, and potential future directions.































# Frontier – The world's first exaFLOP machine

Hosted at the Oak Ridge Leadership Computing Facility (OLCF) in Tennessee, Frontier is the world's only exaFLOP supercomputer.

It was delivered in partnership with HPE (Cray) and was also the world's "greenest" supercomputer when it became operational in May 2022. https://www.top500.org/lists/green500/2022/06/

Great presentation by Bronson Messer (Director of Science):

http://www.phys.utk.edu/archives/colloquium/2022/10-03-messer.pdf



Image: By OLCF at ORNL - https://www.flickr.com/photos/olcf/52117623843/, CC BY 2.0, https://commons.wikimedia.org/w/index.php?curid=119231238




# Frontier – Specs

- 9,472 AMD EPYC "Trento" 64-core 2 GHz CPUs.
- 37,888 AMD Instinct MI250X GPUs.
- HPE Slingshot interconnect.
- Frontier is liquid-cooled, allowing 5x the density of an air-cooled architecture.
- Each rack holds 64 blades; each blade has two nodes.
- A node consists of one CPU, 4x GPUs (each having 128 GB of memory), 512 GB of RAM and 4 TB of flash memory.
- Power draw: 21 megawatts.
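
As a quick sanity check, the figures in this list can be cross-checked against each other. The sketch below is only back-of-the-envelope arithmetic on the numbers quoted above; the constant and function names are made up for illustration, and the rack count is derived from the list rather than taken from any official source.

```cpp
#include <cstdint>

// Figures quoted in the spec list (names are illustrative only).
constexpr std::int64_t kNodes         = 9472;  // one CPU per node
constexpr std::int64_t kGpusPerNode   = 4;
constexpr std::int64_t kGpuMemGB      = 128;   // memory per GPU
constexpr std::int64_t kNodesPerBlade = 2;
constexpr std::int64_t kBladesPerRack = 64;

// Total GPUs implied by the node count.
constexpr std::int64_t total_gpus() { return kNodes * kGpusPerNode; }

// Aggregate GPU memory across the machine, in GB.
constexpr std::int64_t total_gpu_memory_gb() { return total_gpus() * kGpuMemGB; }

// Racks needed to house every node (rounded up).
constexpr std::int64_t racks_needed() {
    return (kNodes + kNodesPerBlade * kBladesPerRack - 1) /
           (kNodesPerBlade * kBladesPerRack);
}

// The derived GPU count agrees with the 37,888 quoted in the list.
static_assert(total_gpus() == 37888, "GPU count should match the spec list");
```

With these inputs, `total_gpu_memory_gb()` comes to 4,849,664 GB (roughly 4.8 PB of GPU memory) and `racks_needed()` comes to 74.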

https://docs.olcf.ornl.gov/systems/frontier\_user\_guide.html



# AMD as a solution? Hardware

We see from the changes in the TOP500 that AMD GPUs are now gaining traction in HPC and scientific computing.

This is because, when the total cost of ownership was considered for both Frontier and LUMI, it was decided that AMD GPUs would be more cost-effective.

The DoE spent approximately 1/3 of its budget on hardware; the other 2/3 went on software porting and running costs.

A bit more on the MI250X that Frontier uses: 2x 64 GB of HBM2e, 3.2 TB/s memory bandwidth, 48 TFLOP/s (fp32 and fp64) and a 500 W TDP.
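
To get a feel for what these numbers mean in practice, here is a rough sketch using a simple roofline-style estimate (performance is limited either by peak compute or by memory bandwidth times arithmetic intensity). The roofline model is brought in here purely as an illustration, and all names are hypothetical; only the per-GPU figures and the Frontier GPU count come from these slides.

```cpp
// Per-GPU figures quoted above for the MI250X (names are illustrative only).
constexpr double    kPeakTflops   = 48.0;   // TFLOP/s
constexpr double    kBandwidthTBs = 3.2;    // TB/s
constexpr double    kTdpWatts     = 500.0;  // W
constexpr long long kFrontierGpus = 37888;  // from the Frontier spec list

// Roofline estimate: capped by memory traffic or by peak compute.
constexpr double attainable_tflops(double flops_per_byte) {
    return (kBandwidthTBs * flops_per_byte < kPeakTflops)
               ? kBandwidthTBs * flops_per_byte
               : kPeakTflops;
}

// Arithmetic intensity (FLOP/byte) above which a kernel is compute-bound.
constexpr double ridge_point() { return kPeakTflops / kBandwidthTBs; }

// Peak compute per watt of TDP, in GFLOP/s per watt.
constexpr double peak_gflops_per_watt() { return kPeakTflops * 1000.0 / kTdpWatts; }

// Sum of per-GPU peaks across Frontier, in EFLOP/s (1 EFLOP = 1e6 TFLOP).
constexpr double system_peak_eflops() { return kFrontierGpus * kPeakTflops / 1.0e6; }
```

Under this model a kernel needs roughly 15 FLOPs per byte moved before it can approach peak. A streaming fp64 update such as `y = a*x + y` (2 FLOPs per 24 bytes) would be limited to about 0.27 TFLOP/s, well under 1% of the quoted peak. The summed peak of about 1.8 EFLOP/s is a theoretical upper bound, not a measured benchmark figure.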



The new MI300A will be used in the 2 EFLOP El Capitan machine.

https://www.amd.com/en/products/specifications/professional-graphics/4476, 19496







# Heterogeneous-Compute Interface for Portability (HIP)

**HIP is AMD's "version" of CUDA**. It's a kernel language that looks, in many parts, similar to CUDA.

It aims to let you create portable applications, so when you write in HIP your code will be able to run not only on AMD GPUs but on NVIDIA GPUs too (at least that's the aim, just like OpenCL...).

## AMD's claims:

- HIP has little (or no) performance impact compared to coding directly in CUDA.
- HIP allows coding in a single-source C/C++ programming language.
- The HIPIFY tools automatically convert most source from CUDA to HIP.
- Developers can specialize for the platform (CUDA or AMD) to tune for performance or handle tricky cases.



https://github.com/ROCm-Developer-Tools/HIP

https://www.youtube.com/watch?v=hSwgh-BXx3i

https://www.lumi-supercomputer.eu/preparing-codes-for-lumi-converting-cuda-applications-to-hip


## Let's look at some HIP (the main() code) ...

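    // Device-side buffers for the input and output strings.
    char* inputBuffer;
    char* outputBuffer;

    // Allocate device memory (+1 for the terminating '\0').
    hipMalloc((void**)&inputBuffer, (strlength + 1) * sizeof(char));
    hipMalloc((void**)&outputBuffer, (strlength + 1) * sizeof(char));

    // Copy the input string from host to device.
    hipMemcpy(inputBuffer, input, (strlength + 1) * sizeof(char), hipMemcpyHostToDevice);

    // (The kernel launch sits between the two copies in the full example.)

    // Copy the result back from device to host.
    hipMemcpy(output, outputBuffer, (strlength + 1) * sizeof(char), hipMemcpyDeviceToHost);

    // Release the device buffers.
    hipFree(inputBuffer);
    hipFree(outputBuffer);

https://github.com/ROCm-Developer-Tools/HIP-Examples/blob/master/HIP-Examples-Applications/HelloWorld/HelloWorld.cpp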



## Let's look at some HIP (the kernel code) ...

It all looks rather familiar, almost like someone has done a global "find cuda replace with hip" ...

https://github.com/ROCm-Developer-Tools/HIP-Examples/blob/master/HIP-Examples-Applications/HelloWorld/HelloWorld.cpp
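
The kernel itself is not reproduced in these notes. As a rough, CPU-only illustration of the pattern a simple one-dimensional GPU kernel follows, the sketch below emulates a launch in plain C++: each "thread" computes a global index from its block and thread indices (on a GPU this would be `blockIdx.x * blockDim.x + threadIdx.x`) and handles one element. The per-character operation (adding 1 to each character) and every function name here are assumptions chosen for illustration, not the actual HIP source.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Stand-in for the body of a 1-D kernel: one "thread" handles one element.
inline void kernel_body(std::size_t block_idx, std::size_t block_dim,
                        std::size_t thread_idx,
                        const char* in, char* out, std::size_t n) {
    const std::size_t i = block_idx * block_dim + thread_idx;  // global index
    if (i < n) {                                   // guard threads past the end
        out[i] = static_cast<char>(in[i] + 1);     // assumed per-element work
    }
}

// Serial emulation of launching enough blocks of block_dim threads to cover
// the input. On a GPU these iterations would run in parallel.
inline std::string emulate_launch(const std::string& input, std::size_t block_dim) {
    assert(block_dim > 0);
    const std::size_t n = input.size();
    const std::size_t grid_dim = (n + block_dim - 1) / block_dim;  // round up
    std::string output(n, '\0');
    for (std::size_t b = 0; b < grid_dim; ++b) {
        for (std::size_t t = 0; t < block_dim; ++t) {
            kernel_body(b, block_dim, t, input.data(), &output[0], n);
        }
    }
    return output;
}
```

For example, `emulate_launch("GdkknVnqkc", 4)` shifts every character up by one and returns `"HelloWorld"`. The bounds check matters because the last block usually has more threads than remaining elements.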


# **HIPIFY**

HIPIFY is a set of scripts that will (try to) translate your CUDA source code into HIP automagically for you.

The scripts are based on Perl and Clang.

Jack tried to take our AstroAccelerate code base (admittedly it is large and in parts quite complicated) and use HIPIFY to generate an AMD executable.

He wasn't able to (through no fault of his own!). When Jack emailed support he was pointed to the git repo and asked to raise an issue.

So there is some work to do before this is truly automagical.

Supported CUDA APIs:

- Runtime API
- Driver API
- cuComplex API
- Device API
- RTC API
- cuBLAS
- cuRAND
- cuDNN
- cuFFT
- cuSPARSE
- CUB

https://github.com/ROCm-Developer-Tools/HIPIFY
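
To make the "find cuda, replace with hip" idea concrete, here is a deliberately naive toy in plain C++. It only renames identifiers that begin with `cuda` and is in no way how the real HIPIFY tools are implemented: they rely on proper API mapping tables, and the Clang-based tool uses a compiler front end. The function names are hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// True if c can appear inside a C/C++ identifier.
inline bool is_ident_char(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) != 0 || c == '_';
}

// Toy translation: rewrite "cuda" to "hip" wherever it starts an identifier.
inline std::string toy_hipify(const std::string& src) {
    const std::string from = "cuda";
    const std::string to = "hip";
    std::string out;
    std::size_t pos = 0;
    for (;;) {
        const std::size_t hit = src.find(from, pos);
        if (hit == std::string::npos) {
            out.append(src, pos, std::string::npos);  // copy the tail
            break;
        }
        out.append(src, pos, hit - pos);              // copy text before the match
        const bool starts_identifier =
            (hit == 0) || !is_ident_char(src[hit - 1]);
        out += starts_identifier ? to : from;         // leave e.g. "barracuda" alone
        pos = hit + from.size();
    }
    assert(out.size() <= src.size());  // "hip" is shorter than "cuda"
    return out;
}
```

Fed the line `cudaMemcpy(d, h, n, cudaMemcpyHostToDevice);` it produces `hipMemcpy(d, h, n, hipMemcpyHostToDevice);`, which is a valid HIP call. A large code base leans on libraries and corner cases where a simple rename is not enough, which is where the difficulties described above come from.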



# Intel Xe-HPC – Ponte Vecchio GPU

The Ponte Vecchio GPU is used in Aurora.

Intel specs are: 45 TFLOP/s, 5 TB/s bandwidth and 2 TB/s connectivity (I think this is Xe Link).

Tests show that for some applications this reaches about 80% of the performance of an A100.

We should keep in mind, though, that the A100 is four years old. NVIDIA's current flagship GPU is the H100, soon to be replaced by Blackwell.



# oneAPI

oneAPI is Intel's answer to HIP. It is an open standard and aims to deliver a single unified API that can be used across all of Intel's products, from FPGAs to GPUs to CPUs.

It aims to go further than just Intel products. oneAPI has some functionality for both NVIDIA and AMD GPUs (via Codeplay plugins).

This work is part of Intel's plan to make oneAPI the preferred alternative for heterogeneous, parallel programming.

One ring to rule them all...



https://www.intel.com/content/www/us/en/developer/tools/oneapi/code-samples.html#gs.2twqvx


# Graphcore

Graphcore produce the Colossus Intelligence Processing Unit (IPU).

The Mark 2 IPU was released in 2020. The system design is aimed at sparse problems and has a memory system that is ideal for large AI models.

The Mark 3 IPU is still in development, aiming to double the performance of the Mark 2 IPU.

For certain application spaces Graphcore products are more than competitive with NVIDIA GPUs.

Graphcore Colossus Mk2 (from the figure):

- 59.4Bn transistors, TSMC 7nm @ 823 mm²
- 250 TFLOPS AI-Float | 900 MB In-Processor-Memory
- 1472 independent processor cores
- 8832 separate parallel threads
- >8x step-up in system performance vs Mk1

|              | 8x A100    | Graphcore  | Graphcore advantage |
|--------------|------------|------------|---------------------|
| FP32 compute | 156 TFLOP  | 2 PFLOP    | >12x                |
| AI compute   | 2.5 PFLOP  |            | >3x                 |
| AI memory    | 320 GB     | 3.6 TB     | >10x                |
| System price | \$199,000  | \$259,600  |                     |









# NVIDIA – Grace-Hopper

Grace-Hopper is NVIDIA's answer to the likes of Cerebras and Graphcore. The "Superchip" combines a Grace CPU and a Hopper GPU using NVLink-C2C to deliver a CPU+GPU coherent memory model. It is the fruition of Project Denver, begun by NVIDIA circa 2014.

This kind of design will be crucial in progressing exascale computing in the years to come.

Whitepaper: https://resources.nvidia.com/en-us-gracecpu/nvidia-grace-hopper






NVIDIA Grace + Hopper:

- 72x Arm Neoverse V2 cores (4×128-bit SIMD units per core).
- Up to 117 MB of L3 Cache.
- Up to 512 GB of LPDDR5X memory (546 GB/s of memory bandwidth).
- Up to 64x PCIe Gen5 lanes.
- NVIDIA Scalable Coherency Fabric (SCF) mesh and distributed cache with up to 3.2 TB/s memory bandwidth.
- NVIDIA Hopper GPU.
- NVIDIA NVLink-C2C: up to 900 GB/s total bandwidth.
- Unified address space: each Hopper GPU can address up to 608 GB of memory within a superchip.
- NVIDIA NVLink Switch System connects up to 256x NVIDIA Grace Hopper Superchips using NVLink 4.
- Each NVLink-connected Hopper GPU can address all HBM3 and LPDDR5X memory of all superchips in the network, for up to 150 TB of GPU addressable memory.
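
The memory figures in this list hang together arithmetically, which the short sketch below checks. It uses only numbers quoted above; the names are illustrative, and reading the 608 GB per-superchip figure as LPDDR5X plus GPU-side memory is an inference from the list, not a statement from the whitepaper.

```cpp
// Figures quoted in the list above (names are illustrative only).
constexpr long long kSuperchips           = 256;  // max on one NVLink Switch System
constexpr long long kAddressablePerChipGB = 608;  // per superchip
constexpr long long kLpddr5xGB            = 512;  // max CPU-side memory

// Memory left over once the CPU-side LPDDR5X is subtracted; assumed here
// to be the GPU-side HBM3.
constexpr long long gpu_side_memory_gb() {
    return kAddressablePerChipGB - kLpddr5xGB;
}

// Total memory addressable across a fully populated NVLink network, in GB.
constexpr long long network_memory_gb() {
    return kSuperchips * kAddressablePerChipGB;
}
```

This gives 96 GB of GPU-side memory per superchip and 155,648 GB across 256 superchips. Counting 1 TB as 1024 GB, that is 152 TB, in line with the "up to 150 TB" figure quoted above.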












# The future?

Given the cost of NVIDIA hardware and the shortage of supply, it's likely that AMD will gain a growing fraction of the accelerator market, especially given that they seem to be following (very closely!) NVIDIA's strategy: a great software ecosystem.

The HPE El Capitan supercomputer, due to be delivered in Q4 2024 and hosted at Lawrence Livermore National Laboratory, will be a 2+ exaFLOP machine and will displace Frontier as the world's fastest supercomputer.

- It's based on ... AMD.



# Summary

This lecture has looked at some present alternatives to NVIDIA and CUDA. We've also taken a look at some upcoming technologies, both software and hardware, that might be worth watching out for over the coming years.

Lots of what you have learnt this week is transferable!

Also keep an eye on Mike's computing webpage here:

https://people.maths.ox.ac.uk/gilesm/computing.html


