GPU programming concepts

Author

Marie-Hélène Burle

This section covers a few basic concepts of GPU programming so that you get a better idea of what is going on behind the scenes when you use high-level GPU frameworks.

CPU vs GPU architecture

SM: Streaming multiprocessors
Hardware units that execute thread blocks (each thread block is itself a collection of threads).

GPC: Graphics Processing Clusters
Collections of SMs.

DRAM: Dynamic random access memory or dynamic RAM
The DRAM of a GPU is called the global memory because it is accessible to all SMs in that GPU.
But it is only “global” to that GPU.

Caches

Caches on chips are hierarchical.

Cache	Speed	In CPUs	In GPUs
L1	Fastest	One inside each CPU core	One per SM
L2	A little slower	One near each CPU core¹	A single L2 shared among all SMs
L3	Even slower²	One shared among all cores	GPUs do not have an L3 cache

¹ They serve as a buffer between L1s and the L3.

² But still much faster than the DRAM.

Programming model

The CUDA programming guide is a great place to learn the principles of CUDA programming. Alternative kernel-based frameworks (e.g. HIP, OpenCL) work in a similar way.

Host/device

Terminology	Explanation
Host	CPU
Host memory or system memory	CPU memory
Current device	GPU currently selected to run operations
Device memory	GPU memory
Device code	Code running on GPU
Kernel	Function executed on GPU

Execution

CUDA follows a heterogeneous programming model in which it manages both CPU and GPU memory.

Here is how execution happens:

The code starts on the host.
CUDA APIs copy data from the host memory to the current device memory.
Kernels (GPU functions) are executed in parallel on the device: many GPU threads run computations in an SIMT (Single Instruction, Multiple Threads) fashion.
Finally, CUDA APIs copy the data back to the host memory.
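The four steps above can be sketched in plain Python. This is only a model under stated assumptions, not CUDA code: the lists `host_x` and `device_x` stand in for host and device memory, `list(...)` stands in for the copy APIs, and the hypothetical `launch` helper runs the kernel once per thread index in a sequential loop, whereas a real GPU would run those threads in parallel.

```python
# A pure-Python model of the host/device execution flow (not real CUDA).
# All names here (add_one_kernel, launch, host_x, device_x) are hypothetical.


def add_one_kernel(block_idx, block_dim, thread_idx, data, n):
    """Body executed by every thread: each thread handles one element."""
    i = block_idx * block_dim + thread_idx  # global index of this thread
    if i < n:  # threads past the end of the data do nothing
        data[i] += 1


def launch(kernel, grid_dim, block_dim, *args):
    """Model of a kernel launch: run the kernel once per (block, thread) pair.

    A real GPU runs these threads in parallel; this loop is sequential.
    """
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


# 1. The code starts on the host, with data in host memory.
host_x = list(range(10))
n = len(host_x)

# 2. Copy the data from host memory to (modelled) device memory.
device_x = list(host_x)

# 3. Launch the kernel with enough blocks to cover all n elements.
block_dim = 4
grid_dim = (n + block_dim - 1) // block_dim  # ceiling division
launch(add_one_kernel, grid_dim, block_dim, device_x, n)

# 4. Copy the result back to host memory.
host_result = list(device_x)
print(host_result)
```

The bounds check `if i < n` matters because the launch creates 3 blocks of 4 threads (12 threads) for only 10 elements: the last two threads have nothing to do.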

Threads/warps/blocks/grids

Applications launch kernels on vast numbers of threads.

These threads are organized into groups of 32 called warps. Warps are grouped into thread blocks, thread blocks are optionally grouped into clusters, and all the thread blocks (or clusters) of a kernel launch form a grid.
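To make this hierarchy concrete, here is a small sketch in plain Python (not CUDA code) that finds where a given thread sits. It assumes a one-dimensional launch in which threads are numbered consecutively across the grid and ignores clusters; the function name `locate_thread` is hypothetical.

```python
WARP_SIZE = 32  # number of threads per warp


def locate_thread(global_idx, threads_per_block):
    """Return (block, warp within block, lane within warp) for a 1D launch."""
    block, thread_in_block = divmod(global_idx, threads_per_block)
    warp, lane = divmod(thread_in_block, WARP_SIZE)
    return block, warp, lane


# With 256 threads per block, thread 300 of the grid is thread 44 of block 1,
# which is lane 12 of that block's warp 1.
print(locate_thread(0, 256))
print(locate_thread(300, 256))
```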

Warps execute code following a Single-Instruction Multiple-Threads (SIMT) paradigm.

Because there are 32 threads per warp, it is best to specify thread blocks with a number of threads which is a multiple of 32. Otherwise the last warp of the thread block will have some lanes that are unused, leading to suboptimal GPU utilization.

This is one reason why batch sizes in deep learning are often multiples of 32.

This is also why choosing 256 threads per block is so common in GPU programming (256 is a multiple of 32).
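The cost of a block size that is not a multiple of 32 can be worked out with a little arithmetic. The sketch below is plain Python rather than CUDA code, and the helper name `warp_usage` is hypothetical: it counts how many warps a block needs and how many lanes of the last warp stay idle.

```python
WARP_SIZE = 32  # number of threads per warp


def warp_usage(threads_per_block):
    """Return (number of warps, number of idle lanes) for one thread block."""
    warps = -(-threads_per_block // WARP_SIZE)  # ceiling division
    idle_lanes = warps * WARP_SIZE - threads_per_block
    return warps, idle_lanes


for size in (256, 100, 33):
    warps, idle = warp_usage(size)
    print(f"{size} threads per block: {warps} warps, {idle} idle lanes")
```

A block of 256 threads fills 8 warps exactly, while a block of 33 threads needs 2 warps and leaves 31 of the 64 lanes unused.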

Scheduling

All threads of a thread block are executed on the same SM and can thus communicate through that SM's shared memory, a fast on-chip memory that sits alongside the L1 cache.

All thread blocks in a cluster are executed in the same GPC and can communicate and synchronize with each other using CUDA software interfaces.

Multiple GPUs

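When you have multiple GPUs on a computer or compute node, each GPU is assigned an index starting at 0. Tools such as nvidia-smi order the GPUs by their PCIe bus ID: the GPU with the lowest bus ID is GPU 0, the next is GPU 1, and so on. By default, however, the CUDA runtime numbers devices from fastest to slowest, so the two numberings can differ. Setting the environment variable CUDA_DEVICE_ORDER to PCI_BUS_ID makes CUDA follow the PCIe bus order as well.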

By default, the current device is GPU 0, but you can set the current device to another GPU programmatically.
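Besides selecting the current device in code, a common way to control which GPU a program uses is through the environment variables `CUDA_DEVICE_ORDER` and `CUDA_VISIBLE_DEVICES`, which the CUDA runtime reads when it initializes. The sketch below assumes a machine with at least two GPUs and sets both variables from Python; the choice of GPU `1` is only an example.

```python
import os

# These must be set before the CUDA runtime initializes in this process,
# so it is safest to set them before importing any GPU framework.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"  # number GPUs by PCIe bus ID
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # expose only physical GPU 1

# A framework initialized after this point sees a single GPU,
# which it numbers as device 0.
print(os.environ["CUDA_DEVICE_ORDER"], os.environ["CUDA_VISIBLE_DEVICES"])
```

Because the only visible GPU is renumbered as device 0, the rest of the program can keep using the default current device without any change.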

Streams

In CUDA, a stream is a queue of operations to perform on the GPU.

Operations placed in the same stream execute sequentially, in the order in which they were added to the queue.

Note that this does not mean that the code isn’t running in parallel! Each operation within the queue runs on a vast number of threads.

If you launch operations on multiple streams, they can be executed concurrently.
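The ordering rules of streams can be illustrated with a toy model in plain Python. This is not the CUDA stream API: the class `ModelStream` and the operation names are hypothetical, each stream is modelled as a FIFO queue served by one worker thread, and "executing" an operation simply records it in a shared log.

```python
import queue
import threading


class ModelStream:
    """Hypothetical stand-in for a stream: a FIFO queue served by one worker."""

    def __init__(self, name, log):
        self.name = name
        self.log = log
        self.ops = queue.Queue()
        self.worker = threading.Thread(target=self._run)
        self.worker.start()

    def _run(self):
        while True:
            op = self.ops.get()
            if op is None:  # sentinel: no more operations
                break
            self.log.append((self.name, op))  # "execute" the operation

    def enqueue(self, op):
        self.ops.put(op)

    def synchronize_and_close(self):
        """Wait until every queued operation has run, then stop the worker."""
        self.ops.put(None)
        self.worker.join()


log = []
s1 = ModelStream("stream1", log)
s2 = ModelStream("stream2", log)
for op in ("copy_in", "kernel", "copy_out"):
    s1.enqueue(op)
    s2.enqueue(op)
s1.synchronize_and_close()
s2.synchronize_and_close()

# Within each stream the order is preserved; across streams it may interleave.
per_stream = {
    name: [op for stream, op in log if stream == name]
    for name in ("stream1", "stream2")
}
print(per_stream)
```

Each stream always reports its operations in the order they were queued, while the relative order of operations from the two streams in `log` can change from run to run.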