GPU programming concepts

Author

Marie-Hélène Burle

This section covers a few basic concepts of GPU programming so that you get a better idea of what is going on behind the scenes when you use high-level GPU frameworks.

CPU vs GPU architecture

SM: Streaming multiprocessors
Processing units that execute thread blocks (each thread block is itself a collection of threads).

GPC: Graphics Processing Clusters
Collections of SMs.

DRAM: Dynamic random access memory or dynamic RAM
The DRAM of a GPU is called the global memory because it is accessible to all SMs in that GPU.
But it is only “global” to that GPU.

Caches

Caches on chips are hierarchical.

| Cache | Speed           | In CPUs                    | In GPUs                          |
|-------|-----------------|----------------------------|----------------------------------|
| L1    | Fastest         | One inside each CPU core   | One per SM                       |
| L2    | A little slower | One near each CPU core¹    | A single L2 shared among all SMs |
| L3    | Even slower²    | One shared among all cores | GPUs do not have an L3 cache     |

¹ They serve as a buffer between the L1s and the L3.

² But still much faster than the DRAM.

Programming model

The CUDA programming guide is a great place to learn the principles of CUDA programming. Alternative kernel-based frameworks (e.g. HIP, OpenCL) work in a similar fashion.

Host/device

| Terminology                    | Explanation                  |
|--------------------------------|------------------------------|
| Host                           | CPU                          |
| Host memory (or system memory) | CPU memory                   |
| Current device                 | GPU currently in use         |
| Device memory                  | GPU memory                   |
| Device code                    | Code running on the GPU      |
| Kernel                         | Function executed on the GPU |

Execution

CUDA follows a heterogeneous programming model in which it manages both CPU and GPU memory.

Here is how execution happens:

  • The code starts on the host.
  • CUDA APIs copy data from the host memory to the current device memory.
  • Kernels (GPU functions) are executed in parallel on the device: many GPU threads run computations in an SIMT (Single Instruction, Multiple Threads) fashion.
  • Finally, CUDA APIs copy the data back to the host memory.
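The flow above can be sketched with a CPU-only Python analogue (no GPU required): plain lists stand in for host and device memory, a regular function stands in for the kernel, and the comments name the real CUDA API calls each step corresponds to.

```python
# CPU-only sketch of the CUDA execution flow: no GPU is involved,
# this only mirrors the structure of the four steps.

def kernel(device_data):
    # 3. Kernel: in real CUDA, each GPU thread would typically handle
    #    one element in parallel; a loop stands in for those threads.
    return [x * 2 for x in device_data]

# 1. The code starts on the host.
host_data = [1, 2, 3, 4]

# 2. Copy host -> device (cudaMemcpy with cudaMemcpyHostToDevice).
device_data = list(host_data)

# 3. Launch the kernel on the device.
device_result = kernel(device_data)

# 4. Copy device -> host (cudaMemcpy with cudaMemcpyDeviceToHost).
host_result = list(device_result)

print(host_result)  # [2, 4, 6, 8]
```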

Threads/warps/blocks/grids

Applications launch kernels on vast numbers of threads.

These threads are organized into groups of 32 called warps which are themselves organized into thread blocks which are optionally organized into clusters which are themselves organized into a grid.

Warps execute code following a Single-Instruction Multiple-Threads (SIMT) paradigm: the 32 threads of a warp execute the same instruction in lockstep.

Because there are 32 threads per warp, it is best to make the number of threads per block a multiple of 32. Otherwise, the last warp of each block will have some unused lanes, leading to suboptimal GPU utilization.

This is why batch sizes in deep learning are often multiples of 32.

This is also why choosing 256 threads per block is so common in GPU programming (256 is a multiple of 32).
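As a quick illustration of this arithmetic, here is a small Python sketch (the helper `launch_config` is hypothetical, not part of any CUDA API) that computes a 1D grid size for a given problem and the warp lanes wasted when the block size is not a multiple of 32:

```python
import math

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs

def launch_config(n_elements, threads_per_block=256):
    """Hypothetical helper: 1D grid size plus wasted-work accounting."""
    blocks = math.ceil(n_elements / threads_per_block)
    total_threads = blocks * threads_per_block
    idle_threads = total_threads - n_elements  # launched but given no work
    # Lanes left unused in the last warp of every block when the
    # block size is not a multiple of the warp size:
    unused_lanes = (-threads_per_block) % WARP_SIZE
    return blocks, idle_threads, unused_lanes

# 1000 elements with 256-thread blocks: 4 blocks, 24 idle threads,
# no wasted lanes (256 is a multiple of 32).
print(launch_config(1000, 256))  # (4, 24, 0)

# A 100-thread block needs 4 warps, wasting 28 lanes in the last one.
print(launch_config(1000, 100))  # (10, 0, 28)
```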

Scheduling

All threads of a thread block are executed on the same SM and can thus communicate using that SM's L1 cache.


All thread blocks in a cluster are executed in the same GPC and can communicate and synchronize with each other using CUDA software interfaces.

Multiple GPUs

When you have multiple GPUs on a computer or compute node, the NVIDIA driver queries the motherboard and orders the GPUs based on their PCIe Bus ID (their physical slot on the motherboard). The GPU in the lowest PCIe slot becomes physical GPU 0, the next becomes 1, and so on.
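This ordering rule amounts to a simple sort, illustrated below with made-up PCIe bus IDs (the IDs and GPU labels are hypothetical, chosen only for the example):

```python
# Hypothetical PCIe bus IDs for three GPUs on one node.
gpus = {
    "0000:81:00.0": "GPU A",
    "0000:01:00.0": "GPU B",
    "0000:41:00.0": "GPU C",
}

# Lowest PCIe bus ID becomes device 0, the next becomes 1, and so on.
ordering = sorted(gpus)
for index, bus_id in enumerate(ordering):
    print(f"device {index} -> {gpus[bus_id]} ({bus_id})")
# device 0 -> GPU B (0000:01:00.0)
# device 1 -> GPU C (0000:41:00.0)
# device 2 -> GPU A (0000:81:00.0)
```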

By default, the current device is GPU 0, but you can set the current device to another GPU programmatically.

Streams

In CUDA, a stream is a queue of operations to perform on the GPU.

Operations placed in the same stream execute sequentially, in the order in which they were added to the queue.

Note that this does not mean that the code isn’t running in parallel! Each operation within the queue runs on a vast number of threads.

If you launch operations on multiple streams, they can be executed concurrently.
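A toy Python model (plain threads and lists, no CUDA involved) can illustrate these semantics: each "stream" drains its own queue of operations in order, while the two streams make progress concurrently.

```python
from threading import Thread

def run_stream(ops, log, name):
    # Operations in one stream execute sequentially, in enqueue order.
    for op in ops:
        log.append((name, op))

log1, log2 = [], []
stream1 = Thread(target=run_stream,
                 args=(["copy in", "kernel", "copy out"], log1, "s1"))
stream2 = Thread(target=run_stream, args=(["kernel"], log2, "s2"))

# The two streams run concurrently...
stream1.start(); stream2.start()
stream1.join(); stream2.join()

# ...but each stream's log is always in its own enqueue order.
print(log1)  # [('s1', 'copy in'), ('s1', 'kernel'), ('s1', 'copy out')]
print(log2)  # [('s2', 'kernel')]
```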