A brief introduction to CUDA

Author

Marie-Hélène Burle

NVIDIA CUDA Toolkit (usually referred to as “CUDA”) is a proprietary set of APIs, tools, and libraries by NVIDIA that allows running parallel computations on graphics processing units (GPUs).

The toolkit includes a C/C++ compiler, debugging, benchmarking, and optimization tools, a runtime library, and many GPU-accelerated libraries.

Many open-source tools build on CUDA to provide GPU acceleration in languages such as Python or Julia.

History

The use of GPUs for mathematical calculations rather than image rendering (general-purpose computing on GPUs, or GPGPU) was a niche and difficult research endeavour that involved passing data through graphics shaders.

By providing a standard C++ programming model that removed the need to understand graphics APIs, CUDA brought GPGPU to the mainstream.

Support

Operating systems

| OS | CUDA support |
|---|---|
| Linux | Yes |
| Windows | Yes, with limitations, provided Microsoft Visual Studio (MSVS) is installed |
| WSL 2 | Yes, with limitations |
| macOS | No¹ |

¹ Legacy experimental support on older systems was abandoned.

Accelerators

| Accelerator | CUDA support |
|---|---|
| NVIDIA GPUs | Yes |
| Apple GPUs | No |
| AMD GPUs | Yes, but not natively² |
| TPUs | No³ |

² See below.

³ TPUs use XLA (Accelerated Linear Algebra) as their compiler backend.

Support for AMD GPUs

Three tools with different approaches exist to run CUDA code on AMD hardware:

| | HIP | ZLUDA | SCALE toolkit |
|---|---|---|---|
| Approach | hipify translates CUDA code into portable HIP code, then compiles it for the target GPU (AMD or NVIDIA) | Translates compiled CUDA code at runtime to execute on AMD GPUs | Compiles CUDA code directly into AMD GPU binaries with the SCALE compiler instead of nvcc |
| Code modification required | Yes | No | No |
| Recompilation required | Yes | No | Yes |
| Performance | Native | Near-native | Native |
| Mature | Yes | Yes | No |
| Library gap | Yes | Yes | Yes |
| Developer | AMD (ROCm ecosystem) | Community-driven | Spectral Compute (startup) |
| Open source | Yes | Yes | No |
| Pros | Native performance, officially supported | Very easy to use | Native performance, easier than HIP |
| Cons | Requires a lot of work | Lower performance; might violate NVIDIA’s End User License Agreement | Proprietary and less tested |

On Alliance supercomputers

The good news is that the Alliance supercomputers run Linux and most of their GPUs are NVIDIA (there are a few AMD GPUs on the Nibi cluster).

CUDA is installed as an Lmod module on the clusters:

```bash
# To see available versions:
module spider cuda

# To check the needed dependencies of a specific version (e.g. 12.6):
module spider cuda/12.6

# To load a version:
module load cuda/12.6
```

Alternatives to CUDA

| | Open | Usage | More info |
|---|---|---|---|
| ROCm | Yes | GPGPU | Part of the AMD ROCm stack for AMD GPUs |
| Triton | No (OpenAI) | Deep learning | Python-based domain-specific language (DSL) and compiler |
| OpenCL | Yes | GPGPU | Runs on most hardware, but with lower performance than CUDA and a smaller toolkit |
| SYCL | Yes (supported by Intel) | GPGPU | Runs on most hardware, but with lower performance than CUDA and a smaller toolkit |
| Vulkan | Yes | Graphics & GPGPU | Runs on most hardware, but with much lower performance for GPGPU than CUDA |
| OpenACC | Yes | Offloads loops to GPUs via compiler hints in the code | Runs on most hardware, but restricted to C, C++, and Fortran, with lower performance than CUDA. An easy solution for old codebases in compiled languages |

CUDA for Python

Low-level wrappers

There are two wrappers that provide low-level control of CUDA in Python: PyCUDA and NVIDIA’s own CUDA Python.

For new projects, and unless you are already familiar with PyCUDA, favour CUDA Python.

CUDA Python consists of a collection of packages.

You write CUDA kernels in C++ .cu files, compile them with nvcc, then execute them with cuda-python.
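As a minimal sketch of such a kernel (the file and function names are illustrative, not from any particular project):

```cuda
// add.cu -- illustrative element-wise addition kernel.
// Compile to PTX with: nvcc -ptx add.cu
extern "C" __global__ void add(const float *a, const float *b,
                               float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        out[i] = a[i] + b[i];
}
```

The resulting PTX module can then be loaded and launched from Python through the cuda-python driver API bindings (`cuModuleLoadData`, `cuModuleGetFunction`, `cuLaunchKernel`).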

High-level for NumPy

CuPy is a high-level drop-in replacement for NumPy and SciPy. It also lets you define custom CUDA kernels.
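The drop-in nature means that NumPy code often ports by swapping the import. A minimal sketch (requires the cupy package and an NVIDIA GPU, so it is not runnable on a CPU-only machine):

```python
# Minimal CuPy sketch: same API as NumPy, but arrays live on the GPU.
import cupy as cp

x = cp.arange(1_000_000, dtype=cp.float32)  # array allocated on the GPU
y = cp.sqrt(x).sum()                        # computed on the GPU
result = float(y)                           # copy the scalar back to the host
```

Use `cp.asnumpy()` to move whole arrays back to the host when needed.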

High-level for NumPy with JIT

Numba is a NumPy-aware optimizing compiler that uses LLVM to just-in-time (JIT) compile Python code.

The numba-cuda package by NVIDIA uses the low-level cuda-python bindings from CUDA Python under the hood and makes it possible to run Numba-compiled kernels on GPUs.
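A sketch of what writing such a kernel looks like (requires numba-cuda and an NVIDIA GPU; the kernel and variable names are illustrative):

```python
# Sketch of a Numba CUDA kernel: plain Python, JIT-compiled for the GPU.
import numpy as np
from numba import cuda

@cuda.jit
def add(a, b, out):
    i = cuda.grid(1)          # global thread index in a 1D grid
    if i < out.size:
        out[i] = a[i] + b[i]

n = 100_000
a = np.ones(n, dtype=np.float32)
b = np.ones(n, dtype=np.float32)
out = np.zeros(n, dtype=np.float32)

threads = 256
blocks = (n + threads - 1) // threads
add[blocks, threads](a, b, out)  # Numba copies the arrays to/from the GPU
```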

High-level for DataFrames

RAPIDS cuDF is a library for running Python DataFrame operations on GPUs. It can be used in three ways:

| | cuDF | cuDF pandas module | Polars with GPU engine |
|---|---|---|---|
| API | Similar to pandas, with some differences | Exactly the same as pandas | Exactly the same as the Polars lazy API |
| Performance | Very good: operations fully run on the GPU | Good: automatic fallbacks to the CPU for unsupported operations can lead to costly transfers between CPU and GPU | The best: lazy execution, operations fully run on the GPU |
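The cuDF pandas mode, for instance, can be enabled with no changes to existing pandas code. A sketch (requires the cudf package and an NVIDIA GPU):

```python
# Sketch: the cudf.pandas accelerator mode.
# Installing it before importing pandas transparently routes
# supported operations to the GPU, falling back to the CPU otherwise.
import cudf.pandas
cudf.pandas.install()

import pandas as pd  # now GPU-accelerated where possible

df = pd.DataFrame({"key": ["a", "b", "a"], "val": [1, 2, 3]})
print(df.groupby("key")["val"].sum())
```

The same mode can also be enabled without touching the script at all, by running `python -m cudf.pandas script.py`.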

High-level for ML

RAPIDS cuML is a suite of libraries with an API similar to scikit-learn’s that makes it possible to run tabular machine learning tasks on GPUs.
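Because the API mirrors scikit-learn, existing estimator code typically only needs its imports changed. A sketch (requires the cuml package and an NVIDIA GPU; the data here is made up for illustration):

```python
# Sketch: cuML estimators follow the scikit-learn fit/predict interface.
import numpy as np
from cuml.linear_model import LinearRegression  # GPU counterpart of sklearn's

X = np.random.rand(1000, 4).astype(np.float32)
y = X @ np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)

model = LinearRegression()
model.fit(X, y)        # fitting runs on the GPU
pred = model.predict(X)
```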

High-level for DL

Many deep learning frameworks come with CUDA integration, which lets you run code on GPUs with no effort (e.g. JAX) or little effort (e.g. PyTorch, TensorFlow).
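In PyTorch, for example, the "little effort" amounts to placing tensors on a device. A sketch (requires the torch package; it uses the GPU when one is available and falls back to the CPU otherwise):

```python
# Sketch: explicit device placement in PyTorch.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(1024, 1024, device=device)  # tensor created on the chosen device
y = x @ x                                   # matrix multiply runs on that device
```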