A brief introduction to CUDA

Author

Marie-Hélène Burle

NVIDIA CUDA Toolkit (usually referred to as “CUDA”) is a proprietary set of APIs, tools, and libraries by NVIDIA that allows running parallel computations on graphics processing units (GPUs).

The toolkit includes a C/C++ compiler, debugging, benchmarking, and optimization tools, a runtime library, and many GPU-accelerated libraries.

Many open-source tools build on CUDA to provide GPU acceleration in languages such as Python or Julia.

History

The use of GPUs for mathematical calculations rather than image rendering (general-purpose computing on GPUs, or GPGPU) was a niche and difficult research endeavour that involved passing data through graphics shaders.

By providing a standard C++ programming model that removed the need to understand graphics APIs, CUDA brought GPGPU to the mainstream.

Support

Operating systems

| OS | CUDA support |
|---|---|
| Linux | Yes |
| Windows | Yes, with limitations, provided Microsoft Visual Studio (MSVS) is installed |
| WSL 2 | Yes, with limitations |
| macOS | No¹ |

¹ Legacy experimental support on older systems was abandoned.

Accelerators

| Accelerator | CUDA support |
|---|---|
| NVIDIA GPUs | Yes |
| Apple GPUs | No |
| AMD GPUs | Yes, but not natively² |
| TPUs | No³ |

² See below.

³ TPUs use XLA (Accelerated Linear Algebra) as their compiler backend.

Support for AMD GPUs

Three tools with different approaches exist to run CUDA code on AMD hardware:

| | HIP | ZLUDA | SCALE toolkit |
|---|---|---|---|
| Approach | hipify translates CUDA code into portable HIP code, then compiles it for the target GPU (AMD or NVIDIA) | Translates compiled CUDA code at runtime to execute on AMD GPUs | Compiles CUDA code directly into AMD GPU binaries with the SCALE compiler instead of nvcc |
| Code modification required | Yes | No | No |
| Recompilation required | Yes | No | Yes |
| Performance | Native | Near-native | Native |
| Mature | Yes | Yes | No |
| Library gap | Yes | Yes | Yes |
| Developer | AMD (ROCm ecosystem) | Community-driven | Spectral Compute (startup) |
| Open source | Yes | Yes | No |
| Pros | Native performance, officially supported | Very easy to use | Native performance, easier than HIP |
| Cons | Requires a lot of work | Lower performance; might violate NVIDIA’s End User License Agreement | Proprietary and less tested |

On Alliance supercomputers

The good news is that the Alliance supercomputers run Linux and most of their GPUs are NVIDIA (there are a few AMD GPUs on the Nibi cluster).

CUDA is installed as an Lmod module on the clusters:

```bash
# To see available versions:
module spider cuda

# To check the needed dependencies of a specific version (e.g. 12.6):
module spider cuda/12.6

# To load a version:
module load cuda/12.6
```

Alternatives to CUDA

| | Open | Usage | More info |
|---|---|---|---|
| ROCm | Yes | GPGPU | Part of the AMD ROCm stack for AMD GPUs |
| Triton | No (OpenAI) | Deep learning | Python-based domain-specific language (DSL) and compiler |
| OpenCL | Yes | GPGPU | Runs on most hardware, but with lower performance than CUDA and a smaller toolkit |
| SYCL | Yes (supported by Intel) | GPGPU | Runs on most hardware, but with lower performance than CUDA and a smaller toolkit |
| Vulkan | Yes | Graphics & GPGPU | Runs on most hardware, but with much lower performance for GPGPU than CUDA |
| OpenACC | Yes | Offloads loops to GPUs via compiler hints in the code | Runs on most hardware, but restricted to C, C++, and Fortran, with lower performance than CUDA. An easy solution for old codebases in compiled languages |

CUDA for Python

Low-level wrappers

There are two wrappers that provide low-level control of CUDA in Python: PyCUDA and NVIDIA’s own CUDA Python.

For new projects, and unless you are already familiar with PyCUDA, favour CUDA Python.

CUDA Python consists of a collection of packages.

You write CUDA kernels in C++ .cu files, compile them with nvcc, then execute them with cuda-python.
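As a minimal sketch of such a kernel (the file and function names are illustrative, not from any particular project):

```cuda
// add.cu -- illustrative element-wise addition kernel.
// Compile to PTX with: nvcc -ptx add.cu
extern "C" __global__ void add(const float *a, const float *b,
                               float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        out[i] = a[i] + b[i];
}
```

The resulting PTX module can then be loaded and launched from Python through the cuda-python driver API bindings (`cuModuleLoadData`, `cuModuleGetFunction`, `cuLaunchKernel`).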

High-level for NumPy

CuPy is a high-level drop-in replacement for NumPy and SciPy. It also lets you define custom CUDA kernels.
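The drop-in nature means that NumPy code often ports by swapping the import. A minimal sketch (requires the cupy package and an NVIDIA GPU, so it is not runnable on a CPU-only machine):

```python
# Minimal CuPy sketch: same API as NumPy, but arrays live on the GPU.
import cupy as cp

x = cp.arange(1_000_000, dtype=cp.float32)  # array allocated on the GPU
y = cp.sqrt(x).sum()                        # computed on the GPU
result = float(y)                           # copy the scalar back to the host
```

Use `cp.asnumpy()` to move whole arrays back to the host when needed.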

High-level for NumPy with JIT

Numba is a NumPy-aware optimizing compiler that uses LLVM to just-in-time (JIT) compile Python code.

The numba-cuda package by NVIDIA uses the low-level cuda-python bindings from CUDA Python under the hood and makes it possible to run Numba-compiled kernels on GPUs.
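A sketch of what writing such a kernel looks like (requires numba-cuda and an NVIDIA GPU; the kernel and variable names are illustrative):

```python
# Sketch of a Numba CUDA kernel: plain Python, JIT-compiled for the GPU.
import numpy as np
from numba import cuda

@cuda.jit
def add(a, b, out):
    i = cuda.grid(1)          # global thread index in a 1D grid
    if i < out.size:
        out[i] = a[i] + b[i]

n = 100_000
a = np.ones(n, dtype=np.float32)
b = np.ones(n, dtype=np.float32)
out = np.zeros(n, dtype=np.float32)

threads = 256
blocks = (n + threads - 1) // threads
add[blocks, threads](a, b, out)  # Numba copies the arrays to/from the GPU
```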

High-level for DataFrames

RAPIDS cuDF is a library for running Python DataFrame operations on GPUs. It can be used in three ways:

| | cuDF | cuDF pandas module | Polars with GPU engine |
|---|---|---|---|
| API | Similar to pandas, with some differences | Exactly the same as pandas | Exactly the same as the Polars lazy API |
| Performance | Very good: operations fully run on the GPU | Good: automatic fallbacks to the CPU for unsupported operations can lead to costly transfers between CPU and GPU | The best: lazy execution, operations fully run on the GPU |
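The cuDF pandas mode, for instance, can be enabled with no changes to existing pandas code. A sketch (requires the cudf package and an NVIDIA GPU):

```python
# Sketch: the cudf.pandas accelerator mode.
# Installing it before importing pandas transparently routes
# supported operations to the GPU, falling back to the CPU otherwise.
import cudf.pandas
cudf.pandas.install()

import pandas as pd  # now GPU-accelerated where possible

df = pd.DataFrame({"key": ["a", "b", "a"], "val": [1, 2, 3]})
print(df.groupby("key")["val"].sum())
```

The same mode can also be enabled without touching the script at all, by running `python -m cudf.pandas script.py`.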

High-level for ML

RAPIDS cuML is a suite of libraries with an API similar to scikit-learn’s that makes it possible to run tabular machine learning tasks on GPUs.
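Because the API mirrors scikit-learn, existing estimator code typically only needs its imports changed. A sketch (requires the cuml package and an NVIDIA GPU; the data here is made up for illustration):

```python
# Sketch: cuML estimators follow the scikit-learn fit/predict interface.
import numpy as np
from cuml.linear_model import LinearRegression  # GPU counterpart of sklearn's

X = np.random.rand(1000, 4).astype(np.float32)
y = X @ np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)

model = LinearRegression()
model.fit(X, y)        # fitting runs on the GPU
pred = model.predict(X)
```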

High-level for DL

Many deep learning frameworks come with CUDA integration, which lets you run code on GPUs with no effort (e.g. JAX) or little effort (e.g. PyTorch, TensorFlow).
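In PyTorch, for example, the "little effort" amounts to placing tensors on a device. A sketch (requires the torch package; it uses the GPU when one is available and falls back to the CPU otherwise):

```python
# Sketch: explicit device placement in PyTorch.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(1024, 1024, device=device)  # tensor created on the chosen device
y = x @ x                                   # matrix multiply runs on that device
```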