A brief introduction to CUDA
NVIDIA CUDA Toolkit (usually referred to as “CUDA”) is a proprietary set of APIs, tools, and libraries by NVIDIA that allows running parallel computations on graphics processing units (GPUs).
The toolkit includes a C/C++ compiler, debugging, benchmarking, and optimization tools, a runtime library, and many GPU-accelerated libraries.
Many open-source tools build on CUDA to provide GPU acceleration in languages such as Python or Julia.
History
The use of GPUs for mathematical calculations rather than image rendering (general-purpose computing on GPUs, or GPGPU) was a niche and difficult research endeavour that involved passing data through graphics shaders.
By providing a standard C++ programming model that removed the need to understand graphics APIs, CUDA brought GPGPU to the mainstream.
Support
Operating systems
| OS | CUDA support |
|---|---|
| Linux | Yes |
| Windows | Yes, with limitations, provided Microsoft Visual Studio (MSVS) is installed |
| WSL 2 | Yes, with limitations |
| macOS | No1 |
1 Legacy experimental support on older systems was abandoned.
Accelerators
| Accelerator | CUDA support |
|---|---|
| NVIDIA GPUs | Yes |
| Apple GPUs | No |
| AMD GPUs | Yes, but not natively2 |
| TPUs | No3 |
2 See below.
3 TPUs use XLA (Accelerated Linear Algebra) as their compiler backend.
Support for AMD GPUs
Three tools with different approaches exist to run CUDA code on AMD hardware:
| | HIP | ZLUDA | SCALE toolkit |
|---|---|---|---|
| How it works | "Hipifies" CUDA code into the portable HIP language, then compiles it for the target GPU (AMD or NVIDIA) | Translates compiled CUDA code at runtime to execute on AMD GPUs | Compiles CUDA code directly into AMD GPU binaries with the SCALE compiler instead of nvcc |
| Code modification required | Yes | No | No |
| Recompilation required | Yes | No | Yes |
| Performance | Native | Near-native | Native |
| Mature | Yes | Yes | No |
| Library gap | Yes | Yes | Yes |
| Developer | AMD (ROCm ecosystem) | Community driven | Spectral Compute (startup) |
| Open-source | Yes | Yes | No |
| Pros | Native performance, officially supported | Very easy to use | Native performance, easier than HIP |
| Cons | Porting requires significant work | Lower performance; might violate NVIDIA’s End User License Agreement | Proprietary and less tested |
On Alliance supercomputers
The good news is that the Alliance supercomputers run Linux and most of their GPUs are from NVIDIA (there are a few AMD GPUs on the Nibi cluster).
CUDA is installed as an Lmod module on the clusters:
# To see available versions:
module spider cuda
# To check the needed dependencies of a specific version (e.g. 12.6):
module spider cuda/12.6
# To load a version:
module load cuda/12.6
Alternatives to CUDA
| | Open | Usage | More info |
|---|---|---|---|
| ROCm | Yes | GPGPU | Part of the AMD ROCm stack for AMD GPUs |
| Triton | Yes (developed by OpenAI) | Deep learning | Python-based domain-specific language (DSL) and compiler |
| OpenCL | Yes | GPGPU | Runs on most hardware, but with lower performance than CUDA and a smaller toolkit |
| SYCL | Yes (supported by Intel) | GPGPU | Runs on most hardware, but with lower performance than CUDA and a smaller toolkit |
| Vulkan | Yes | Graphics & GPGPU | Runs on most hardware, but with much lower performance for GPGPU than CUDA |
| OpenACC | Yes | Offload loops to GPUs by inserting compiler hints in the code | Runs on most hardware, but restricted to C, C++, and Fortran, with lower performance than CUDA. An easy solution for old codebases in compiled languages |
CUDA for Python
Low-level wrappers
There are two wrappers that provide low-level control of CUDA in Python:
- PyCUDA: an older, community-driven library.
- CUDA Python: a newer stack by NVIDIA.
For new projects and unless you are already familiar with PyCUDA, favour CUDA Python.
CUDA Python consists of a collection of packages.
You write CUDA kernels in C++ .cu files, compile them with nvcc, and launch them from Python with the cuda-python bindings.
High-level for NumPy
CuPy is a high-level, drop-in replacement for NumPy and SciPy. It also lets you pass custom CUDA kernels.
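As a minimal sketch of the drop-in idea: the same array code can target the GPU through CuPy when it is available and fall back to NumPy otherwise, since the two share the same array API.

```python
# Minimal sketch of CuPy as a NumPy drop-in: use CuPy on a GPU machine,
# fall back to NumPy elsewhere (the two share the same array interface).
try:
    import cupy as xp   # GPU arrays
except ImportError:
    import numpy as xp  # CPU fallback

a = xp.arange(6).reshape(2, 3)
col_sums = (a ** 2).sum(axis=0)  # squares summed down each column
print(col_sums)                  # [ 9 17 29]
```

On a cluster node with a GPU and CuPy installed, the same script runs unchanged on the device.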
High-level for NumPy with JIT
Numba is an optimizing library that compiles NumPy-aware Python code just in time (JIT) with LLVM.
The numba-cuda package by NVIDIA uses the low-level cuda-python bindings from CUDA Python under the hood and lets you run Numba kernels on GPUs.
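To illustrate the JIT model, here is a CPU sketch (`numba.cuda.jit` follows the same decorator pattern for GPU kernels); the plain-Python fallback is only there so the example also runs where Numba is not installed:

```python
import numpy as np

try:
    from numba import njit  # LLVM-based JIT compilation
except ImportError:         # fallback: run as plain (slow) Python
    def njit(func):
        return func

@njit
def dot(a, b):
    """Naive dot product; Numba compiles this loop to fast machine code."""
    s = 0.0
    for i in range(a.shape[0]):
        s += a[i] * b[i]
    return s

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
print(dot(x, y))  # 32.0
```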
High-level for DataFrames
RAPIDS cuDF is a library for running Python DataFrame operations on GPUs. It can be used in three ways:
| | cuDF | cuDF pandas module | Polars with GPU engine |
|---|---|---|---|
| API | Similar to pandas, with some differences | Exactly the same as pandas | Exactly the same as the Polars lazy API |
| Performance | Very good: operations run fully on the GPU | Good: automatic fallbacks to the CPU for unsupported operations can lead to costly transfers between CPU and GPU | The best: lazy execution and operations run fully on the GPU |
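As a sketch of the cuDF pandas module approach: the script below is ordinary pandas code, and (assuming cuDF is installed) launching it as `python -m cudf.pandas script.py` runs the supported operations on the GPU without any code change.

```python
# Ordinary pandas code; with cuDF installed, running this script as
# `python -m cudf.pandas script.py` accelerates it on the GPU unchanged.
import pandas as pd

df = pd.DataFrame({"key": ["a", "b", "a"], "val": [1, 2, 3]})
totals = df.groupby("key")["val"].sum()
print(totals["a"], totals["b"])  # 4 2
```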
High-level for ML
RAPIDS cuML is a suite of libraries with an API similar to scikit-learn’s that lets you run tabular machine learning tasks on GPUs.
High-level for DL
Many deep learning frameworks come with CUDA integration, letting you run code on GPUs with no effort (e.g. JAX) or little effort (e.g. PyTorch, TensorFlow).
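In PyTorch, the “little effort” mostly amounts to moving tensors and models to the GPU explicitly. A device-agnostic sketch (the try/except is only there so the example degrades gracefully where PyTorch is not installed):

```python
# Device-agnostic PyTorch pattern: one line selects the GPU when available.
try:
    import torch
    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.ones(3, device=device)  # tensor created on GPU or CPU
    total = float((x * 2).sum())
except ImportError:
    total = 6.0  # PyTorch not installed: the value the snippet would produce
print(total)  # 6.0
```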