GPUs on the Alliance clusters
If you plan on using GPUs on the Alliance clusters, there are a number of things you need to be aware of.
Available GPU models
You can find a list of GPU models for each cluster on the Alliance wiki.
While logged in to any cluster, you can print a list of available GPU models for that cluster with:
sinfo -o "%G"|grep gpu|sed 's/gpu://g'|sed 's/),/\n/g'|cut -d: -f1|sort|uniq
You should also visit the wiki page for that cluster for additional information (here is an example for Fir).
Multi-instance GPU (MIG)
Multi-instance GPU (MIG) is an Nvidia technology that partitions GPUs into multiple isolated and independent instances (virtual GPUs), each with its dedicated compute and memory resources.
This technology only works for GPGPUs (general-purpose GPUs) and not for graphics rendering.
If your jobs cannot fully utilize the modern, high-performance GPUs present in the Alliance clusters, using GPU instances consumes less of your job priority and lets your jobs start sooner.
To use a GPU instance instead of a full GPU, you select the profile name of the instance size that corresponds to the size of your job.
Refer to the Alliance wiki for more details on this.
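As a sketch, a job script requesting a MIG instance on a cluster with A100 GPUs might look like the following. The profile name a100_3g.20gb is only an example; check the Alliance wiki for the profiles actually offered on your cluster.

```shell
#!/bin/bash
# Hypothetical example: request a 3g.20gb MIG instance instead of a full A100.
# Verify the available profile names on your cluster before using this.
#SBATCH --gpus=a100_3g.20gb:1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=1:00:00

nvidia-smi   # confirm which GPU instance the job received
```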
GPU farming
If you want to run multiple instances of a CUDA application but the application is too small to saturate a modern GPU, consider GPU farming with the CUDA Multi-Process Service (MPS). Multiple instances will then be processed concurrently on the same GPU.
Refer to the Alliance wiki for more details on this as well.
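As an illustrative sketch (not the exact recipe from the Alliance wiki; my_cuda_app is a placeholder), a GPU farming job starts the MPS control daemon and then launches several instances of the application on the same GPU:

```shell
#!/bin/bash
#SBATCH --gpus-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --time=1:00:00

# Point MPS at job-local directories, then start the control daemon.
export CUDA_MPS_PIPE_DIRECTORY=$SLURM_TMPDIR/mps_pipe
export CUDA_MPS_LOG_DIRECTORY=$SLURM_TMPDIR/mps_log
mkdir -p "$CUDA_MPS_PIPE_DIRECTORY" "$CUDA_MPS_LOG_DIRECTORY"
nvidia-cuda-mps-control -d

# Launch several small instances of the application; MPS lets them
# share the GPU concurrently.
for i in $(seq 1 8); do
    ./my_cuda_app input_$i.dat &
done
wait

# Shut the MPS daemon down cleanly at the end of the job.
echo quit | nvidia-cuda-mps-control
```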
CPUs/GPU ratio
Ideally, neither the GPUs nor the CPUs should become a bottleneck for your job: aim for a balanced request, both in terms of their respective compute power and in terms of their memory.
A balanced ratio highly depends on the type of task your code performs as well as the type of GPU you are using. Typical ratios vary from 4:1 to 12:1, but if your code contains heavy steps that can only be performed on CPUs (e.g. I/O, data preprocessing), you will want more CPUs per GPU. More powerful GPUs also require more CPUs to handle the increased data throughput.
The Alliance wiki maximum recommendations are (per GPU):
- on Fir, no more than 12 CPU cores
- on Narval, no more than 12 CPU cores
- on Nibi, no more than 14 CPU cores
- on Rorqual, no more than 16 CPU cores
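For instance, following these recommendations, a single-GPU job on Fir would request at most 12 cores (a sketch; my_gpu_app is a placeholder for your application):

```shell
#!/bin/bash
# Example request respecting the recommended CPU:GPU ratio on Fir
# (at most 12 CPU cores per H100 GPU).
#SBATCH --gpus-per-node=h100:1
#SBATCH --cpus-per-task=12
#SBATCH --mem=64G
#SBATCH --time=3:00:00

./my_gpu_app   # placeholder for your application
```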
RAM to vRAM ratio
You also need to pay attention to the vRAM of the GPU(s) you are using and pair it with an appropriate amount of RAM for your CPUs. Here too, the type of task you are performing and the GPU model you are using dictate the ideal ratio.
A RAM-to-vRAM ratio of 2:1 to 4:1 works for most applications.
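As a quick back-of-the-envelope calculation (a sketch; the 80 GB vRAM figure corresponds to, e.g., an 80 GB H100), a 3:1 ratio for one such GPU suggests requesting about 240 GB of RAM:

```shell
# Back-of-the-envelope RAM estimate from vRAM and a chosen ratio.
vram_gb=80   # e.g. one GPU with 80 GB of vRAM
ratio=3      # target RAM:vRAM ratio
echo "--mem=$((vram_gb * ratio))G"
```

This prints the value to pass to the Slurm --mem option, here --mem=240G.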
Requesting GPUs from Slurm
Below is the general syntax; however, make sure to check the cluster-specific instructions.
GPUs spread anywhere
--gpus=<type>:<number>
Example:
--gpus=h100:2
GPUs per node
--gpus-per-node=<type>:<number>
Example:
--gpus-per-node=h100:6
Monitoring GPUs
Metrix portal
A growing number of Alliance clusters (as of March 2026, this includes Rorqual, Narval, and Nibi) have a Metrix portal that allows you to monitor resource usage (CPU, GPU, memory, filesystem) in real time, in a graphical and intuitive fashion, from your browser.
Logging metrics
For clusters that don’t have a monitoring portal yet, you can get a snapshot of GPU usage while a job is running by adding the nvidia-smi command to your job script.
If you want to log the monitoring to a file, run instead:
nvidia-smi -l 30 > gpu_log.txt
Replace 30 with the number of seconds appropriate for the monitoring of your job (based on its length).
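Note that nvidia-smi -l keeps running until it is interrupted, so in a job script you would typically run it in the background and stop it once your application finishes (a sketch; my_gpu_app is a placeholder):

```shell
# Log GPU usage to a file every 30 seconds, in the background.
nvidia-smi -l 30 > gpu_log.txt &
LOGGER_PID=$!

./my_gpu_app   # placeholder for your application

# Stop the logger once the application is done.
kill $LOGGER_PID
```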
You can also query specific metrics with, for instance:
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv
This returns the GPU(s) utilization and the memory used.
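To log such metrics periodically with a timestamp, you can combine the query with the -l option (a sketch; the 30-second interval is arbitrary):

```shell
# CSV log of GPU utilization and memory usage, one line every 30 seconds,
# written in the background while the rest of the job script runs.
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used \
           --format=csv -l 30 > gpu_metrics.csv &
```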
Live monitoring
If you want live monitoring, you can use watch to run nvidia-smi periodically and use srun to access the compute node of your choice.
Example to monitor the GPU(s) every second if you have a single GPU node:
srun --jobid=<your_running_job_id> --overlap --pty watch -n 1 nvidia-smi
You can find the job ID with sq.
This will create a static dashboard in your terminal with values updated every second.
If you have multiple GPU nodes, you need to specify the node name:
srun --jobid=<your_running_job_id> --overlap --nodelist=<node_name> --pty watch -n 1 nvidia-smi
You can find the node name with:
srun --jobid=<your_running_job_id> -n1 -c1 scontrol show hostname
Even better than nvidia-smi for this, however, is nvtop.
Example to monitor the GPU(s) every second if you have a single GPU node:
srun --jobid=<your_running_job_id> --overlap --pty nvtop
If you have multiple GPU nodes, you need to specify the node name:
srun --jobid=<your_running_job_id> --overlap --nodelist=<node_name> --pty nvtop
Refer to our wiki page for more information on nvtop.