High-performance computing in Python

Author

Marie-Hélène Burle

This section is a brief introduction to using Python on the Digital Research Alliance of Canada supercomputers.

Interactive sessions for HPC

When you launch a Jupyter session from a JupyterHub, you are running a Slurm job on a compute node. If you want to play for 8 hours in Jupyter, you are requesting an 8-hour job. Now, most of the time you spend in Jupyter goes into typing, running bits and pieces of code, or doing nothing at all. If you ask for GPUs, many CPUs, and lots of RAM, all of it will remain idle most of the time. This is a suboptimal use of resources.

In addition, if you ask for lots of resources for a long time, you will have to wait for a while before they get allocated to you.

Lastly, you will go through your allocations quickly.

All of this applies equally to interactive sessions launched from an SSH session with salloc.

A better approach

A more efficient strategy is to develop and test your code with small samples, few iterations, etc. in an interactive job (launched from an SSH session on the cluster with salloc), on your own computer, or in Jupyter. Once you are confident that your code works, launch an sbatch job from an SSH session on the cluster to run the code as a script on all your data. This ensures that the heavy-duty resources you requested are actually put to use running your heavy computations instead of sitting idle while you are thinking, typing, etc.
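
For instance, an interactive job could be requested with salloc along these lines (a sketch only: def-someuser is a placeholder account name and the resources should match your own needs):

# request 2 cores and 4000M of memory for 1 hour
# (replace def-someuser with your own account)
$ salloc --time=01:00:00 --cpus-per-task=2 --mem=4000M --account=def-someuser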

Logging in to the cluster

Open a terminal emulator:

Windows users: launch MobaXterm.
macOS users: launch Terminal.
Linux users: launch xterm or the terminal emulator of your choice.

Then access the cluster through secure shell:

$ ssh <username>@<hostname>    # enter password

Accessing Python

This is done with the Lmod tool through the module command. The full documentation is available online; below are the subcommands you will need:

# get help on the module command
$ module help
$ module --help
$ module -h

# list modules that are already loaded
$ module list

# see which modules are available for Python
$ module spider python

# see how to load Python 3.10.2
$ module spider python/3.10.2

# load Python 3.10.2 with the required gcc module first
# (the order is important)
$ module load gcc/7.3.0 python/3.10.2

# you can see that we now have Python 3.10.2 loaded
$ module list

Copying files to the cluster

We will create a python_job directory in ~/scratch, then copy our Python scripts into it.

$ mkdir ~/scratch/python_job

Open a new terminal window and, from your local terminal (make sure that you are not in the remote session by looking at the Bash prompt), run:

$ scp /local/path/to/sort.py <username>@<hostname>:scratch/python_job
$ scp /local/path/to/psort.py <username>@<hostname>:scratch/python_job

# enter password

Job scripts

We will not run an interactive session with Python on the cluster: we already have Python scripts ready to run. All we need to do is to write job scripts to submit to Slurm, the job scheduler used by the Alliance clusters.

We will create 2 scripts: one to run Python on one core and one on as many cores as are available.

Your turn:

How many processors are there on our training cluster?

Save your job scripts in the files ~/scratch/python_job/job_python1c.sh and job_python2c.sh for one and two cores respectively.
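
If you are not sure how to find the number of processors, one possible way is to ask Slurm from an SSH session on the cluster (the exact output depends on the cluster):

# list each node with its number of CPUs
$ sinfo -N -o "%N %c"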

Here is what our single core Slurm script looks like:

#!/bin/bash
#SBATCH --job-name=python1c         # job name
#SBATCH --time=00:01:00             # max walltime 1 min
#SBATCH --cpus-per-task=1           # number of cores
#SBATCH --mem=1000                  # max memory (default unit is megabytes)
#SBATCH --output=python1c%j.out     # file name for the output
#SBATCH --error=python1c%j.err      # file name for errors
# %j gets replaced with the job number

python sort.py
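
Running on several cores only helps if the Python code itself makes use of them. The contents of sort.py and psort.py are not shown here; as an illustration, a multi-core job script could pass the allocated core count to the parallel script through the SLURM_CPUS_PER_TASK environment variable set by Slurm (how psort.py uses it is an assumption):

# hypothetical excerpt from a multi-core job script:
# Slurm sets SLURM_CPUS_PER_TASK to the value of --cpus-per-task,
# which the Python script could read with os.environ["SLURM_CPUS_PER_TASK"]
echo "Running on ${SLURM_CPUS_PER_TASK} cores"
python psort.py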

Your turn:

Write the script for 2 cores.

Now, we can submit our jobs to the cluster:

$ cd ~/scratch/python_job
$ sbatch job_python1c.sh
$ sbatch job_python2c.sh

And we can check their status with:

$ sq      # This is an Alliance alias for `squeue -u $USER $@`

In the ST column of the output, PD stands for pending and R for running.
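
Once the jobs have finished, their output and errors end up in the files set with the --output and --error options of the job scripts (here, files named python1c<job number>.out and so on):

# the * matches the job number substituted for %j
$ cat python1c*.out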

Run Python on our training cluster

This is not the method I recommend for this workshop, but I am including it because it is something you might want to use if you need to run heavy computations.

First, you need to load the Python module.

See which Python modules are available:

$ module spider python

See how to load a specific version of Python, for example:

$ module spider python/3.11.5

Load the required dependencies first, then the module:

$ module load StdEnv/2023 python/3.11.5

You can check that the modules were loaded with:

$ module list

And verify the Python version with:

$ python --version