Python at scale

Author

Marie-Hélène Burle

This section gives a brief introduction to using Python efficiently at scale on the Digital Research Alliance of Canada supercomputers.

The problem with Jupyter

When you launch a Jupyter session from a JupyterHub, you are running a Slurm job on compute nodes. If you want to play for 8 hours in Jupyter, you are requesting an 8-hour job. Most of the time you spend in Jupyter, however, is spent typing, thinking, running bits and pieces of code, or doing nothing at all. If you ask for GPUs, many CPUs, and lots of RAM, all of it will remain idle most of the time. This is a suboptimal use of resources.

In addition, if you ask for lots of resources for a long time, you will have to wait for a while before they get allocated to you.

Lastly, you will go through your allocations quickly.

All of this applies equally to interactive sessions launched from an SSH session with salloc.

Ideally, you want to ensure that the heavy-duty resources you request are actually put to use running your heavy calculations and do not sit idle.
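To make the cost concrete, here is a rough back-of-the-envelope sketch. The numbers (8 cores held for 8 hours and busy 10% of the time) are hypothetical and only illustrate the arithmetic. How usage is actually counted against an allocation depends on the cluster.

```python
def idle_core_hours(cores: int, hours: float, busy_fraction: float) -> float:
    """Core-hours that are reserved but not used for computation."""
    if not 0.0 <= busy_fraction <= 1.0:
        raise ValueError("busy_fraction must be between 0 and 1")
    return cores * hours * (1.0 - busy_fraction)


# Hypothetical interactive session: 8 cores for 8 hours, busy 10% of the time.
reserved = 8 * 8
idle = idle_core_hours(cores=8, hours=8, busy_fraction=0.10)
print(f"reserved: {reserved} core-hours, idle: {idle:.1f} core-hours")
```

Under these assumed numbers, most of the reserved core-hours do no work, which is the situation a batch job avoids.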

A better approach

Below is what the workflow looks like.

Prototype your code

Develop and test your Python code interactively to take advantage of the fact that Python is an interpreted language. Do this with little data, few samples, few iterations, and any other method you can use to make the code run faster. Use your own machine, a small interactive job on the cluster (launched with salloc) running IPython, or a small JupyterHub session.
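One way to prototype on little data is to give your analysis an optional size limit, so that the same code runs on a small subset during development and on everything in the batch job. The sketch below is only an illustration: load_samples, analyse, and the max_samples parameter are hypothetical names, and the random numbers stand in for your real data.

```python
import random
import statistics
from typing import List, Optional


def load_samples(n_total: int, seed: int = 0) -> List[float]:
    """Hypothetical stand-in for your real data loading."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_total)]


def analyse(samples: List[float], max_samples: Optional[int] = None) -> float:
    """Compute the mean, optionally on the first max_samples values only."""
    subset = samples if max_samples is None else samples[:max_samples]
    return statistics.fmean(subset)


data = load_samples(100_000)

# Prototype run: a small subset gives a quick answer while you debug.
quick = analyse(data, max_samples=1_000)

# Full run: the same function, without the limit, once the code works.
full = analyse(data)

print(f"prototype mean: {quick:.4f}, full mean: {full:.4f}")
```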

Once you are confident that your code works, move on to the next steps.

SSH connection to a cluster

You will need the username and password that you already used to access JupyterHub. In addition, you will now need a hostname that we will give you.

Run the ssh command

 •  Linux and macOS users

Linux users:   open the terminal emulator of your choice.
macOS users:   open “Terminal”.

Then type:

ssh userxx@hostname

and press Enter.

  • Replace userxx with your username (e.g. user09).
  • Replace hostname with the hostname we will give you on the day of the workshop.

When asked:

Are you sure you want to continue connecting (yes/no/[fingerprint])?

Answer: “yes”.

 •  Windows users

We suggest using the free version of MobaXterm, a program that comes with a terminal emulator and a GUI for SSH sessions.

Here is how to install MobaXterm:

  • download the “Installer edition” to your computer (green button to the right),
  • unzip the file,
  • double-click on the .msi file to launch the installation.

Here is how to log in with MobaXterm:

  • open MobaXterm,
  • click on Session (top left corner),
  • click on SSH (top left corner),
  • fill in the Remote host * box with the cluster hostname we gave you,
  • tick the box Specify username,
  • fill in the box with the username you selected (e.g. user09),
  • press OK,
  • when asked Are you sure you want to continue connecting (yes/no/[fingerprint])?, answer: “yes”.

Here is a live demo.

Enter the password

When prompted, enter the password we gave you.

You will not see anything happen as you type the password. This is normal and it is working, so keep on typing the password.

This is called blind typing and is a Linux security feature. It can be unsettling at first not to get any feedback while typing, as it really looks like it is not working. Type slowly and make sure not to make typos.

Then press Enter.

You are now logged in and your prompt should look like the following (with your actual username):

[userxx@login1 ~]$

Troubleshooting

Problems logging in are almost always due to typos. If you cannot log in, retry slowly, entering your password carefully.

Load a Python module

This is done with the Lmod tool through the module command. You can find the full documentation here and below are the commands you will need:

Get help on the module command:

module -h

module help and module --help are synonyms.

List modules that are already loaded:

module list

See which modules are available for Python:

module spider python

As you can see, there are many versions available.

See how to load Python 3.13.2:

module spider python/3.13.2

To load Python 3.13.2, you need StdEnv/2023, but it is already loaded by default (you can verify this with module list), so we can load our Python module directly:

module load python/3.13.2

You can see that we now have Python 3.13.2 loaded:

module list

Verify the Python version:

python --version

Copy files to the cluster

If you need to copy files to the cluster, open a new terminal window on your own machine. Check the bash prompt to make sure that you are in a local terminal and not on the cluster, then run:

scp /local/path/to/file <username>@<hostname>:path/in/cluster

# enter password

For those using MobaXterm, you can also drag and drop files in the GUI.

Virtual environment

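Create and activate a virtual environment for the job, then upgrade pip and install the packages you need. The --no-index flag tells pip not to download from PyPI and to use the packages available locally on the cluster instead: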

python -m venv ~/env
source ~/env/bin/activate
python -m pip install --upgrade --no-index pip
python -m pip install --no-index <package>
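To confirm from within Python that the virtual environment is the one being used, you can compare sys.prefix with sys.base_prefix. This is a minimal sketch that assumes the environment was created with python -m venv as above. The helper name in_virtualenv is hypothetical.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("interpreter:", sys.executable)
print("in virtual environment:", in_virtualenv())
```

With the environment activated, the interpreter path should sit under ~/env.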

Launch a job

Now you need to write a job script for the Slurm scheduler. This is a text file with a .sh extension that contains the sbatch options and the commands that the job will run.

The Alliance wiki is a great source of information on how to get started. The section on running jobs in particular should be very useful.

Here is an example of a Slurm script:

#!/bin/bash
#SBATCH --job-name=python_run1      # job name
#SBATCH --time=05:00:00             # max walltime 5 h
#SBATCH --cpus-per-task=8           # number of cores
#SBATCH --mem=9000                  # max memory (default unit is megabytes)
#SBATCH --output=python_run1%j.out  # file name for the output
#SBATCH --error=python_run1%j.err   # file name for errors

# load the Python module and activate the virtual environment created earlier
module load python/3.13.2
source ~/env/bin/activate

# run your Python script
python <your-script-name>.py

%j gets replaced with the job ID, which is automatically generated by Slurm.
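Requesting 8 cores with --cpus-per-task does not by itself make a Python script run in parallel: the script has to start the matching number of workers. When --cpus-per-task is set, Slurm exports the value in the SLURM_CPUS_PER_TASK environment variable, which your script can read. Here is a minimal sketch. The helper name allocated_cpus and the fallback of 1 core are assumptions made for illustration.

```python
import os


def allocated_cpus(default: int = 1) -> int:
    """Number of cores Slurm gave this task, or `default` outside a job."""
    value = os.environ.get("SLURM_CPUS_PER_TASK")
    return int(value) if value else default


n_workers = allocated_cpus()
print(f"using {n_workers} worker(s)")

# Pass n_workers to whichever parallel tool your code uses, for example
# multiprocessing.Pool(processes=n_workers).
```

Reading the value from the environment keeps the script in sync with the job script, so changing --cpus-per-task does not require editing the Python code.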

Then, you can submit your job to the cluster:

sbatch <your-slurm-script-name>.sh

And you can check its status with:

sq    # This is an Alliance alias for `squeue -u $USER $@`

In the ST (state) column of the output:

PD stands for pending
R stands for running

If no job is listed when you run sq, it means that the job has finished running.