Running jobs

Author

Marie-Hélène Burle

This section is a quick review of how to submit jobs to the Slurm scheduler on the Alliance clusters. For more detailed information, please visit our wiki.

There are two types of jobs that can be launched on an Alliance cluster: interactive jobs and batch jobs.

Don’t run computations on the login nodes: these are very small nodes, not designed to handle anything heavy.

Interactive jobs

To run Python interactively, you should request an interactive session with salloc.

Example:

salloc --time=xxx --mem-per-cpu=xxx --cpus-per-task=xxx

This takes you to a compute node where you can now launch Python (or, even better, IPython) to run computations:

ipython
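Once inside IPython, it can be useful to confirm that you really are inside a Slurm allocation and not still on a login node. The snippet below is a minimal sketch, assuming that Slurm exports SLURM_JOB_ID, SLURM_CPUS_PER_TASK and SLURM_MEM_PER_CPU into the session (the last two are normally only set when the matching option was passed to salloc). The function name slurm_allocation is hypothetical.

```python
import os


def slurm_allocation(env=None):
    """Return the Slurm allocation visible to this process, or None outside a job."""
    env = os.environ if env is None else env
    job_id = env.get("SLURM_JOB_ID")
    if job_id is None:
        # No job ID: most likely a login node, so avoid heavy computations here.
        return None
    return {
        "job_id": job_id,
        # Assumption: fall back to 1 CPU when --cpus-per-task was not given.
        "cpus_per_task": int(env.get("SLURM_CPUS_PER_TASK", "1")),
        # Kept as the raw string exported by Slurm (None if not set).
        "mem_per_cpu": env.get("SLURM_MEM_PER_CPU"),
    }


print(slurm_allocation())
```

If this prints None, you are probably not inside a job and should not start heavy computations.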

Note that while interactive jobs are great for code development, they are not resource-efficient. All the resources you requested are blocked for you while your job is running, whether you are making use of them (running heavy computations) or not (thinking, typing code, running computations that use only a fraction of the requested resources).
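To get a feel for the cost, here is a small back-of-the-envelope calculation in core-hours (number of cores multiplied by hours). All numbers are hypothetical and only serve as an illustration.

```python
def core_hours(cpus, hours):
    """CPU cores multiplied by the hours they are held or used."""
    return cpus * hours


# Hypothetical interactive session: 4 cores reserved for 3 hours.
reserved = core_hours(cpus=4, hours=3)
# Hypothetical actual use: one core busy for half an hour in total.
used = core_hours(cpus=1, hours=0.5)

print(f"reserved: {reserved} core-hours")
print(f"used: {used} core-hours")
print(f"utilization: {used / reserved:.1%}")
```

In this made-up scenario, only about 4% of the reserved core-hours do any work. The rest is unavailable to other users for the whole session.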

Interactive jobs are therefore best used on sample data, requesting few resources.

Scripts

Once you have a working and tested program, you should submit a batch job requesting the resources you need to get your results. To run a Python script called <your_script>.py, you first need to write a job script:

Example:

<your_job>.sh
#!/bin/bash
#SBATCH --account=def-<your_account>
#SBATCH --time=xxx
#SBATCH --mem-per-cpu=xxx
#SBATCH --cpus-per-task=xxx
#SBATCH --job-name="<your_job>"

source ~/env/bin/activate
python <your_script>.py
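To make the placeholders more concrete, the sketch below fills in the same directives as the job script above and prints the result. Every value (account name, time, memory, number of CPUs, job name, script name) is a hypothetical placeholder, not a recommendation, and the helper render_job_script is not part of Slurm. Slurm accepts times such as hours:minutes:seconds and memory sizes with a unit suffix such as G.

```python
def render_job_script(account, time, mem_per_cpu, cpus_per_task, job_name, script):
    """Build the text of a job script using the directives shown in the template."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --account={account}",
        f"#SBATCH --time={time}",
        f"#SBATCH --mem-per-cpu={mem_per_cpu}",
        f"#SBATCH --cpus-per-task={cpus_per_task}",
        f'#SBATCH --job-name="{job_name}"',
        "",
        "source ~/env/bin/activate",
        f"python {script}",
    ]
    return "\n".join(lines) + "\n"


# All values below are hypothetical and must be adapted to your own case.
job = render_job_script(
    account="def-someuser",
    time="01:00:00",     # hours:minutes:seconds
    mem_per_cpu="4G",    # 4 gigabytes per CPU
    cpus_per_task=4,
    job_name="my_analysis",
    script="my_script.py",
)
print(job)
```

Generating the text this way can help when you need many similar job scripts that differ only in a few values.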

Then launch your job with:

sbatch <your_job>.sh

You can monitor your job with sq (an alias for squeue -u $USER $@).

Batch jobs are the best approach for heavy computations that require a lot of hardware.

They will save you a lot of waiting time (on Alliance clusters) or money (on commercial clusters).