Introduction to high performance research computing in Julia
Interactive sessions for high-performance computing
When you launch a Jupyter session from a JupyterHub, you are running a Slurm job on a compute node. If you want to play for 8 hours in Jupyter, you are requesting an 8 hour job. Now, most of the time you spend on Jupyter is spent typing, running bits and pieces of code, or doing nothing at all. If you ask for GPUs, many CPUs, and lots of RAM, all of it will remain idle most of the time. This is a suboptimal use of resources.
In addition, if you ask for lots of resources for a long time, you will have to wait for a while before they get allocated to you.
Lastly, you will go through your allocations quickly.
All of this applies equally for interactive sessions launched from an SSH session with salloc
.
A better approach
A more efficient strategy is to develop and test your code with small samples, few iterations, etc. in an interactive job (from an SSH session in the cluster with salloc
), on your own computer, or in Jupyter. Once you are confident that your code works, launch an sbatch
job from an SSH session in the cluster to run the code as a script on all your data. This ensures that heavy duty resources that you requested are actually put to use to run your heavy calculations and not seating idle while you are thinking, typing, etc.
Logging on to the cluster
Open a terminal emulator:
Windows users: launch MobaXTerm.
MacOS users: launch Terminal.
Linux users: launch xterm or the terminal emulator of your choice.
Then access the cluster through secure shell:
$ ssh <username>@<hostname> # enter password
Accessing Julia
This is done with the Lmod tool through the module command. You can find the full documentation here and below are the subcommands you will need:
# get help on the module command
$ module help
$ module --help
$ module -h
# list modules that are already loaded
$ module list
# see which modules are available for Julia
$ module spider julia
# see how to load julia 1.3
$ module spider julia/1.3.0
# load julia 1.3 with the required gcc module first
# (the order is important)
$ module load gcc/7.3.0 julia/1.3.0
# you can see that we now have Julia loaded
$ module list
Copying files to the cluster
We will create a julia_workshop
directory in ~/scratch
, then copy our julia script in it.
$ mkdir ~/scratch/julia_job
Open a new terminal window and from your local terminal (make sure that you are not on the remote terminal by looking at the bash prompt) run:
$ scp /local/path/to/sort.jl <username>@<hostname>:scratch/julia_job
$ scp /local/path/to/psort.jl <username>@<hostname>:scratch/julia_job
# enter password
Job scripts
We will not run an interactive session with Julia on the cluster: we already have julia scripts ready to run. All we need to do is to write job scripts to submit to Slurm, the job scheduler used by the Alliance clusters.
We will create 2 scripts: one to run Julia on one core and one on as many cores as are available.
Your turn:
How many processors are there on our training cluster?
We can run Julia with multiple threads by running:
$ JULIA_NUM_THREADS=2 julia
or:
$ julia -t 2
Once in Julia, you can double check that Julia does indeed have access to 2 threads by running:
Threads.nthreads()
Save your job scripts in the files ~/scratch/julia_job/job_julia1c.sh
and job_julia2c.sh
for one and two cores respectively.
Here is what our single core Slurm script looks like:
#!/bin/bash
#SBATCH --job-name=julia1c # job name
#SBATCH --time=00:01:00 # max walltime 1 min
#SBATCH --cpus-per-task=1 # number of cores
#SBATCH --mem=1000 # max memory (default unit is megabytes)
#SBATCH --output=julia1c%j.out # file name for the output
#SBATCH --error=julia1c%j.err # file name for errors
# %j gets replaced with the job number
echo Running NON parallel script
julia sort.jl
echo Running parallel script on $SLURM_CPUS_PER_TASK core
julia -t $SLURM_CPUS_PER_TASK psort.jl
Your turn:
Write the script for 2 cores.
Now, we can submit our jobs to the cluster:
$ cd ~/scratch/julia_job
$ sbatch job_julia1c.sh
$ sbatch job_julia2c.sh
And we can check their status with:
$ sq # This is an Alliance alias for `squeue -u $USER $@`
PD
stands for pending
R
stands for running