R on HPC clusters

Author

Marie-Hélène Burle

In this section, you will learn how to use R on an Alliance cluster: load modules, install packages, and run jobs.

Modules

On the Alliance clusters, a number of utilities are available right away (e.g. Bash utilities, git, tmux, various text editors). Before you can use more specialized software, however, you have to load the module for the version you want, along with any dependencies it requires.

Modules already loaded

To see the list of loaded modules, run:

module list

As you can see, some modules get loaded by default.

R

First, of course, we need an R module.

To see which versions of R are available on a cluster, run:

module spider r

To see the dependencies of a particular version (e.g. r/4.5.0), run:

module spider r/4.5.0

This shows us that we need StdEnv/2023 to load r/4.5.0.

Your turn:

Check whether StdEnv/2023 is already loaded or whether we need to load it.

C compiler

If you plan on installing any R package, you will also need a C compiler.

In theory, one could use the proprietary Intel compiler which is loaded by default on the Alliance clusters, but it is recommended to replace it with the GCC compiler. R packages can be compiled by any C compiler, including Clang (part of the LLVM project), but GCC is the best way to avoid headaches.

Your turn:

How can you check which gcc versions are available on our training cluster?
What are the dependencies required by gcc/13.3?

Loading the modules

Once you know which modules you need, you can load them.

module load gcc/13.3 r/4.5.0

If you are loading dependencies, the order is important: the dependencies must be listed before the modules which depend on them. Here, we aren’t loading dependencies so the order doesn’t matter.
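As an illustration of the ordering rule, here is a minimal sketch that loads StdEnv/2023 explicitly before the modules that need it. The variable name modules is only illustrative, and the fallback branch is an assumption for machines without the module command: it prints the command instead of running it.

```shell
#!/bin/bash
# Dependency first (StdEnv/2023), then the modules that depend on it.
modules="StdEnv/2023 gcc/13.3 r/4.5.0"

if command -v module >/dev/null 2>&1; then
    # Left unquoted on purpose so that each module name is a separate argument.
    module load $modules
    module list
else
    # Not on a cluster: only show the command that would be run.
    echo "module load $modules"
fi
```

Running module list afterwards is a quick way to confirm that all three modules are loaded.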

Installing R packages

For this course, all packages have already been installed in a communal library to save time and to avoid putting stress on the login node by having everyone install packages at the same time. The section below is thus for reference only.

To install a package, launch the interactive R console with:

R

In the R console, run:

install.packages("<package_name>", repos="<url-cran-mirror>")

or, to install multiple packages at once:

install.packages(c("<package1>", "<package2>", "<package3>"), repos="<url-cran-mirror>")

For the repos argument, choose a CRAN mirror close to the location of your cluster from this list or use https://cloud.r-project.org/.

Example (please don’t run it since I already pre-installed all packages):

install.packages(c("bench", "memoise"), repos="https://muug.ca/mirror/cran/")

The first time you install a package, R will ask you whether you want to create a personal library in your home directory. Answer yes to both questions. Your packages will now install under ~/.

Some packages require additional modules to be loaded before they can be installed. Other packages need additional R packages as dependencies. In either case, you will get explicit error messages. Adding the argument dependencies = TRUE helps in the second case, but you will still have to add packages manually from time to time.
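To check where R looks for packages and whether a given package is already available, a read-only test from the shell can help. This is a minimal sketch: nothing gets installed, the package name bench is taken from the example above, and the variable names pkg and r_code are only illustrative.

```shell
#!/bin/bash
# Read-only check: nothing gets installed.
pkg="bench"   # package name taken from the example above

# R code: print the library paths, then TRUE or FALSE for the package.
r_code="cat(.libPaths(), sep = '\n'); cat('\n${pkg} installed:', requireNamespace('${pkg}', quietly = TRUE), '\n')"

if command -v Rscript >/dev/null 2>&1; then
    Rscript -e "$r_code"
else
    echo "Rscript not found; load an R module first."
fi
```

If the package is reported as missing, it can then be installed from the R console as described above.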

To leave the R console, press <Ctrl+D>.

Running R jobs

There are two types of jobs that can be launched on an Alliance cluster: interactive jobs and batch jobs. We will practice both and discuss their respective merits and when to use which.

For this course, I purposefully built a rather small cluster (10 nodes with 4 CPUs and 15GB each) to give a tangible illustration of the constraints of resource sharing.

Interactive jobs

While it is fine to run R on the login node when you install packages, you must start a SLURM job before any heavy computation.

To run R interactively, you should launch an salloc session.

Example to launch an interactive job on a single CPU with 3500MB of memory for 2h:

salloc --time=2:00:00 --mem-per-cpu=3500M

This takes you to a compute node where you can now launch R to run computations:

R

This however leads to the same inefficient use of resources as happens when running an RStudio server: all the resources that you requested are blocked for you while your job is running, whether you are making use of them (running heavy computations) or not (thinking, typing code, running computations that use only a fraction of the requested resources).

Interactive jobs are thus best kept for developing code.
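Before launching R for anything heavy, it can be worth confirming that you are inside a Slurm job and not still on the login node. A minimal sketch, assuming the standard Slurm environment variable SLURM_JOB_ID, which Slurm sets inside a job (the variable name job_id and the messages are illustrative):

```shell
#!/bin/bash
# SLURM_JOB_ID is set by Slurm inside a job; "none" is our own fallback.
job_id="${SLURM_JOB_ID:-none}"

if [ "$job_id" = "none" ]; then
    echo "Not inside a Slurm job: you are probably on a login node ($(uname -n))."
else
    echo "Inside job $job_id on node $(uname -n)."
fi
```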

Scripts

To run an R script called <your_script>.R, you first need to write a job script:

Example to run a script on 4 CPUs with 3500MB per CPU for 15min:

<your_job>.sh

#!/bin/bash
#SBATCH --account=def-<your_account>
#SBATCH --time=15
#SBATCH --mem-per-cpu=3500M
#SBATCH --cpus-per-task=4
#SBATCH --job-name="<your_job>"
module load StdEnv/2023 gcc/13.3 r/4.5.0
Rscript <your_script>.R

Note that R scripts are run with the command Rscript (not R).
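One way to keep an R script in sync with the --cpus-per-task request is to read the SLURM_CPUS_PER_TASK environment variable, which Slurm sets when that option is given, instead of hard-coding a number of cores. A minimal sketch, in which the fallback value of 1 is an assumption for runs outside a job and the variable names cores and n are illustrative:

```shell
#!/bin/bash
# Slurm sets SLURM_CPUS_PER_TASK when --cpus-per-task is given;
# the fallback of 1 is an assumption for runs outside a job.
cores="${SLURM_CPUS_PER_TASK:-1}"
echo "Cores requested: $cores"

if command -v Rscript >/dev/null 2>&1; then
    # R reads the same variable, so the script never hard-codes a core count.
    Rscript -e 'n <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", "1")); cat("Cores visible to R:", n, "\n")'
else
    echo "Rscript not found; load an R module first."
fi
```

Inside <your_script>.R, the same Sys.getenv() call can feed whichever parallel backend the script uses.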

Then launch your job with:

sbatch <your_job>.sh

You can monitor your job with sq (an alias for squeue -u $USER $@).
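Submission and monitoring can also be chained. The sketch below assumes the standard Slurm commands sbatch, squeue, and sacct are available; <your_job>.sh is the same placeholder as above, and the variable names are illustrative.

```shell
#!/bin/bash
job_script="<your_job>.sh"   # placeholder: replace with your job script

if command -v sbatch >/dev/null 2>&1; then
    # --parsable makes sbatch print only the job ID (possibly followed by ";cluster").
    job_id=$(sbatch --parsable "$job_script")
    job_id="${job_id%%;*}"

    # Show the job while it is pending or running.
    squeue -j "$job_id"

    # Once the job has finished, sacct reports what it actually used.
    sacct -j "$job_id" --format=JobID,JobName,Elapsed,MaxRSS,State
else
    echo "sbatch not found; run this on the cluster."
fi
```

Comparing Elapsed and MaxRSS with what you requested helps you size the next job more tightly.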

Batch jobs are the best approach to run parallel computations, particularly when they require a lot of hardware.

They will save you a lot of waiting time (Alliance clusters) or money (commercial clusters).