R on HPC clusters

Author

Marie-Hélène Burle

In this section, you will learn how to use R on an Alliance cluster: load modules, install packages, and run jobs.

Modules

On the Alliance clusters, a number of utilities are available right away (e.g. Bash utilities, git, tmux, various text editors). Before you can use more specialized software, however, you have to load the module corresponding to the version of your choice, as well as any dependencies it requires.

Modules already loaded

To see the list of loaded modules, run:

module list

As you can see, some modules get loaded by default.

R

First, of course, we need an R module.

To see which versions of R are available on a cluster, run:

module spider r

To see the dependencies of a particular version (e.g. r/4.5.0), run:

module spider r/4.5.0

This shows us that we need StdEnv/2023 to load r/4.5.0.

Your turn:

Check whether StdEnv/2023 is already loaded or whether we need to load it.

C compiler

If you plan on installing any R package, you will also need a C compiler.

A module for gcc (the GNU project C and C++ compiler, part of the GNU Compiler Collection (GCC)) is already loaded by default, so you don’t have anything to do. You can double-check that a gcc module is loaded by running module list.

Your turn:

  • Which gcc version is currently loaded in the session?
  • How can you check which gcc versions are available on our training cluster?
  • What are the dependencies required by gcc version 13.3?

Loading modules

Once you know which modules you need, you can load them:

module load r/4.5.0

If you are loading dependencies, the order is important: the dependencies must be listed before the modules which depend on them. Here, we aren’t loading dependencies so the order doesn’t matter.
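As an illustration of the ordering rule, here is a minimal sketch that loads the dependency explicitly. It assumes the module names found earlier (StdEnv/2023 and r/4.5.0). The modules variable and the guard around the call are only there for illustration, so that the snippet does nothing harmful on a machine without the module command.

```shell
# Dependencies first, then the module that depends on them.
modules="StdEnv/2023 r/4.5.0"

if command -v module >/dev/null 2>&1; then
    # Left unquoted on purpose so that each name is passed as its own argument.
    module load $modules
    module list
else
    echo "module command not found; on a cluster this would run: module load $modules"
fi
```

Listing the loaded modules right after loading them is a quick way to confirm that both the dependency and R are in place.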

Your turn:

How can you replace the currently loaded gcc version with version 13.3?

Shell keybindings

Whether you are in Bash, Zsh, the Julia REPL, the Python shell, or the R Console, you can use the same set of keybindings which come from the text editor Emacs.

It is worth learning them because they are so ubiquitous and will make working in any shell a lot more convenient:

tab        auto-complete command
C-l        clear the terminal
C-p        navigate the command history backward
C-n        navigate the command history forward
C-a        go to the beginning of the line
C-e        go to the end of the line
C-k        delete to the end of the line
C-u        delete to the beginning of the line
C-f        go forward one character
C-b        go backward one character
M-f        go forward one word
M-b        go backward one word

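C-l means: press the Ctrl key (Control on macOS) and the l key at the same time.

M-f means: press the Alt key and the f key at the same time. On macOS, the Option key plays this role only if your terminal is set to use it as the Meta key; pressing Esc and then f also works.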

Installing R packages

For this course, all packages have already been installed in a communal library to save time and to avoid stressing the login node with everyone installing packages at the same time. The section below is thus for reference only.

To install a package, launch the interactive R console with:

R

In the R console, run:

install.packages("<package_name>", repos="<url-cran-mirror>")

or, to install multiple packages at once:

install.packages(c("<package1>", "<package2>", "<package3>"), repos="<url-cran-mirror>")

For the repos argument, choose a CRAN mirror close to the location of your cluster from this list or use https://cloud.r-project.org/.

Example (please don’t run it since I already pre-installed all packages):

install.packages(c("bench", "memoise"), repos="https://muug.ca/mirror/cran/")

The first time you install a package, R will ask you whether you want to create a personal library in your home directory. Answer yes to both questions. Your packages will now install under ~/.

Some packages require additional modules to be loaded before they can be installed. Other packages need additional R packages as dependencies. In either case, you will get explicit error messages. Adding the argument dependencies = TRUE helps in the second case, but you will still have to add packages manually from time to time.

To leave the R console, press <Ctrl+D>.

Using the help function

You are probably already familiar with the help() function which opens the internal R documentation. When you use it in RStudio, the documentation opens in a dedicated pane. If you use R outside RStudio, you might have configured it so that it opens in your browser. When you use R on the command line via SSH, the documentation will open directly in the terminal.

In the terminal, the documentation is displayed by a small application called a pager (usually less). Below are keybindings that will help you navigate the pager (and, probably most importantly, close it!).

Useful keybindings when you are in the pager:

SPACE      scroll one screen down
b          back one screen
q          quit the pager
g          go to the top of the document
7g         go to line 7 from the top
G          go to the bottom of the document
/          search for a term
           n will take you to the next result
           N to the previous result

Running R jobs

There are two types of jobs that can be launched on an Alliance cluster: interactive jobs and batch jobs. We will practice both and discuss their respective merits and when to use which.

For this course, I purposefully built a rather small cluster (10 nodes with 4 CPUs and 15GB each) to give a tangible illustration of the constraints of resource sharing.

Interactive jobs

While it is fine to run R on the login node when you install packages, you must start a SLURM job before any heavy computation.

To run R interactively, you should launch an salloc session.

Example to launch an interactive job on a single CPU with 3500MB of memory for 2h:

salloc --time=2:00:00 --mem-per-cpu=3500M

This takes you to a compute node where you can now launch R to run computations:

R

This however leads to the same inefficient use of resources as happens when running an RStudio server: all the resources that you requested are allocated to you while your job is running, whether you are making use of them (running heavy computations) or not (thinking, typing code, running computations that use only a fraction of the requested resources).

Interactive jobs are thus best kept for prototyping code.

Scripts

To run an R script called <your_script>.R, you first need to write a job script.

Example to run a script on 4 CPUs with 3500MB per CPU for 15min:

<your_job>.sh
#!/bin/bash
#SBATCH --account=def-<your_account>
#SBATCH --time=15
#SBATCH --mem-per-cpu=3500M
#SBATCH --cpus-per-task=4
#SBATCH --job-name="<your_job>"
module load r/4.5.0
Rscript <your_script>.R

Note that R scripts are run with the command Rscript (not R).
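The job script above requests 4 CPUs, but R will generally not make use of the extra CPUs unless your code is written to run in parallel. Here is a minimal sketch of how a job script could find out how many cores it was given. It assumes the standard Slurm variable SLURM_CPUS_PER_TASK, which is set when --cpus-per-task is requested; the ncores name is purely illustrative.

```shell
#!/bin/bash
# Slurm sets SLURM_CPUS_PER_TASK when --cpus-per-task is requested.
# Outside of a job the variable is unset, so fall back to 1.
ncores="${SLURM_CPUS_PER_TASK:-1}"
echo "Cores available to this job: $ncores"
```

Inside R, Sys.getenv("SLURM_CPUS_PER_TASK") reads the same variable, which avoids hard-coding the number of workers in the script.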

Then launch your job with:

sbatch <your_job>.sh

You can monitor your job with sq (an alias for squeue -u $USER $@).
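To make the role of $@ concrete, here is a rough equivalent of sq written as a shell function, for instance for a system where sq is not defined. This is a sketch based on the description above, not the clusters' actual definition: any extra arguments are passed straight to squeue.

```shell
# Hypothetical stand-in for the sq helper: list only your own jobs.
sq() {
    # "$@" forwards any extra options (for example: sq -t RUNNING).
    squeue -u "$USER" "$@"
}
```

With this in place, sq -t RUNNING would show only those of your jobs that are currently running.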

Batch jobs are the best approach to run parallel computations, particularly when they require a lot of hardware.

Batch jobs will save you a lot of waiting time (on Alliance clusters) or money (on commercial clusters). Once you have prototyped your code interactively, turn to this method to run things at scale.