High-performance research computing in R
Content from the webinar slides for easier browsing.
Running R on HPC clusters
Loading modules
Intel vs GCC compilers
To compile R packages, you need a C compiler.
In theory, you could use the proprietary Intel compiler, which is loaded by default on the Alliance clusters, but it is recommended to use the GCC compiler instead (R packages can even be compiled with Clang and LLVM, but GCC is the simplest way to avoid headaches).
It is thus much simpler to always load a gcc module before loading an r module.
R module
To see what versions of R are available on a cluster, run:
module spider r
To see the dependencies of a particular version (e.g. r/4.2.1), run:
module spider r/4.2.1
StdEnv/2020 is a required module for this version.
On most Alliance clusters, it is automatically loaded, so you don’t need to include it. You can double-check with module list or you can include it (before r/4.2.1) just to be sure.
Finally, load your modules:
module load StdEnv/2020 gcc/11.3.0 r/4.2.1
Installing R packages
To install a package, launch the interactive R console with:
R
In the R console, run:
install.packages("<package_name>", repos="<url-cran-mirror>")repos argument: chose a CRAN mirror close to the location of your cluster or use https://cloud.r-project.org/.
The first time you install a package, R will ask you whether you want to create a personal library in your home directory. Answer yes to both questions. Your packages will now install under ~/.
Some packages require additional modules to be loaded before they can be installed. Other packages need additional R packages as dependencies. In either case, you will get explicit error messages. Adding the argument dependencies = TRUE helps in the second case, but you will still have to add packages manually from time to time.
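For example, a single package could be installed together with its dependencies like this (the mirror URL is just an example; use one close to your cluster):
install.packages("randomForest", repos = "https://cloud.r-project.org/", dependencies = TRUE)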
Let’s install the packages needed for this webinar:
install.packages(
c("tidyverse", "bench", "doFuture", "doRNG", "randomForest", "Rcpp"),
repos="https://mirror.rcg.sfu.ca/mirror/CRAN/" # closest mirror from Cedar
)
This will also install the dependencies foreach, future, and iterators.
To leave the R console, press <Ctrl+D>.
Running R jobs
Scripts
To run an R script called <your_script>.R, you first need to write a job script:
Example:
<your_job>.sh
#!/bin/bash
#SBATCH --account=def-<your_account>
#SBATCH --time=15
#SBATCH --mem-per-cpu=3000M
#SBATCH --cpus-per-task=4
#SBATCH --job-name="<your_job>"
module load StdEnv/2020 gcc/11.3.0 r/4.2.1
Rscript <your_script>.R # Note that R scripts are run with the command `Rscript`
Then launch your job with:
sbatch <your_job>.sh
You can monitor your job with sq (an alias for squeue -u $USER $@).
Interactive jobs
While it is fine to run R on the login node when you install packages, you must start a SLURM job before any heavy computation.
To run R interactively, you should launch an salloc session.
Here is what I will use for this webinar:
salloc --time=1:10:00 --mem-per-cpu=7000M --ntasks=8
This takes me to a compute node where I can launch R to run computations:
R
Performance
Profiling
The first thing to do if you want to improve your code efficiency is to identify bottlenecks in your code. Common tools are:
- the base R function Rprof()
- the package profvis
profvis is a newer tool built by Posit (formerly RStudio). Under the hood, it runs Rprof() to collect data, then produces an interactive HTML widget with a flame graph that makes it easy to visually identify slow sections of code. While this tool integrates well with the RStudio IDE and the RPubs ecosystem, it is not well suited for remote work on a cluster. One option is to profile your code with small data on your own machine. Another option is to use the base profiler with Rprof() directly, as in the example below.
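A minimal profiling session with the base profiler could look like this (the file name and the profiled expression are arbitrary examples):
Rprof("profile.out") # Start sampling and write the profiling data to a file
x <- lapply(1:200, function(i) sort(runif(1e4))) # Code to profile
Rprof(NULL) # Stop sampling
summaryRprof("profile.out")$by.self # Summarise the time spent in each function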
Benchmarking
Once you have identified expressions that are particularly slow, you can use benchmarking tools to compare variations of the code.
In the most basic fashion, you can use system.time(), but this is limited and imprecise.
The microbenchmark package is a popular option.
It gives the minimum time, lower quartile, mean, median, upper quartile, and maximum time of R expressions.
The newer bench package has less overhead, is more accurate, and—for sequential code—gives information on memory usage and garbage collections. This is the package I will use today.
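As a quick illustration (the expressions are arbitrary), mark() runs each expression many times, checks that all results agree, and reports the distribution of timings:
library(bench)
x <- runif(1e5)
mark(
  loop = {s <- 0; for (v in x) s <- s + v^2; s},
  vectorised = sum(x^2)
)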
Parallel programming
Multi-threading
We talk about multi-threading when a single process (with its own memory) runs multiple threads.
The execution can happen in parallel (if each thread has access to a CPU core) or concurrently, with the threads taking turns on the available CPU cores.
Because all threads in a process write to the same memory addresses, multi-threading can lead to race conditions.
Multi-threading does not seem to be a common approach to parallelizing R code.
Multi-processing in distributed memory
When processes involved in the execution of some code run on multiple nodes of a cluster, messages between them need to travel over the cluster interconnect. In that case, we talk about distributed memory.
Running R code in parallel
Package parallel (base R)
The parallel package has been part of the “base” package group since version 2.14.0.
This means that it comes with R.
Most parallel approaches in R build on this package.
We will make use of it to create and close an ad-hoc cluster.
The parallelly package adds functionality to the parallel package.
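For example, parallelly::availableCores() is a safer choice than parallel::detectCores() on a cluster because it respects the resources that Slurm actually allocated to your job (a sketch; the numbers depend on your allocation):
parallel::detectCores() # All cores physically present on the node
parallelly::availableCores() # Cores available to this job (e.g. from SLURM_CPUS_PER_TASK)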
Package foreach
The foreach package implements a looping construct without an explicit counter. It doesn’t require the preallocation of an output container, it brings to R an equivalent of Python or Julia list comprehensions, and, most importantly, it makes it easy to run loops in parallel. Unlike a loop, it returns a value (loops are used for their side effects).
Let’s look at an example to calculate the sum of 1e4 random vectors of length 3.
We will use foreach and iterators (which creates convenient iterators for foreach):
Classic while loop:
set.seed(2)
result1 <- numeric(3) # Preallocate output container
i <- 0 # Initialise counter variable
while(i < 1e4) { # Finally we run the loop
result1 <- result1 + runif(3) # Calculate the sum
i <- i + 1 # Update the counter
}
With foreach:
set.seed(2)
result2 <- foreach(icount(1e4), .combine = '+') %do% runif(3)
Verify:
all.equal(result1, result2)
[1] TRUE
The best part of foreach is that you can turn sequential loops into parallel ones by registering a parallel backend and replacing %do% with %dopar%.
There are many parallelization backends available: doFuture, doMC, doMPI, doParallel, doRedis, doRNG, doSNOW, and doAzureParallel.
In this webinar, I will use doFuture, which allows foreach expressions to be evaluated following any of the strategies of the future package.
So first, what is the future package?
Package future
A future is an object that acts as an abstract representation for a value in the future. A future can be resolved (if the value has been computed) or unresolved. If the value is queried while the future is unresolved, the process is blocked until the future is resolved.
Futures allow for asynchronous and parallel evaluations. The future package provides a simple and unified API to evaluate futures.
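Here is a minimal sketch of this API (the computation is arbitrary and the multisession strategy is explained below):
library(future)
plan(multisession) # Evaluate futures in background R sessions
f <- future({ Sys.sleep(2); 42 }) # Starts evaluating asynchronously
resolved(f) # FALSE while the value is still being computed
value(f) # Blocks until the future is resolved, then returns 42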
Plans
The future package does this thanks to the plan function:
- plan(sequential): futures are evaluated sequentially in the current R session
- plan(multisession): futures are evaluated by new R sessions spawned in the background (multi-processing in shared memory)
- plan(multicore): futures are evaluated in processes forked from the existing process (multi-processing in shared memory)
- plan(cluster): futures are evaluated on an ad-hoc cluster (allows for distributed parallelism across multiple nodes)
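Switching between strategies is a single call (the number of workers below is just an example):
plan(multisession, workers = 4) # Four background R sessions
nbrOfWorkers() # Reports 4
plan(sequential) # Back to sequential evaluation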
Consistency
To ensure a consistent behaviour across plans, all evaluations are done in a local environment:
library(future)
a <- 1
b %<-% {
a <- 2
}
a
[1] 1
Because the future was evaluated in a local environment, the assignment to a inside it did not affect a in the calling session.
Let’s return to our example
We had:
set.seed(2)
result2 <- foreach(icount(1e4), .combine = '+') %do% runif(3)
We can replace %do% with %dopar%:
set.seed(2)
result3 <- foreach(icount(1e4), .combine = '+') %dopar% runif(3)
Since we haven’t registered any parallel backend, the expression will still be evaluated sequentially.
To run this in parallel, we need to load doFuture, register it as a backend (with registerDoFuture()), and choose a parallel strategy (e.g. plan(multicore)):
library(foreach)
library(doFuture)
registerDoFuture()
plan(multicore)
set.seed(2)
result3 <- foreach(icount(1e4), .combine = '+') %dopar% runif(3)
With the overhead of parallelization, it doesn’t actually make sense to parallelize such a small computation, so let’s go over a toy example and do some benchmarking.
Toy example
Load packages
For this toy example, I will use a modified version of one of the examples in the foreach vignette: I will build a classification model made of a forest of decision trees thanks to the randomForest package.
Because the code includes randomly generated numbers, I will use the doRNG package, which replaces foreach::%dopar% with doRNG::%dorng%. This follows the recommendations of Pierre L’Ecuyer (1999) and ensures reproducibility.
library(doFuture) # This will also load the `future` package
library(doRNG) # This will also load the `foreach` package
library(randomForest)
library(bench) # To do some benchmarking
Loading required package: foreach
Loading required package: future
Loading required package: rngtools
The code to parallelize
The goal is to create a classifier based on some data (here a matrix of random numbers for simplicity) and a response variable (as a factor). This model could then be passed to the predict() function with novel data to generate classification predictions. But here we are only interested in the creation of the model, as this is the part that is computationally intensive; we aren’t interested in actually using it.
set.seed(11)
traindata <- matrix(runif(1e5), 100)
fac <- gl(2, 50)
rf <- foreach(ntree = rep(250, 8), .combine = combine) %do%
randomForest(x = traindata, y = fac, ntree = ntree)
rf
Call:
randomForest(x = traindata, y = fac, ntree = ntree)
Type of random forest: classification
Number of trees: 2000
No. of variables tried at each split: 31
Reference timing
This is the non-parallel reference code, with %do%:
tref <- mark(
rf1 <- foreach(ntree = rep(250, 8), .combine = combine) %do%
randomForest(x = traindata, y = fac, ntree = ntree),
memory = FALSE
)
tref$median
[1] 5.66s
Plan sequential
This is the parallelizable foreach code, but run sequentially:
registerDoFuture() # Set the parallel backend
plan(sequential) # Set the evaluation strategy
# Using bench::mark()
tseq <- mark(
rf2 <- foreach(ntree = rep(250, 8), .combine = combine) %dorng%
randomForest(x = traindata, y = fac, ntree = ntree),
memory = FALSE
)
tseq$median
[1] 5.78s
No surprise: those are similar.
Plan multisession
Shared-memory multi-processing can be run with plan(multisession), which spawns new R sessions in the background to evaluate futures:
plan(multisession)
tms <- mark(
rf2 <- foreach(ntree = rep(250, 8), .combine = combine) %dorng%
randomForest(x = traindata, y = fac, ntree = ntree),
memory = FALSE
)
tms$median
[1] 2s
We got a speedup of 5.78 / 2 = 2.9.
Plan multicore
Shared-memory multi-processing can also be run with plan(multicore) (except on Windows), which forks the current R process to evaluate futures:
plan(multicore)
tmc <- mark(
rf2 <- foreach(ntree = rep(250, 8), .combine = combine) %dorng%
randomForest(x = traindata, y = fac, ntree = ntree),
memory = FALSE
)
tmc$median
[1] 1.9s
We got a very similar speedup of 5.78 / 1.9 = 3.0.
Multi-processing in distributed memory
I requested 8 tasks from Slurm on a training cluster made of nodes with 4 CPU cores each. Let’s verify that I got them by accessing the SLURM_NTASKS environment variable from within R:
as.numeric(Sys.getenv("SLURM_NTASKS"))
[1] 8
I can create a character vector with the name of the node each task is running on:
(hosts <- system("srun hostname | cut -f 1 -d '.'", intern = TRUE))
chr [1:8] "node1" "node1" "node1" "node1" "node2" "node2" "node2" "node2"
This allows me to create a cluster of workers:
(cl <- parallel::makeCluster(hosts)) # Defaults to type="PSOCK"
socket cluster with 8 nodes on hosts ‘node1’, ‘node2’
Plan cluster
I can now try the code with distributed parallelism using all 8 CPU cores across both nodes:
plan(cluster, workers = cl)
tdis <- mark(
rf2 <- foreach(ntree = rep(250, 8), .combine = combine) %dorng%
randomForest(x = traindata, y = fac, ntree = ntree),
memory = FALSE
)
tdis$median
[1] 1.14s
Speedup: 5.78 / 1.14 = 5.1.
The cluster of workers can be stopped with:
parallel::stopCluster(cl)
Alternative approaches
The multidplyr package partitions data frames across worker processes, allows you to run the usual tidyverse functions on each partition, then collects the processed data.
The furrr package is a parallel equivalent to the purrr package from the tidyverse.
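For example (a sketch assuming a multisession plan), a purrr::map_dbl() call can be parallelized by switching to its furrr equivalent:
library(future)
library(furrr)
plan(multisession, workers = 4)
future_map_dbl(1:8, function(x) { Sys.sleep(1); x^2 }) # Parallel equivalent of purrr::map_dbl()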
If you work with genomic data, you might want to have a look at the BiocParallel package from Bioconductor.
Yet another option to run distributed R code is to use the sparklyr package (an R interface to Spark).
Rmpi is a wrapper to MPI (Message-Passing Interface). It has proved slow and problematic on Cedar though.
The boot package provides functions and datasets for bootstrapping, with built-in support for running the resampling in parallel.
Write C++ with Rcpp
When?
- Code that cannot easily be parallelized (e.g. multiple recursive function calls)
- Large number of function calls
- Need for data structures missing in R
- Creation of efficient packages
How?
Rcpp provides C++ classes with mappings to R’s .Call(). C++ functions can be declared in source files or directly in R scripts.
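Here is a small sketch of the second approach (the recursive function is only an illustration of code that would be slow in pure R):
library(Rcpp)
cppFunction('
int fib(int n) {
  if (n < 2) return n;
  return fib(n - 1) + fib(n - 2);
}') # Compiles the C++ function and makes it callable from R
fib(30) # Returns 832040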