High-performance research computing in R

Marie-Hélène Burle

January 31, 2023



Running R on HPC clusters

Loading modules

Intel vs GCC compilers

To compile R packages, you need a C compiler.

In theory, you could use the proprietary Intel compiler, which is loaded by default on the Alliance clusters, but it is recommended to use the GCC compiler instead (R packages can even be compiled with Clang and LLVM, but GCC is the best way to avoid headaches).

It is thus much simpler to always load a gcc module before loading an r module.

R module

To see what versions of R are available on a cluster, run:

module spider r

To see the dependencies of a particular version (e.g. r/4.2.1), run:

module spider r/4.2.1

StdEnv/2020 is a required module for this version.

On most Alliance clusters, it is automatically loaded, so you don’t need to include it. You can double-check with module list or you can include it (before r/4.2.1) just to be sure.

Finally, load your modules:

module load StdEnv/2020 gcc/11.3.0 r/4.2.1

Installing R packages

To install a package, launch the interactive R console with:

R

In the R console, run:

install.packages("<package_name>", repos="<url-cran-mirror>")

The first time you install a package, R will ask you whether you want to create a personal library in your home directory. Answer yes to both questions. Your packages will now install under ~/.

Some packages require additional modules to be loaded before they can be installed. Other packages need additional R packages as dependencies. In either case, you will get explicit error messages. Adding the argument dependencies = TRUE helps in the second case, but you will still have to add packages manually from time to time.

Let’s install the packages needed for this webinar:

install.packages(
  c("tidyverse", "bench", "doFuture", "doRNG", "randomForest", "Rcpp"),
  repos="https://mirror.rcg.sfu.ca/mirror/CRAN/"  # closest mirror from Cedar
)

This will also install the dependencies foreach, future, and iterators.

To leave the R console, press <Ctrl+D>.

Running R jobs

Scripts

To run an R script called <your_script>.R, you first need to write a job script:

Example:

<your_job>.sh
#!/bin/bash
#SBATCH --account=def-<your_account>
#SBATCH --time=15
#SBATCH --mem-per-cpu=3000M
#SBATCH --cpus-per-task=4
#SBATCH --job-name="<your_job>"
module load StdEnv/2020 gcc/11.3.0 r/4.2.1
Rscript <your_script>.R   # Note that R scripts are run with the command `Rscript`

Then launch your job with:

sbatch <your_job>.sh

You can monitor your job with sq (an alias for squeue -u $USER $@).

Interactive jobs

While it is fine to run R on the login node when you install packages, you must start a SLURM job before any heavy computation.

To run R interactively, you should launch an salloc session.

Here is what I will use for this webinar:

salloc --time=1:10:00 --mem-per-cpu=7000M --ntasks=8

This takes me to a compute node where I can launch R to run computations:

R

Performance

Profiling

The first thing to do if you want to improve your code efficiency is to identify bottlenecks in your code. Common tools are:

  • the base R function Rprof()
  • the package profvis

profvis is a newer tool, built by Posit (formerly RStudio). Under the hood, it runs Rprof() to collect data, then produces an interactive HTML widget with a flame graph that allows for easy visual identification of slow sections of code. While this tool integrates well within the RStudio IDE or the RPubs ecosystem, it is not well suited for remote work on a cluster. One option is to profile your code with small data on your own machine. Another option is to use the base profiler with Rprof() directly, as in the sketch below.
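Here is a minimal sketch of profiling with the base profiler (the profiled expression and the output file name are placeholders):

Rprof("profiling.out")                             # Start profiling; samples are written to profiling.out
x <- lapply(1:100, function(i) sort(runif(1e5)))   # Toy code to profile
Rprof(NULL)                                        # Stop profiling
summaryRprof("profiling.out")                      # Summarise the time spent in each function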

Benchmarking

Once you have identified expressions that are particularly slow, you can use benchmarking tools to compare variations of the code.

In the most basic fashion, you can use system.time(), but this is limited and imprecise.

The microbenchmark package is a popular option.

It gives the minimum time, lower quartile, mean, median, upper quartile, and maximum time of R expressions.

The newer bench package has less overhead, is more accurate, and—for sequential code—gives information on memory usage and garbage collections. This is the package I will use today.
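As a quick illustration (a toy comparison, not code from the webinar), bench::mark() runs each expression multiple times, checks that the results are equivalent, and reports the timing distributions:

library(bench)

x <- runif(1e5)
mark(sqrt(x), x^0.5)   # Benchmark two equivalent expressions against each other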

Parallel programming

Multi-threading

We talk about multi-threading when a single process (with its own memory) runs multiple threads.

The execution can happen in parallel—if each thread has access to a CPU core—or by alternating some of the threads on some CPU cores.

Because all the threads of a process share its memory, multi-threading can lead to race conditions when several threads write to the same memory addresses.

Multi-threading does not seem to be a common approach to parallelizing R code.

Multi-processing in shared memory

Multi-processing in shared memory happens when multiple processes execute code on multiple CPU cores of a single node (or a single machine).

The different processes need to communicate with each other, but because they are all running on the CPU cores of a single node, messages can pass via shared memory.
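As a minimal illustration with base R (not part of the webinar code), parallel::mclapply() forks the current process and spreads the iterations over the cores of the node:

library(parallel)

# Run 8 independent simulations over 4 cores of the current node (forking, so not on Windows)
mclapply(1:8, function(i) mean(rnorm(1e6)), mc.cores = 4)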

Multi-processing in distributed memory

When processes involved in the execution of some code run on multiple nodes of a cluster, messages between them need to travel over the cluster interconnect. In that case, we talk about distributed memory.

Running R code in parallel

Package parallel (base R)

The parallel package has been part of the “base” package group since version 2.14.0.
This means that it comes with R.

Most parallel approaches in R build on this package.

We will make use of it to create and close an ad-hoc cluster.

The parallelly package adds functionality to the parallel package.

Package foreach

The foreach package implements a looping construct without an explicit counter. It doesn’t require preallocation of an output container, it brings to R an equivalent of Python or Julia list comprehensions, and, most importantly, it allows for easy parallel execution of loops. Unlike loops, a foreach construct returns a value (loops are used for their side effects).

Let’s look at an example to calculate the sum of 1e4 random vectors of length 3.

We will use foreach and iterators (which creates convenient iterators for foreach):

library(foreach)
library(iterators)

Classic while loop:

set.seed(2)
result1 <- numeric(3)            # Preallocate output container
i <- 0                           # Initialise counter variable

while(i < 1e4) {                 # Finally we run the loop
  result1 <- result1 + runif(3)  # Calculate the sum
  i <- i + 1                     # Update the counter
}

With foreach:

set.seed(2)
result2 <- foreach(icount(1e4), .combine = '+') %do% runif(3)

Verify:

all.equal(result1, result2)
[1] TRUE

The best part of foreach is that you can turn sequential loops into parallel ones by registering a parallel backend and replacing %do% with %dopar%.

There are many parallelization backends available: doFuture, doMC, doMPI, doParallel, doRedis, doRNG, doSNOW, and doAzureParallel.

In this webinar, I will use doFuture, which allows foreach expressions to be evaluated with any of the strategies of the future package.

So first, what is the future package?

Package future

A future is an object that acts as an abstract representation for a value in the future. A future can be resolved (if the value has been computed) or unresolved. If the value is queried while the future is unresolved, the process is blocked until the future is resolved.

Futures allow for asynchronous and parallel evaluations. The future package provides a simple and unified API to evaluate futures.
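Here is a minimal sketch of the explicit API (the implicit %<-% operator shown further down does the same thing):

library(future)
plan(multisession)             # Evaluate futures in background R sessions

f <- future(mean(rnorm(1e6)))  # Create a future: evaluation starts in the background
resolved(f)                    # Check, without blocking, whether it has been computed
value(f)                       # Query the value: blocks until the future is resolved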

Plans

The future package does this thanks to the plan function:

  • plan(sequential): futures are evaluated sequentially in the current R session
  • plan(multisession): futures are evaluated by new R sessions spawned in the background (multi-processing in shared memory)
  • plan(multicore): futures are evaluated in processes forked from the existing process (multi-processing in shared memory)
  • plan(cluster): futures are evaluated on an ad-hoc cluster (allows for distributed parallelism across multiple nodes)

Consistency

To ensure a consistent behaviour across plans, all evaluations are done in a local environment:

library(future)

a <- 1

b %<-% {
  a <- 2
}

a
[1] 1

Let’s return to our example

We had:

set.seed(2)
result2 <- foreach(icount(1e4), .combine = '+') %do% runif(3)

We can replace %do% with %dopar%:

set.seed(2)
result3 <- foreach(icount(1e4), .combine = '+') %dopar% runif(3)

Since we haven’t registered any parallel backend, the expression will still be evaluated sequentially.

To run this in parallel, we need to load doFuture, register it as a backend (with registerDoFuture()), and choose a parallel strategy (e.g. plan(multicore)):

library(foreach)
library(doFuture)

registerDoFuture()
plan(multicore)

set.seed(2)
result3 <- foreach(icount(1e4), .combine = '+') %dopar% runif(3)

With the overhead of parallelization, it actually doesn’t make sense to parallelize such a small computation, so let’s go over a toy example and do some benchmarking.

Toy example

Load packages

For this toy example, I will use a modified version of one of the examples in the foreach vignette: I will build a classification model made of a forest of decision trees thanks to the randomForest package.

Because the code includes randomly generated numbers, I will use the doRNG package, which replaces foreach::%dopar% with doRNG::%dorng%. This follows the recommendations of Pierre L’Ecuyer (1999) and ensures reproducibility.

library(doFuture)       # This will also load the `future` package
library(doRNG)          # This will also load the `foreach` package
library(randomForest)
library(bench)          # To do some benchmarking
Loading required package: foreach
Loading required package: future
Loading required package: rngtools

The code to parallelize

The goal is to create a classifier based on some data (here a matrix of random numbers for simplicity) and a response variable (as factor). This model could then be passed in the predict() function with novel data to generate predictions of classification. But here we are only interested in the creation of the model as this is the part that is computationally intensive. We aren’t interested in actually using it.

set.seed(11)
traindata <- matrix(runif(1e5), 100)
fac <- gl(2, 50)

rf <- foreach(ntree = rep(250, 8), .combine = combine) %do%
  randomForest(x = traindata, y = fac, ntree = ntree)

rf
Call:
 randomForest(x = traindata, y = fac, ntree = ntree)
               Type of random forest: classification
                     Number of trees: 2000
No. of variables tried at each split: 31

Reference timing

This is the non-parallelizable code with %do%:

tref <- mark(
  rf1 <- foreach(ntree = rep(250, 8), .combine = combine) %do%
    randomForest(x = traindata, y = fac, ntree = ntree),
  memory = FALSE
)

tref$median
[1] 5.66s

Plan sequential

This is the parallelizable foreach code, but run sequentially:

registerDoFuture()   # Set the parallel backend
plan(sequential)     # Set the evaluation strategy

# Using bench::mark()
tseq <- mark(
  rf2 <- foreach(ntree = rep(250, 8), .combine = combine) %dorng%
    randomForest(x = traindata, y = fac, ntree = ntree),
  memory = FALSE
)

tseq$median
[1] 5.78s

No surprise: those are similar.

Multi-processing in shared memory

future provides availableCores() to detect the number of available cores:

availableCores()
system
     4

Similar to parallel::detectCores().

This detects the number of CPU cores available to me on the current compute node, that is, what I can use for shared memory multi-processing.

Plan multisession

Shared memory multi-processing can be run with plan(multisession), which spawns new R sessions in the background to evaluate futures:

plan(multisession)

tms <- mark(
  rf2 <- foreach(ntree = rep(250, 8), .combine = combine) %dorng%
    randomForest(x = traindata, y = fac, ntree = ntree),
  memory = FALSE
)

tms$median
[1] 2s

We got a speedup of 5.78 / 2 = 2.9.

Plan multicore

Shared memory multi-processing can also be run with plan(multicore) (except on Windows), which forks the current R process to evaluate futures:

plan(multicore)

tmc <- mark(
  rf2 <- foreach(ntree = rep(250, 8), .combine = combine) %dorng%
    randomForest(x = traindata, y = fac, ntree = ntree),
  memory = FALSE
)

tmc$median
[1] 1.9s

We got a very similar speedup of 5.78 / 1.9 = 3.0.

Multi-processing in distributed memory

I requested 8 tasks from Slurm on a training cluster made of nodes with 4 CPU cores each. Let’s verify that I got them by accessing the SLURM_NTASKS environment variable from within R:

as.numeric(Sys.getenv("SLURM_NTASKS"))
[1] 8

I can create a character vector with the name of the node each task is running on:

(hosts <- system("srun hostname | cut -f 1 -d '.'", intern = TRUE))
chr [1:8] "node1" "node1" "node1" "node1" "node2" "node2" "node2" "node2"

This allows me to create a cluster of workers:

(cl <- parallel::makeCluster(hosts))      # Defaults to type="PSOCK"
socket cluster with 8 nodes on hosts ‘node1’, ‘node2’

Plan cluster

I can now try the code with distributed parallelism using all 8 CPU cores across both nodes:

plan(cluster, workers = cl)

tdis <- mark(
  rf2 <- foreach(ntree = rep(250, 8), .combine = combine) %dorng%
    randomForest(x = traindata, y = fac, ntree = ntree),
  memory = FALSE
)

tdis$median
[1] 1.14s

Speedup: 5.78 / 1.14 = 5.1.

The cluster of workers can be stopped with:

parallel::stopCluster(cl)

Alternative approaches

The multidplyr package partitions data frames across worker processes, allows you to run the usual tidyverse functions on each partition, then collects the processed data.

The furrr package is a parallel equivalent to the purrr package from the tidyverse.
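For example (a sketch assuming a multisession plan, not code from the webinar), furrr::future_map_dbl() is the parallel counterpart of purrr::map_dbl():

library(furrr)
library(future)

plan(multisession, workers = 4)

future_map_dbl(1:8, ~ mean(rnorm(1e6)))   # Same syntax as purrr::map_dbl(), run in parallel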

If you work with genomic data, you might want to have a look at the BiocParallel package from Bioconductor.

Yet another option to run distributed R code is to use the sparklyr package (an R interface to Spark).

Rmpi is a wrapper to MPI (Message-Passing Interface). It has proved slow and problematic on Cedar though.

The boot package provides functions and datasets for bootstrapping; its boot() function can run the resampling replicates in parallel.
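For instance (a hedged sketch, not from the webinar), the parallel and ncpus arguments distribute the R replicates over several cores:

library(boot)

x <- rnorm(1e4)
med <- function(data, indices) median(data[indices])   # Statistic computed on each resample
boot(x, med, R = 10000, parallel = "multicore", ncpus = 4)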

Write C++ with Rcpp


When?

  • Code that cannot easily be parallelized (e.g. multiple recursive function calls)
  • Large number of function calls
  • Need for data structures missing in R
  • Creation of efficient packages

How?

Rcpp provides C++ classes with mappings to R’s .Call() interface. C++ functions can be written in separate source files (compiled with Rcpp::sourceCpp()) or declared directly in R scripts (with Rcpp::cppFunction()).
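As a small illustration (using the Rcpp package installed earlier; the function itself is a toy example), cppFunction() compiles a C++ function and makes it callable from the R session:

library(Rcpp)

# Declare and compile a C++ function from within R
cppFunction('
double sum_cpp(NumericVector x) {
  double total = 0;
  for (int i = 0; i < x.size(); i++) {
    total += x[i];
  }
  return total;
}')

sum_cpp(c(1, 2, 3.5))   # The compiled function can now be called like any R function
[1] 6.5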