Optimizations
More hardware is not the answer to poorly written code. Before considering parallelization, you should think about ways to optimize your code sequentially.
Why?
- Not all code can be parallelized.
- Parallelization is costly (overhead of parallelization and, if you use a supercomputer, waiting time to access an Alliance cluster or money spent on a commercial cloud).
- The optimization of the sequential code will also benefit the parallel code.
In many cases, writing better code will save you more computing time than parallelization.
In this section, we will cover several principles by playing with the programmatic implementation of the fizz buzz game.
Toy example
Fizz buzz is a children’s game used to practice division. Players take turns counting out loud from “1” while replacing:
- any number divisible by 3 with the word “Fizz”,
- any number divisible by 5 with the word “Buzz”,
- any number divisible by both 3 and 5 with the word “FizzBuzz”.
This creates a series that starts with: "1, 2, Fizz, 4, Buzz, Fizz, 7, 8, Fizz, Buzz, 11, Fizz, 13, 14, FizzBuzz, 16", etc.
Let’s write functions that output series from 1 to n following these rules and time them to draw general principles about code efficiency.
Setup
First of all, we need to load the necessary modules:
module load StdEnv/2023 gcc/13.3 r/4.4.0
Then we need to launch a job. There are two options:
Interactive job
If there are few of us, we will use an interactive session with one CPU each. To launch it, run the following (in the Bash terminal, not in R):
salloc --time=2:00:00 --mem-per-cpu=3500M
We can then launch R:
R
Now, we load the benchmarking package that we will use throughout this section:
library(bench)
Batch jobs
If there are more of us than there are CPUs in the cluster, we will run batch jobs. In this case:
- Create an R script called optim.R with the code to run (you can reuse the same script for all sections on this page by editing it). Don’t forget to load the bench package in your script.
- Create a bash script called optim.sh with the following:
optim.sh
#!/bin/bash
#SBATCH --time=15
#SBATCH --mem-per-cpu=3500M
module load StdEnv/2023 gcc/13.3 r/4.4.0
Rscript optim.R
- Run the jobs with:
sbatch optim.sh
Optimizations
Proper memory pre-allocation
In order to store the results of a loop, we need to create an object and assign to it the result of the loop at each iteration. In this first function, we create an empty object z of class integer and of length 0 for that purpose:
f1 <- function(n) {
  z <- integer()
  for(i in 1:n) {
    if(i %% 3 == 0 && i %% 5 == 0) {
      z[i] <- "FizzBuzz"
    } else if(i %% 3 == 0) {
      z[i] <- "Fizz"
    } else if(i %% 5 == 0) {
      z[i] <- "Buzz"
    } else {
      z[i] <- i
    }
  }
  z
}
The second function is similar, but this time, we initialize z with its final length. This means that we are pre-allocating memory for the full vector before we run the loop instead of growing the vector at each iteration:
f2 <- function(n) {
  z <- integer(n)
  for(i in 1:n) {
    if(i %% 3 == 0 && i %% 5 == 0) {
      z[i] <- "FizzBuzz"
    } else if(i %% 3 == 0) {
      z[i] <- "Fizz"
    } else if(i %% 5 == 0) {
      z[i] <- "Buzz"
    } else {
      z[i] <- i
    }
  }
  z
}
Let’s make sure that our functions work by testing them on a short series:
f1(20)
[1] "1" "2" "Fizz" "4" "Buzz" "Fizz"
[7] "7" "8" "Fizz" "Buzz" "11" "Fizz"
[13] "13" "14" "FizzBuzz" "16" "17" "Fizz"
[19] "19" "Buzz"
f2(20)
[1] "1" "2" "Fizz" "4" "Buzz" "Fizz"
[7] "7" "8" "Fizz" "Buzz" "11" "Fizz"
[13] "13" "14" "FizzBuzz" "16" "17" "Fizz"
[19] "19" "Buzz"
Short series are good to get a feel for what our functions return, but they would be inadequate for benchmarking because the functions would run too fast and the timing differences would be too small. Always make sure that your function runs are long enough when you benchmark.
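If your expressions are still fast, you can also ask bench to sample for longer. As a minimal sketch (the toy function f below is mine, not part of the lesson; min_time and max_iterations are arguments of bench::mark()):

```r
library(bench)

# A toy function that runs too fast on its own for stable timings
f <- function(n) sum(sqrt(seq_len(n)))

# Raising min_time (in seconds) makes mark() collect more iterations,
# which gives more reliable min/median estimates for fast expressions
mark(f(100), min_time = 1)
```

The defaults are usually fine once each expression takes a substantial fraction of a second, as in the benchmarks below.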
Let’s pick a bigger value for n:
n <- 1e5
Now, we can benchmark our functions:
mark(f1(n), f2(n))
Warning: Some expressions had a GC in every iteration; so filtering is
disabled.
# A tibble: 2 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 f1(n) 306ms 311ms 3.22 16.55MB 11.3
2 f2(n) 214ms 219ms 4.58 1.15MB 16.8
f2() is consistently faster, although not by much (speedup of 1.4). In many cases, the difference you will find will be a lot greater.
On the cluster, because memory is allocated outside of R (by Slurm), it is not tracked by mark() (see the documentation).
The output you can see on this site was obtained on my laptop. It shows that a properly written function with pre-allocated memory uses 14 times less memory.
Now, notice how our function actually returns a character and not an integer:
typeof(f2(n))
[1] "character"
So let’s create the object z, which will hold the results of our loop, directly of the proper type:
f3 <- function(n) {
  z <- character(n)
  for(i in 1:n) {
    if(i %% 3 == 0 && i %% 5 == 0) {
      z[i] <- "FizzBuzz"
    } else if(i %% 3 == 0) {
      z[i] <- "Fizz"
    } else if(i %% 5 == 0) {
      z[i] <- "Buzz"
    } else {
      z[i] <- i
    }
  }
  z
}
And now for the benchmark against f2():
mark(f2(n), f3(n))
Warning: Some expressions had a GC in every iteration; so filtering is
disabled.
# A tibble: 2 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 f2(n) 214ms 219ms 4.56 1.23MB 15.2
2 f3(n) 217ms 219ms 4.51 863.8KB 16.5
You can see that there is no difference in timing, but that f3() is still slightly better because it uses a little less memory. This shows that type matters, but the most important thing you want to worry about in memory pre-allocation is the final length of your objects.
No, loops are not a big ‘no no’
By now, you might be thinking: “Wait… aren’t loops a big ‘no no’ in R? I’ve always been told that they are slow and that one should always use functional programming! We are talking about optimization in this course and we are using loops?!?”
There are a lot of misconceptions around R loops. They can be very slow if you don’t pre-allocate memory. Otherwise, they are almost always faster than functional alternatives (the apply() family or their tidyverse equivalent, the purrr::map() family). You can choose a functional programming approach for style and readability, but not for speed.
Let’s test it.
First we create a function:
f4 <- function(n) {
  if(n %% 3 == 0 && n %% 5 == 0) {
    "FizzBuzz"
  } else if(n %% 3 == 0) {
    "Fizz"
  } else if(n %% 5 == 0) {
    "Buzz"
  } else {
    n
  }
}
Then we have to pass our function through sapply().
Let’s make sure that the code works:
sapply(1:20, f4)
[1] "1" "2" "Fizz" "4" "Buzz" "Fizz"
[7] "7" "8" "Fizz" "Buzz" "11" "Fizz"
[13] "13" "14" "FizzBuzz" "16" "17" "Fizz"
[19] "19" "Buzz"
Now, we compare the timing with that of f3() (our fastest function so far):
mark(f3(n), sapply(1:n, f4))
Warning: Some expressions had a GC in every iteration; so filtering is
disabled.
# A tibble: 2 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 f3(n) 262ms 267ms 3.75 781.3KB 16.9
2 sapply(1:n, f4) 355ms 372ms 2.68 3.29MB 12.1
As you can see, the loop is faster (speed up of 1.4). On my laptop, it also used 4 times less memory.
Avoid unnecessary operations
Example 1
Calling z as the last command in our function is the same as calling return(z).
From the R documentation:
If the end of a function is reached without calling return, the value of the last evaluated expression is returned.
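A minimal illustration of that rule (the function names here are mine):

```r
# Implicit return: the value of the last evaluated expression
f_implicit <- function(x) {
  x + 1
}

# Explicit return(): same result, no difference in behaviour
f_explicit <- function(x) {
  return(x + 1)
}

identical(f_implicit(2), f_explicit(2))  # TRUE
```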
Now, what about using print() instead?
f5 <- function(n) {
  z <- character(n)
  for(i in 1:n) {
    if(i %% 3 == 0 && i %% 5 == 0) {
      z[i] <- "FizzBuzz"
    } else if(i %% 3 == 0) {
      z[i] <- "Fizz"
    } else if(i %% 5 == 0) {
      z[i] <- "Buzz"
    } else {
      z[i] <- i
    }
  }
  print(z)
}
Let’s test that it works:
f5(20)
[1] "1" "2" "Fizz" "4" "Buzz" "Fizz"
[7] "7" "8" "Fizz" "Buzz" "11" "Fizz"
[13] "13" "14" "FizzBuzz" "16" "17" "Fizz"
[19] "19" "Buzz"
Now, let’s benchmark it against f3() (still our fastest function so far):
mark(f3(n), f5(n))
[1] "1" "2" "Fizz" "4" "Buzz" "Fizz"
[7] "7" "8" "Fizz" "Buzz" "11" "Fizz"
[13] "13" "14" "FizzBuzz" "16" "17" "Fizz"
[19] "19" "Buzz" "Fizz" "22" "23" "Fizz"
[25] "Buzz" "26" "Fizz" "28" "29" "FizzBuzz"
[31] "31" "32" "Fizz" "34" "Buzz" "Fizz"
[37] "37" "38" "Fizz" "Buzz" "41" "Fizz"
...
# A tibble: 2 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
<bch:expr> <bch:tm> <bch:> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
1 f3(n) 116ms 120ms 7.45 NA 26.1 4 14 537ms
2 f5(n) 925ms 925ms 1.08 NA 3.24 1 3 925ms
# ℹ 4 more variables: result <list>, memory <list>, time <list>, gc <list>
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled.
f5() is 7.7 times slower.
What happened?
print() returns its argument, but it additionally prints it to the standard output. This is why the mark() function printed the output of f5() before printing the timings.
As you can see, printing takes a long time.
If you evaluate f3() on its own (e.g. f3(20)), the returned result is also printed to standard output and the two functions behave the same. However, if you use the function in any other context, printing becomes an unnecessary and time-consuming operation, which makes f5() a bad function.
Here is an example in which f5() would perform a totally unnecessary operation that f3() avoids:
a <- f3(20)
No unnecessary printing.
a <- f5(20)
[1] "1" "2" "Fizz" "4" "Buzz" "Fizz"
[7] "7" "8" "Fizz" "Buzz" "11" "Fizz"
[13] "13" "14" "FizzBuzz" "16" "17" "Fizz"
[19] "19" "Buzz"
Unnecessary printing.
Even worse would be to use:
f6 <- function(n) {
  for(i in 1:n) {
    if(i %% 3 == 0 && i %% 5 == 0) {
      print("FizzBuzz")
    } else if(i %% 3 == 0) {
      print("Fizz")
    } else if(i %% 5 == 0) {
      print("Buzz")
    } else {
      print(i)
    }
  }
}
Let’s test it:
f6(20)
[1] 1
[1] 2
[1] "Fizz"
[1] 4
[1] "Buzz"
[1] "Fizz"
[1] 7
[1] 8
[1] "Fizz"
[1] "Buzz"
[1] 11
[1] "Fizz"
[1] 13
[1] 14
[1] "FizzBuzz"
[1] 16
[1] 17
[1] "Fizz"
[1] 19
[1] "Buzz"
The values are correct, although the output is of a different type (NULL instead of character, since our function didn’t return anything and the values got printed as a side effect of the for loop).
Benchmark against f3():
mark(f3(n), f6(n), check = FALSE)
# A tibble: 2 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
<bch:expr> <bch:tm> <bch:> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
1 f3(n) 105ms 108ms 9.10 NA 30.9 5 17 549ms
2 f6(n) 6s 6s 0.167 NA 3.34 1 20 6s
# ℹ 4 more variables: result <list>, memory <list>, time <list>, gc <list>
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled.
We need to disable the check here because the results of the two functions are not the same.
Here the difference in timing is a factor of 55.5 due to all those printing calls.
Example 2
One modulo operation and equality test can be removed by replacing i %% 3 == 0 && i %% 5 == 0 with i %% 15 == 0. We now have three modulo operations and equality tests per iteration instead of four. This gives us a little speedup:
f7 <- function(n) {
  z <- character(n)
  for(i in 1:n) {
    if(i %% 15 == 0) {
      z[i] <- "FizzBuzz"
    } else if(i %% 3 == 0) {
      z[i] <- "Fizz"
    } else if(i %% 5 == 0) {
      z[i] <- "Buzz"
    } else {
      z[i] <- i
    }
  }
  z
}
The benchmark with our fastest function f3() gives:
mark(f3(n), f7(n))
Warning: Some expressions had a GC in every iteration; so filtering is
disabled.
# A tibble: 2 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 f3(n) 326ms 330ms 3.03 781KB 12.1
2 f7(n) 311ms 311ms 3.22 852KB 11.3
But we can remove an additional modulo operation and equality test at each iteration by assigning i %% 3 == 0 and i %% 5 == 0 to variables:
f8 <- function(n) {
  z <- character(n)
  for(i in 1:n) {
    div3 <- (i %% 3 == 0)
    div5 <- (i %% 5 == 0)
    if(div3 && div5) {
      z[i] <- "FizzBuzz"
    } else if(div3) {
      z[i] <- "Fizz"
    } else if(div5) {
      z[i] <- "Buzz"
    } else {
      z[i] <- i
    }
  }
  z
}
Now we only have two modulo operations and equality tests per iteration and we get another little speedup when we benchmark it against f7(), our new best function:
mark(f7(n), f8(n))
Warning: Some expressions had a GC in every iteration; so filtering is
disabled.
# A tibble: 2 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 f7(n) 183ms 186ms 5.36 781KB 19.7
2 f8(n) 173ms 175ms 5.72 856KB 17.2
Example 3
We can assign 1:n to z instead of initializing it as an empty vector, thus rendering the assignment of i to z[i] in the last else statement unnecessary:
f9 <- function(n) {
  z <- 1:n
  for(i in z) {
    div3 <- (i %% 3 == 0)
    div5 <- (i %% 5 == 0)
    if(div3 && div5) {
      z[i] <- "FizzBuzz"
    } else if(div3) {
      z[i] <- "Fizz"
    } else if(div5) {
      z[i] <- "Buzz"
    }
  }
  z
}
This function works:
f9(20)
[1] "1" "2" "Fizz" "4" "Buzz" "Fizz"
[7] "7" "8" "Fizz" "Buzz" "11" "Fizz"
[13] "13" "14" "FizzBuzz" "16" "17" "Fizz"
[19] "19" "Buzz"
and we get a little more speedup when compared to f8(), our current best function:
mark(f8(n), f9(n))
Warning: Some expressions had a GC in every iteration; so filtering is
disabled.
# A tibble: 2 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 f8(n) 180ms 182ms 5.37 781.3KB 17.9
2 f9(n) 130ms 134ms 7.47 1.15MB 18.7
Vectorize whenever possible
We can actually get rid of the loop and use a vectorized approach instead, utilizing what really constitutes the strength of the R language. The following is pure R style at its best:
f10 <- function(n) {
  z <- 1:n
  div3 <- (z %% 3 == 0)
  div5 <- (z %% 5 == 0)
  z[div3] <- "Fizz"
  z[div5] <- "Buzz"
  z[div3 & div5] <- "FizzBuzz"
  z
}
This still gives us the same result:
f10(20)
[1] "1" "2" "Fizz" "4" "Buzz" "Fizz"
[7] "7" "8" "Fizz" "Buzz" "11" "Fizz"
[13] "13" "14" "FizzBuzz" "16" "17" "Fizz"
[19] "19" "Buzz"
Now for the benchmark with f9() (our best function up to this point):
mark(f9(n), f10(n))
Warning: Some expressions had a GC in every iteration; so filtering is
disabled.
# A tibble: 2 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 f9(n) 114.6ms 117.8ms 8.11 1.21MB 17.8
2 f10(n) 30.2ms 33.9ms 29.2 5.62MB 1.94
The speedup of 3.5 shows the importance of using vectorization whenever possible.
Replace costly operations where possible
Sometimes, it isn’t obvious that one method will be faster than another. Benchmarking alternative expressions can teach you which ones are faster.
For instance, it is much faster to index a column from a dataframe by its name (e.g. dataframe$column1) than by using list indexing (e.g. dataframe[[1]]). Until you test it, there is nothing obvious about this because it has to do with how R processes the data under the hood.
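You can check this kind of claim yourself with bench::mark(). A sketch (the data frame below is a made-up example, not part of the lesson):

```r
library(bench)

# A single-column data frame large enough for measurable timings
df <- data.frame(x = runif(1e6))

# Both expressions return the same vector, so mark() can verify that
# the results match while timing the two indexing methods
mark(df$x, df[[1]], iterations = 100)
```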
Use faster packages
To achieve the best performance, you should look for efficient packages and learn to use them. Some packages bring much more efficiency than can be achieved with base R or the tidyverse. In the case of data frames, for example, there is data.table.
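As a small sketch of what data.table looks like (assuming the data.table package is installed; the column names are mine), here is a grouped aggregation in its dt[i, j, by] syntax:

```r
library(data.table)

dt <- data.table(g = rep(1:3, each = 2), x = 1:6)

# Grouped mean of x in data.table's dt[i, j, by] syntax; the
# aggregation runs in data.table's optimized C code
dt[, .(mean_x = mean(x)), by = g]
```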
Conclusion
Starting from our first function f1(), we have gained a speedup of 6, simply by writing better code and without using parallelization or additional hardware:
mark(f1(n), f10(n))
Warning: Some expressions had a GC in every iteration; so filtering is
disabled.
# A tibble: 2 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 f1(n) 279ms 289.6ms 3.45 16.63MB 13.8
2 f10(n) 34.7ms 36.6ms 26.9 5.57MB 1.92
If we had used a silly function such as f6() as our starting function, the speedup would be 333.
Before thinking about running R in parallel or throwing GPUs at your problem, hoping that these would solve the slowness of your code, identify the bottlenecks and rewrite the slow sections more efficiently.