Optimizations
Throwing a lot of hardware at poorly written code is not the answer. Before considering parallelization, you should think about ways to optimize your code sequentially.
Why?
- not all code can be parallelized,
- parallelization is costly (overhead of parallelization and, if you use a supercomputer, waiting time to access an Alliance cluster or money spent on a commercial cloud),
- the optimization of the sequential code will also benefit the parallel code.
In many cases, writing better code will save you more computing time than parallelization.
In this section, we will cover several principles by playing with the programmatic implementation of the fizz buzz game.
Toy example
Fizz buzz is a children's game to practice division. Players take turns counting out loud while replacing:
- any number divisible by 3 with the word “Fizz”,
- any number divisible by 5 with the word “Buzz”,
- any number divisible by both 3 and 5 with the word “FizzBuzz”.
Let's write functions that output the series from 1 to n following these rules, and time them to draw general principles about code efficiency.
Setup
First of all, we need to load the necessary modules:
module load StdEnv/2020 gcc/11.3.0 r/4.3.1
Then we need to launch a job.
Interactive job
If there are only a few of us, we will each use an interactive session with one CPU:
salloc --time=2:00:00 --mem-per-cpu=3500M
We can then launch R and load the benchmarking package we will use throughout this section:
library(bench)
Batch jobs
If there are more of us than there are CPUs in the cluster, we will run batch jobs. In this case:
- Create an R script called optim.R with the code to run (you can reuse the same script for all sections on this page by editing it). Don't forget to load the package bench in your script. A minimal sketch of such a script is given after this list.
- Create a bash script called optim.sh with the following:
<your_job>.sh
#!/bin/bash
#SBATCH --account=def-<your_account>
#SBATCH --time=15
#SBATCH --mem-per-cpu=3500M
#SBATCH --cpus-per-task=4
#SBATCH --job-name="<your_job>"
module load StdEnv/2020 gcc/11.3.0 r/4.3.1
Rscript <your_script>.R
- Run the jobs with:
sbatch optim.sh
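As a minimal sketch, optim.R could look like the following for the first benchmark on this page (f1() and f2() are defined in the next section; replace them with whatever you are currently comparing):

library(bench)

# Definitions of the functions to compare go here, e.g. f1() and f2()
# from the "Pre-allocate memory" section below.

n <- 1e5
mark(f1(n), f2(n))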
Optimizations
Pre-allocate memory
In order to store the results of a loop, we need to create an object and assign to it the result of the loop at each iteration. In this first function, we create an empty object z
of class integer and of length 0
for that purpose:
f1 <- function(n) {
  z <- integer()
  for(i in 1:n) {
    if(i %% 3 == 0 && i %% 5 == 0) {
      z[i] <- "FizzBuzz"
    } else if(i %% 3 == 0) {
      z[i] <- "Fizz"
    } else if(i %% 5 == 0) {
      z[i] <- "Buzz"
    } else {
      z[i] <- i
    }
  }
  z
}
The second function is similar, but this time, we initialize z
with its final length. This means that we are pre-allocating memory for the full vector before we run the loop instead of growing the vector at each iteration:
f2 <- function(n) {
  z <- integer(n)
  for(i in 1:n) {
    if(i %% 3 == 0 && i %% 5 == 0) {
      z[i] <- "FizzBuzz"
    } else if(i %% 3 == 0) {
      z[i] <- "Fizz"
    } else if(i %% 5 == 0) {
      z[i] <- "Buzz"
    } else {
      z[i] <- i
    }
  }
  z
}
Let's make sure that our functions work by testing them on a small number:
f1(20)
[1] "1" "2" "Fizz" "4" "Buzz" "Fizz"
[7] "7" "8" "Fizz" "Buzz" "11" "Fizz"
[13] "13" "14" "FizzBuzz" "16" "17" "Fizz"
[19] "19" "Buzz"
f2(20)
[1] "1" "2" "Fizz" "4" "Buzz" "Fizz"
[7] "7" "8" "Fizz" "Buzz" "11" "Fizz"
[13] "13" "14" "FizzBuzz" "16" "17" "Fizz"
[19] "19" "Buzz"
Now, let’s benchmark them for a large number:
n <- 1e5
mark(f1(n), f2(n))
Warning: Some expressions had a GC in every iteration; so filtering is
disabled.
# A tibble: 2 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 f1(n) 155ms 219ms 4.79 16.55MB 16.0
2 f2(n) 126ms 150ms 6.30 1.15MB 20.5
f2() is consistently faster, although only slightly here; in many cases, the difference you will find will be a lot greater. Note also the large difference in memory allocation: growing z at each iteration forces R to repeatedly allocate a longer vector and copy the existing values into it, whereas f2() writes into memory that was allocated once.
No, loops are not a big ‘no no’
By now, you might be thinking: “Wait… aren’t loops a big ‘no no’ in R? I’ve always been told that they are slow and that one should always use functional programming! We are talking about optimization in this course and we are using loops?!?”
There are a lot of misconceptions around R loops. They can be very slow if you don't pre-allocate memory; otherwise, they are usually at least as fast as the functional alternatives (the apply() family or its tidyverse equivalent, the purrr::map() family). You can choose a functional programming approach for style and readability, but not for speed.
Let’s test it.
First we create a function:
f3 <- function(n) {
  if(n %% 3 == 0 && n %% 5 == 0) {
    "FizzBuzz"
  } else if(n %% 3 == 0) {
    "Fizz"
  } else if(n %% 5 == 0) {
    "Buzz"
  } else {
    n
  }
}
Then we pass it through sapply(). We can test that it works on a small number:
sapply(1:20, f3)
[1] "1" "2" "Fizz" "4" "Buzz" "Fizz"
[7] "7" "8" "Fizz" "Buzz" "11" "Fizz"
[13] "13" "14" "FizzBuzz" "16" "17" "Fizz"
[19] "19" "Buzz"
Finally, we compare the timing with that of f2():
mark(f2(n), sapply(1:n, f3))
Warning: Some expressions had a GC in every iteration; so filtering is
disabled.
# A tibble: 2 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 f2(n) 259ms 261ms 3.83 1.15MB 15.3
2 sapply(1:n, f3) 188ms 199ms 4.88 3.29MB 21.2
As you can see, the loop is competitive with the functional approach, and it allocates considerably less memory.
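If you prefer the tidyverse style mentioned above, a rough purrr equivalent of the sapply() call could look like this (assuming the purrr package is installed; this is an illustration, not part of the course code). As noted, you would choose it for readability, not speed:

library(purrr)

# map() returns a list; unlist() collapses it into a character vector,
# coercing the remaining integers to character as sapply() did
unlist(map(1:20, f3))

You can benchmark it against f2(n) and sapply(1:n, f3) with mark() in exactly the same way.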
Avoid unnecessary operations
Example 1
Calling z as the last command in our function is the same as calling return(z).
From the R documentation:
If the end of a function is reached without calling return, the value of the last evaluated expression is returned.
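As a quick illustration with two throwaway functions (not part of our fizz buzz code), these two definitions return the same value:

square_implicit <- function(x) {
  x^2            # last evaluated expression: returned implicitly
}

square_explicit <- function(x) {
  return(x^2)    # explicit return(): same result
}

square_implicit(3) == square_explicit(3)   # TRUE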
Now, what about using print() instead?
f4 <- function(n) {
  z <- integer(n)
  for(i in 1:n) {
    if(i %% 3 == 0 && i %% 5 == 0) {
      z[i] <- "FizzBuzz"
    } else if(i %% 3 == 0) {
      z[i] <- "Fizz"
    } else if(i %% 5 == 0) {
      z[i] <- "Buzz"
    } else {
      z[i] <- i
    }
  }
  print(z)
}
Let's benchmark it against f2():
mark(f2(n), f4(n))
[1] "1" "2" "Fizz" "4" "Buzz" "Fizz"
[7] "7" "8" "Fizz" "Buzz" "11" "Fizz"
[13] "13" "14" "FizzBuzz" "16" "17" "Fizz"
[19] "19" "Buzz" "Fizz" "22" "23" "Fizz"
[25] "Buzz" "26" "Fizz" "28" "29" "FizzBuzz"
[31] "31" "32" "Fizz" "34" "Buzz" "Fizz"
[37] "37" "38" "Fizz" "Buzz" "41" "Fizz"
...
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
<bch:expr> <bch:tm> <bch:> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
1 f2(n) 131ms 139ms 6.71 NA 21.8 4 13 596ms
2 f4(n) 405ms 411ms 2.43 NA 8.52 2 7 822ms
# ℹ 4 more variables: result <list>, memory <list>, time <list>, gc <list>
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled.
f4() is 3 times slower.
What happened?
print() returns its argument, but it additionally prints it to the standard output. This is why the mark() function printed the output of f4() before printing the timings.
As you can see, printing takes a long time.
If you are evaluating f2() on its own (e.g. f2(20)), the returned result will also be printed to standard output and both functions will behave the same. However, if you are using the function in another context, printing becomes an unnecessary and costly operation, which makes f4() a poor choice.
Here is an example in which f4() would perform a totally unnecessary operation that f2() avoids:
a <- f2(20)
No unnecessary printing.
a <- f4(20)
[1] "1" "2" "Fizz" "4" "Buzz" "Fizz"
[7] "7" "8" "Fizz" "Buzz" "11" "Fizz"
[13] "13" "14" "FizzBuzz" "16" "17" "Fizz"
[19] "19" "Buzz"
Unnecessary printing.
Even worse would be to use:
f5 <- function(n) {
  for(i in 1:n) {
    if(i %% 3 == 0 && i %% 5 == 0) {
      print("FizzBuzz")
    } else if(i %% 3 == 0) {
      print("Fizz")
    } else if(i %% 5 == 0) {
      print("Buzz")
    } else {
      print(i)
    }
  }
}
mark(f2(n), f5(n), check = FALSE)
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl>
1 f2(n) 132.8ms 141.69ms 6.77 NA 25.4 4 15
2 f5(n) 1.65s 1.65s 0.606 NA 12.7 1 21
# ℹ 5 more variables: total_time <bch:tm>, result <list>, memory <list>,
# time <list>, gc <list>
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled.
We have to disable the check here because the results of the two functions are not technically the same (but we don’t care because, in both cases, the series gets created and that’s what we want).
Here the difference in timing is a factor of 12!
Example 2
One modulo operation and equality test can be removed by replacing i %% 3 == 0 && i %% 5 == 0 with i %% 15 == 0. We now have three modulo operations and equality tests per iteration instead of four. This gives us a little speedup:
f6 <- function(n) {
  z <- integer(n)
  for(i in 1:n) {
    if(i %% 15 == 0) {
      z[i] <- "FizzBuzz"
    } else if(i %% 3 == 0) {
      z[i] <- "Fizz"
    } else if(i %% 5 == 0) {
      z[i] <- "Buzz"
    } else {
      z[i] <- i
    }
  }
  z
}
mark(f2(n), f6(n))
Warning: Some expressions had a GC in every iteration; so filtering is
disabled.
# A tibble: 2 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 f2(n) 196ms 205ms 4.75 1.15MB 17.4
2 f6(n) 169ms 186ms 5.39 1.22MB 18.0
But we can remove an additional modulo operation and equality test at each iteration by assigning i %% 3 == 0 and i %% 5 == 0 to variables:
f7 <- function(n) {
  z <- integer(n)
  for(i in 1:n) {
    div3 <- (i %% 3 == 0)
    div5 <- (i %% 5 == 0)
    if(div3 && div5) {
      z[i] <- "FizzBuzz"
    } else if(div3) {
      z[i] <- "Fizz"
    } else if(div5) {
      z[i] <- "Buzz"
    } else {
      z[i] <- i
    }
  }
  z
}
Now we only have two modulo operations and equality tests per iteration and we get another little speedup:
mark(f6(n), f7(n))
Warning: Some expressions had a GC in every iteration; so filtering is
disabled.
# A tibble: 2 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 f6(n) 178ms 179ms 5.53 1.15MB 18.4
2 f7(n) 162ms 163ms 5.96 1.22MB 15.9
Example 3
We can assign 1:n to z instead of initializing it as a vector of zeros, thus rendering the assignment of i to z[i] in the last else statement unnecessary:
f8 <- function(n) {
  z <- 1:n
  for(i in z) {
    div3 <- (i %% 3 == 0)
    div5 <- (i %% 5 == 0)
    if(div3 && div5) {
      z[i] <- "FizzBuzz"
    } else if(div3) {
      z[i] <- "Fizz"
    } else if(div5) {
      z[i] <- "Buzz"
    }
  }
  z
}
This function works:
f8(20)
[1] "1" "2" "Fizz" "4" "Buzz" "Fizz"
[7] "7" "8" "Fizz" "Buzz" "11" "Fizz"
[13] "13" "14" "FizzBuzz" "16" "17" "Fizz"
[19] "19" "Buzz"
and we get a really good speedup here:
mark(f7(n), f8(n))
Warning: Some expressions had a GC in every iteration; so filtering is
disabled.
# A tibble: 2 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 f7(n) 127.9ms 146.2ms 6.46 1.15MB 19.4
2 f8(n) 76.4ms 87.7ms 11.3 1.15MB 22.6
Vectorize whenever possible
We can actually get rid of the loop and use a vectorized approach.
f9 <- function(n) {
  z <- 1:n
  div3 <- (z %% 3 == 0)
  div5 <- (z %% 5 == 0)
  z[div3] <- "Fizz"
  z[div5] <- "Buzz"
  z[(div3 & div5)] <- "FizzBuzz"
  z
}
This still gives us the same result:
f9(20)
[1] "1" "2" "Fizz" "4" "Buzz" "Fizz"
[7] "7" "8" "Fizz" "Buzz" "11" "Fizz"
[13] "13" "14" "FizzBuzz" "16" "17" "Fizz"
[19] "19" "Buzz"
mark(f8(n), f9(n))
Warning: Some expressions had a GC in every iteration; so filtering is
disabled.
# A tibble: 2 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 f8(n) 73ms 83ms 11.7 1.15MB 25.3
2 f9(n) 23.4ms 44ms 24.6 5.62MB 1.89
This speedup (roughly a factor of two on the median timings above) shows how important it is to use vectorization whenever possible.
Replace costly operations where possible
Sometimes, it isn’t obvious that one method will be faster than another. Benchmarking alternative expressions can teach you which ones are faster.
For instance, it is much faster to index a column from a dataframe by its name (e.g. dataframe$column1) than by using list indexing (e.g. dataframe[[1]]).
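Here is a minimal sketch of how you could check such a claim yourself with mark() (the data frame below is made up for the purpose of the comparison):

df <- data.frame(column1 = runif(1e6))

# all three expressions return the same vector, so mark() can verify the results
mark(
  df$column1,
  df[["column1"]],
  df[[1]]
)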
Sometimes, packages exist which bring much more efficiency than can be achieved with base R. In the case of data frames for example, there is data.table.
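As a small sketch of the data.table syntax (assuming the data.table package is available on your system), here is a grouped mean, the kind of operation where data.table typically outperforms base R on large tables:

library(data.table)

# a made-up table with one million rows
dt <- data.table(group = rep(c("a", "b"), each = 5e5),
                 value = runif(1e6))

# mean of value within each group
dt[, .(mean_value = mean(value)), by = group]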
Conclusion
Starting from our first function f1(), we have sped up our code roughly four-fold (based on the median timings below), simply by writing better code and without using parallelization or additional hardware:
mark(f1(n), f9(n))
Warning: Some expressions had a GC in every iteration; so filtering is
disabled.
# A tibble: 2 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 f1(n) 174.4ms 195.6ms 4.79 16.55MB 19.1
2 f9(n) 26.7ms 48.2ms 21.6 5.57MB 1.96
If we had used a silly function such as f5() as our starting point, the speedup would be well over an order of magnitude.