Optimizations

Author

Marie-Hélène Burle

Throwing a lot of hardware at poorly written code is not the answer. Before considering parallelization, you should think about ways to optimize your code sequentially.

Why?

  • not all code can be parallelized,
  • parallelization is costly (it adds overhead and, if you use a supercomputer, waiting time to access an Alliance cluster or money spent on a commercial cloud),
  • the optimization of the sequential code will also benefit the parallel code.

In many cases, writing better code will save you more computing time than parallelization.

In this section, we will cover several principles by playing with the programmatic implementation of the fizz buzz game.

Toy example

Fizz buzz is a children’s game to practice division. Players take turns counting out loud while replacing:

  • any number divisible by 3 with the word “Fizz”,
  • any number divisible by 5 with the word “Buzz”,
  • any number divisible by both 3 and 5 with the word “FizzBuzz”.

Let’s write functions that output series from 1 to n following these rules and time them to draw general principles about code efficiency.

Setup

First of all, we need to load the necessary modules:

module load StdEnv/2020 gcc/11.3.0 r/4.3.1

Then we need to launch a job.

Interactive job

If there are only a few of us, we will each use an interactive session with one CPU:

salloc --time=2:00:00 --mem-per-cpu=3500M

We can then launch R and load the benchmarking package we will use throughout this section:

library(bench)

Batch jobs

If there are more of us than there are CPUs available in the cluster, we will run batch jobs. In this case:

  • Create an R script called optim.R with the code to run (you can reuse the same script for all sections on this page by editing it); a minimal sketch of such a script is given after this list. Don’t forget to load the package bench in your script.
  • Create a bash script called optim.sh with the following:
optim.sh
#!/bin/bash
#SBATCH --account=def-<your_account>
#SBATCH --time=15
#SBATCH --mem-per-cpu=3500M
#SBATCH --cpus-per-task=4
#SBATCH --job-name="<your_job>"
module load StdEnv/2020 gcc/11.3.0 r/4.3.1
Rscript optim.R
  • Run the job with:
sbatch optim.sh
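
For reference, here is a minimal sketch of what optim.R could look like (sqrt(1:n) is only a placeholder expression; replace it with the functions defined in the sections below as you work through them):

# optim.R: minimal skeleton for the batch workflow
library(bench)

n <- 1e5
results <- mark(sqrt(1:n))   # placeholder benchmark; swap in e.g. mark(f1(n), f2(n))
print(results)               # print() so the results appear in the job's output file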

Optimizations

Pre-allocate memory

To store the results of a loop, we need to create an object and assign the result of each iteration to it. In this first function, we create an empty object z of class integer and of length 0 for that purpose:

f1 <- function(n) {
  z <- integer()
  for(i in 1:n) {
    if(i %% 3 == 0 && i %% 5 == 0) {
      z[i] <- "FizzBuzz"
    } else if(i %% 3 == 0) {
      z[i] <- "Fizz"
    } else if(i %% 5 == 0) {
      z[i] <- "Buzz"
    } else {
      z[i] <- i
    }
  }
  z
}

The second function is similar, but this time, we initialize z with its final length. This means that we are pre-allocating memory for the full vector before we run the loop instead of growing the vector at each iteration:

f2 <- function(n) {
  z <- integer(n)
  for(i in 1:n) {
    if(i %% 3 == 0 && i %% 5 == 0) {
      z[i] <- "FizzBuzz"
    } else if(i %% 3 == 0) {
      z[i] <- "Fizz"
    } else if(i %% 5 == 0) {
      z[i] <- "Buzz"
    } else {
      z[i] <- i
    }
  }
  z
}

Let’s make sure that our functions work by testing them on a small number:

f1(20)
 [1] "1"        "2"        "Fizz"     "4"        "Buzz"     "Fizz"    
 [7] "7"        "8"        "Fizz"     "Buzz"     "11"       "Fizz"    
[13] "13"       "14"       "FizzBuzz" "16"       "17"       "Fizz"    
[19] "19"       "Buzz"    
f2(20)
 [1] "1"        "2"        "Fizz"     "4"        "Buzz"     "Fizz"    
 [7] "7"        "8"        "Fizz"     "Buzz"     "11"       "Fizz"    
[13] "13"       "14"       "FizzBuzz" "16"       "17"       "Fizz"    
[19] "19"       "Buzz"    

Now, let’s benchmark them for a large number:

n <- 1e5
mark(f1(n), f2(n))
Warning: Some expressions had a GC in every iteration; so filtering is
disabled.
# A tibble: 2 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 f1(n)         139ms    140ms      7.01   16.65MB     26.3
2 f2(n)         123ms    124ms      8.04    1.24MB     30.5

f2() is consistently faster, although only slightly here. In many cases, the difference will be a lot greater.

Note also the large difference in memory allocation.
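
To get a feel for why growing an object is expensive, here is a small standalone illustration (it is not part of the fizz buzz code and the grow() and prealloc() names are made up for this example). Growing with c() forces R to allocate a new, longer vector and copy all existing elements at every iteration, while writing into a pre-allocated vector allocates memory only once:

grow <- function(n) {
  x <- integer()
  for (i in 1:n) x <- c(x, i)   # allocates a new vector and copies x every time
  x
}

prealloc <- function(n) {
  x <- integer(n)               # memory for the full vector allocated up front
  for (i in 1:n) x[i] <- i
  x
}

mark(grow(1e4), prealloc(1e4))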

No, loops are not a big ‘no no’

By now, you might be thinking: “Wait… aren’t loops a big ‘no no’ in R? I’ve always been told that they are slow and that one should always use functional programming! We are talking about optimization in this course and we are using loops?!?”

There are a lot of misconceptions around R loops. They can be very slow if you don’t pre-allocate memory. Otherwise, they are almost always faster than functional approaches (the apply() family or their tidyverse equivalents from the purrr::map() family). You can choose a functional programming approach for style and readability, but not for speed.

Let’s test it.

First we create a function:

f3 <- function(n) {
  if(n %% 3 == 0 && n %% 5 == 0) {
    "FizzBuzz"
  } else if(n %% 3 == 0) {
    "Fizz"
  } else if(n %% 5 == 0) {
    "Buzz"
  } else {
    n
  }
}

Then we pass it through sapply(). We can test that it works on a small number:

sapply(1:20, f3)
 [1] "1"        "2"        "Fizz"     "4"        "Buzz"     "Fizz"    
 [7] "7"        "8"        "Fizz"     "Buzz"     "11"       "Fizz"    
[13] "13"       "14"       "FizzBuzz" "16"       "17"       "Fizz"    
[19] "19"       "Buzz"    

Finally, we compare the timing with that of f2():

mark(f2(n), sapply(1:n, f3))
Warning: Some expressions had a GC in every iteration; so filtering is
disabled.
# A tibble: 2 × 6
  expression           min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 f2(n)              132ms    135ms      7.40    1.15MB     35.2
2 sapply(1:n, f3)    180ms    189ms      5.18    3.35MB     31.1

As you can see, the loop is faster.
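
The same holds for the tidyverse equivalents mentioned earlier. As a quick check (this assumes that the purrr package is available, which is not otherwise needed in this lesson; f3_purrr is a name made up for this example), we can wrap f3() in purrr::map() and add it to the benchmark. unlist() mimics the simplification that sapply() performs:

library(purrr)

f3_purrr <- function(n) unlist(map(1:n, f3))

mark(f2(n), sapply(1:n, f3), f3_purrr(n))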

Avoid unnecessary operations

Example 1

Calling z as the last command in our function is the same as calling return(z).

From the R documentation:

If the end of a function is reached without calling return, the value of the last evaluated expression is returned.
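
As a small illustration (the function names below are made up for this example), these two functions return exactly the same value:

f_implicit <- function(x) {
  y <- x^2
  y           # the last evaluated expression is returned
}

f_explicit <- function(x) {
  y <- x^2
  return(y)   # an explicit return() gives the same result
}

identical(f_implicit(3), f_explicit(3))
[1] TRUE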

Now, what about using print() instead?

f4 <- function(n) {
  z <- integer(n)
  for(i in 1:n) {
    if(i %% 3 == 0 && i %% 5 == 0) {
      z[i] <- "FizzBuzz"
    } else if(i %% 3 == 0) {
      z[i] <- "Fizz"
    } else if(i %% 5 == 0) {
      z[i] <- "Buzz"
    } else {
      z[i] <- i
    }
  }
  print(z)
}

Let’s benchmark it against f2():

mark(f2(n), f4(n))
 [1] "1"        "2"        "Fizz"     "4"        "Buzz"     "Fizz"    
 [7] "7"        "8"        "Fizz"     "Buzz"     "11"       "Fizz"    
[13] "13"       "14"       "FizzBuzz" "16"       "17"       "Fizz"    
[19] "19"       "Buzz"     "Fizz"     "22"       "23"       "Fizz"    
[25] "Buzz"     "26"       "Fizz"     "28"       "29"       "FizzBuzz"
[31] "31"       "32"       "Fizz"     "34"       "Buzz"     "Fizz"    
[37] "37"       "38"       "Fizz"     "Buzz"     "41"       "Fizz"
...

  expression      min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr> <bch:tm> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 f2(n)         131ms  139ms      6.71        NA    21.8      4    13      596ms
2 f4(n)         405ms  411ms      2.43        NA     8.52     2     7      822ms
# ℹ 4 more variables: result <list>, memory <list>, time <list>, gc <list>
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled.

f4() is 3 times slower.

What happened?

print() returns its argument, but it additionally prints it to the standard output. This is why the mark() function printed the output of f4() before printing the timings.

As you can see, printing takes a long time.

If you evaluate f2() on its own (e.g. f2(20)), the returned result is also printed to standard output and both functions behave the same. However, if you use the function in any other context (for instance to assign its result to a variable), printing becomes an unnecessary and costly operation, which makes f4() a poor choice.

Here is an example in which f4() would perform a totally unnecessary operation that f2() avoids:

a <- f2(20)

No unnecessary printing.

a <- f4(20)
 [1] "1"        "2"        "Fizz"     "4"        "Buzz"     "Fizz"    
 [7] "7"        "8"        "Fizz"     "Buzz"     "11"       "Fizz"    
[13] "13"       "14"       "FizzBuzz" "16"       "17"       "Fizz"    
[19] "19"       "Buzz"    

Unnecessary printing.

Even worse would be to use:

f5 <- function(n) {
  for(i in 1:n) {
    if(i %% 3 == 0 && i %% 5 == 0) {
      print("FizzBuzz")
    } else if(i %% 3 == 0) {
      print("Fizz")
    } else if(i %% 5 == 0) {
      print("Buzz")
    } else {
      print(i)
    }
  }
}

mark(f2(n), f5(n), check = FALSE)
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
1 f2(n)       132.8ms 141.69ms     6.77         NA     25.4     4    15
2 f5(n)         1.65s    1.65s     0.606        NA     12.7     1    21
# ℹ 5 more variables: total_time <bch:tm>, result <list>, memory <list>,
#   time <list>, gc <list>
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled.

We have to disable the check here because the two functions don’t return the same thing: f2() returns the series while f5() only prints it and returns NULL. We don’t mind because, in both cases, the series gets created and that’s what we want.

Here the difference in timing is a factor of 12!

Example 2

One modulo operation and equality test can be removed by replacing i %% 3 == 0 && i %% 5 == 0 with i %% 15 == 0. We now have at most three modulo operations and equality tests per iteration instead of four. This gives us a small speedup:

f6 <- function(n) {
  z <- integer(n)
  for(i in 1:n) {
    if(i %% 15 == 0) {
      z[i] <- "FizzBuzz"
    } else if(i %% 3 == 0) {
      z[i] <- "Fizz"
    } else if(i %% 5 == 0) {
      z[i] <- "Buzz"
    } else {
      z[i] <- i
    }
  }
  z
}

mark(f2(n), f6(n))
Warning: Some expressions had a GC in every iteration; so filtering is
disabled.
# A tibble: 2 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 f2(n)         125ms    129ms      7.66    1.25MB     28.7
2 f6(n)         115ms    122ms      8.30    1.22MB     29.9

But we can remove an additional modulo operation and equality test at each iteration by assigning i %% 3 == 0 and i %% 5 == 0 to variables:

f7 <- function(n) {
  z <- integer(n)
  for(i in 1:n) {
    div3 <- (i %% 3 == 0)
    div5 <- (i %% 5 == 0)
    if(div3 && div5) {
      z[i] <- "FizzBuzz"
    } else if(div3) {
      z[i] <- "Fizz"
    } else if(div5) {
      z[i] <- "Buzz"
    } else {
      z[i] <- i
    }
  }
  z
}

Now we only have two modulo operations and equality tests per iteration. Any further gain is small, however, and can get lost in the run-to-run variability of the benchmark:

mark(f6(n), f7(n))
Warning: Some expressions had a GC in every iteration; so filtering is
disabled.
# A tibble: 2 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 f6(n)         118ms    121ms      8.26    1.15MB     33.0
2 f7(n)         122ms    206ms      5.49    1.22MB     18.3

Example 3

We can assign 1:n to z instead of initializing it as a vector of zeros, which makes the assignment of i to z[i] in the last else statement unnecessary:

f8 <- function(n) {
  z <- 1:n
  for(i in z) {
    div3 <- (i %% 3 == 0)
    div5 <- (i %% 5 == 0)
    if(div3 && div5) {
      z[i] <- "FizzBuzz"
    } else if(div3) {
      z[i] <- "Fizz"
    } else if(div5) {
      z[i] <- "Buzz"
    } 
  }
  z
}

This function works:

f8(20)
 [1] "1"        "2"        "Fizz"     "4"        "Buzz"     "Fizz"    
 [7] "7"        "8"        "Fizz"     "Buzz"     "11"       "Fizz"    
[13] "13"       "14"       "FizzBuzz" "16"       "17"       "Fizz"    
[19] "19"       "Buzz"    

and we get a really good speedup here:

mark(f7(n), f8(n))
Warning: Some expressions had a GC in every iteration; so filtering is
disabled.
# A tibble: 2 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 f7(n)         102ms  116.7ms      7.84    1.15MB     25.5
2 f8(n)        62.4ms   98.6ms      9.78    1.15MB     26.1

Vectorize whenever possible

We can actually get rid of the loop and use a vectorized approach.

f9 <- function(n) {
  z <- 1:n
  div3 <- (z %% 3 == 0)
  div5 <- (z %% 5 == 0)
  z[div3] <- "Fizz"
  z[div5] <- "Buzz"
  z[(div3 & div5)] <- "FizzBuzz"
  z
}

This still gives us the same result:

f9(20)
 [1] "1"        "2"        "Fizz"     "4"        "Buzz"     "Fizz"    
 [7] "7"        "8"        "Fizz"     "Buzz"     "11"       "Fizz"    
[13] "13"       "14"       "FizzBuzz" "16"       "17"       "Fizz"    
[19] "19"       "Buzz"    
mark(f8(n), f9(n))
Warning: Some expressions had a GC in every iteration; so filtering is
disabled.
# A tibble: 2 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 f8(n)        64.5ms  103.5ms      9.87    1.15MB    25.7 
2 f9(n)        18.1ms   18.8ms     51.3     5.62MB     5.92

The speedup (a factor of 3.5 to 5.5 here, depending on whether you compare the minimum or the median times) shows how important it is to use vectorization whenever possible.
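
Note that there is often more than one way to vectorize. For instance (this variant is not part of the lesson and f_ifelse is a made-up name), the same series can be built with nested calls to ifelse(), which is also vectorized; benchmarking it against f9() will tell you which formulation is faster on your system:

f_ifelse <- function(n) {
  z <- 1:n
  ifelse(z %% 15 == 0, "FizzBuzz",
         ifelse(z %% 3 == 0, "Fizz",
                ifelse(z %% 5 == 0, "Buzz", z)))
}

mark(f9(n), f_ifelse(n))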

Replace costly operations where possible

Sometimes, it isn’t obvious which of several methods will be faster. Benchmarking alternative expressions will tell you.

For instance, it is much faster to index a column of a data frame by its name (e.g. dataframe$column1) than by its position with list indexing (e.g. dataframe[[1]]).
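
You can verify this kind of claim yourself with bench. Here is a minimal sketch (the data frame and its size are made up for this example); since both expressions return the same vector, the default check of mark() passes:

df <- data.frame(column1 = runif(1e6))

mark(
  by_name     = df$column1,
  by_position = df[[1]]
)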

Sometimes, packages exist that are much more efficient than base R. In the case of data frames for example, there is data.table.
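
Here is a small sketch of the data.table syntax (this assumes the data.table package is installed, which is not otherwise needed in this lesson; the data and column names are made up for this example):

library(data.table)

dt <- data.table(x = runif(1e6), g = sample(letters[1:3], 1e6, replace = TRUE))
dt[, mean(x), by = g]   # grouped mean with the dt[i, j, by] syntax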

Conclusion

Starting from our first function f1(), we have gained a speedup of roughly 16 (based on the median times below), simply by writing better code, without any parallelization or additional hardware:

mark(f1(n), f9(n))
Warning: Some expressions had a GC in every iteration; so filtering is
disabled.
# A tibble: 2 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 f1(n)       312.2ms  312.8ms      3.20   16.65MB    14.4 
2 f9(n)        18.2ms   18.9ms     43.5     5.57MB     3.95

If we used a silly function such as f5() as our starting point, the speedup would be closer to a factor of 90.