Optimizations

Author

Marie-Hélène Burle

Throwing a lot of hardware at poorly written code is not the answer. Before considering parallelization, you should think about ways to optimize your code sequentially.

Why?

  • not all code can be parallelized,
  • parallelization is costly (the overhead of parallelization itself, plus, if you use a supercomputer, waiting time to access an Alliance cluster or money spent on a commercial cloud),
  • the optimization of the sequential code will also benefit the parallel code.

In many cases, writing better code will save you more computing time than parallelization.

In this section, we will cover several principles by playing with the programmatic implementation of the fizz buzz game.

Toy example

Fizz buzz is a children's game to practice division. Players take turns counting out loud while replacing:

  • any number divisible by 3 with the word “Fizz”,
  • any number divisible by 5 with the word “Buzz”,
  • any number divisible by both 3 and 5 with the word “FizzBuzz”.

Let’s write functions that output the series from 1 to n following these rules and time them to draw general principles about code efficiency.

Setup

First of all, we need to load the necessary modules:

module load StdEnv/2020 gcc/11.3.0 r/4.3.1

Then we need to launch a job.

Interactive job

If there are few of us, we will each use an interactive session with a single CPU:

salloc --time=2:00:00 --mem-per-cpu=3500M

We can then launch R and load the benchmarking package we will use throughout this section:

library(bench)
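
The main function we will use from this package is mark(): it runs each expression passed to it multiple times, checks by default that all the expressions return the same result, and reports timings and memory allocations. A quick illustration (the expressions compared here are arbitrary):

x <- runif(1e4)
mark(mean(x), sum(x) / length(x))   # two equivalent ways to compute a mean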

Batch jobs

If there are more of us than there are CPUs in the cluster, we will run batch jobs. In this case:

  • Create an R script called optim.R with the code to run (you can reuse the same script for all the sections on this page by editing it; a minimal sketch follows this list). Don’t forget to load the package bench in your script.
  • Create a bash script called optim.sh with the following:
optim.sh
#!/bin/bash
#SBATCH --account=def-<your_account>
#SBATCH --time=15
#SBATCH --mem-per-cpu=3500M
#SBATCH --cpus-per-task=4
#SBATCH --job-name="<your_job>"
module load StdEnv/2020 gcc/11.3.0 r/4.3.1
Rscript optim.R
  • Run the jobs with:
sbatch optim.sh
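
As an illustration of the first step above, optim.R could look something like the following minimal sketch (the data and benchmarked expressions are placeholders to replace with the code of the subsection you are working on):

## optim.R: minimal sketch; replace the benchmark with the code of the relevant subsection
library(bench)                      # don't forget to load bench in the script

x <- runif(1e5)                     # placeholder data
res <- mark(sqrt(x), x^0.5)         # placeholder expressions to compare
print(res)                          # make sure the results appear in the Slurm output file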

Optimizations

Pre-allocate memory

To store the results of a loop, we need to create an object and assign to it the result of each iteration. In this first function, we create an empty object z of class integer and of length 0 for that purpose:

f1 <- function(n) {
  z <- integer()
  for(i in 1:n) {
    if(i %% 3 == 0 && i %% 5 == 0) {
      z[i] <- "FizzBuzz"
    } else if(i %% 3 == 0) {
      z[i] <- "Fizz"
    } else if(i %% 5 == 0) {
      z[i] <- "Buzz"
    } else {
      z[i] <- i
    }
  }
  z
}

The second function is similar, but this time, we initialize z with its final length. This means that we are pre-allocating memory for the full vector before we run the loop instead of growing the vector at each iteration:

f2 <- function(n) {
  z <- integer(n)
  for(i in 1:n) {
    if(i %% 3 == 0 && i %% 5 == 0) {
      z[i] <- "FizzBuzz"
    } else if(i %% 3 == 0) {
      z[i] <- "Fizz"
    } else if(i %% 5 == 0) {
      z[i] <- "Buzz"
    } else {
      z[i] <- i
    }
  }
  z
}

Let’s make sure that our functions work by testing them on a small number:

f1(20)
 [1] "1"        "2"        "Fizz"     "4"        "Buzz"     "Fizz"    
 [7] "7"        "8"        "Fizz"     "Buzz"     "11"       "Fizz"    
[13] "13"       "14"       "FizzBuzz" "16"       "17"       "Fizz"    
[19] "19"       "Buzz"    
f2(20)
 [1] "1"        "2"        "Fizz"     "4"        "Buzz"     "Fizz"    
 [7] "7"        "8"        "Fizz"     "Buzz"     "11"       "Fizz"    
[13] "13"       "14"       "FizzBuzz" "16"       "17"       "Fizz"    
[19] "19"       "Buzz"    

Now, let’s benchmark them for a large number:

n <- 1e5
mark(f1(n), f2(n))
Warning: Some expressions had a GC in every iteration; so filtering is
disabled.
# A tibble: 2 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 f1(n)         155ms    219ms      4.79   16.55MB     16.0
2 f2(n)         126ms    150ms      6.30    1.15MB     20.5

f2() is consistently faster, although only slightly so here. In many cases, the difference will be much greater.

Note also the large difference in memory allocation.

No, loops are not a big ‘no no’

By now, you might be thinking: “Wait… aren’t loops a big ‘no no’ in R? I’ve always been told that they are slow and that one should always use functional programming! We are talking about optimization in this course and we are using loops?!?”

There are a lot of misconceptions around R loops. They can be very slow if you don’t pre-allocate memory. Otherwise, they are generally just as fast as the functional alternatives (the apply() family or its tidyverse equivalent, the purrr::map() family). You can choose a functional programming approach for style and readability, but not for speed.

Let’s test it.

First we create a function:

f3 <- function(n) {
  if(n %% 3 == 0 && n %% 5 == 0) {
    "FizzBuzz"
  } else if(n %% 3 == 0) {
    "Fizz"
  } else if(n %% 5 == 0) {
    "Buzz"
  } else {
    n
  }
}

Then we pass it through sapply(). We can test that it works on a small number:

sapply(1:20, f3)
 [1] "1"        "2"        "Fizz"     "4"        "Buzz"     "Fizz"    
 [7] "7"        "8"        "Fizz"     "Buzz"     "11"       "Fizz"    
[13] "13"       "14"       "FizzBuzz" "16"       "17"       "Fizz"    
[19] "19"       "Buzz"    

Finally, we compare the timing with that of f2():

mark(f2(n), sapply(1:n, f3))
Warning: Some expressions had a GC in every iteration; so filtering is
disabled.
# A tibble: 2 × 6
  expression           min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 f2(n)              259ms    261ms      3.83    1.15MB     15.3
2 sapply(1:n, f3)    188ms    199ms      4.88    3.29MB     21.2

As you can see, the pre-allocated loop holds its own: the two approaches are comparable in speed (here the sapply() version even came out slightly ahead), so the choice between them is a matter of style rather than performance.

Avoid unnecessary operations

Example 1

Calling z as the last command in our function is the same as calling return(z).

From the R documentation:

If the end of a function is reached without calling return, the value of the last evaluated expression is returned.
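
In other words, adding an explicit return() to f2() changes nothing to the result (the variant below, f2_return(), is introduced here purely for illustration):

f2_return <- function(n) {
  z <- integer(n)
  for(i in 1:n) {
    if(i %% 3 == 0 && i %% 5 == 0) {
      z[i] <- "FizzBuzz"
    } else if(i %% 3 == 0) {
      z[i] <- "Fizz"
    } else if(i %% 5 == 0) {
      z[i] <- "Buzz"
    } else {
      z[i] <- i
    }
  }
  return(z)   # explicit return: same value as simply ending the function with z
}

identical(f2(100), f2_return(100))   # should be TRUE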

Now, what about using print() instead?

f4 <- function(n) {
  z <- integer(n)
  for(i in 1:n) {
    if(i %% 3 == 0 && i %% 5 == 0) {
      z[i] <- "FizzBuzz"
    } else if(i %% 3 == 0) {
      z[i] <- "Fizz"
    } else if(i %% 5 == 0) {
      z[i] <- "Buzz"
    } else {
      z[i] <- i
    }
  }
  print(z)
}

Let’s benchmark it against f2():

mark(f2(n), f4(n))
 [1] "1"        "2"        "Fizz"     "4"        "Buzz"     "Fizz"    
 [7] "7"        "8"        "Fizz"     "Buzz"     "11"       "Fizz"    
[13] "13"       "14"       "FizzBuzz" "16"       "17"       "Fizz"    
[19] "19"       "Buzz"     "Fizz"     "22"       "23"       "Fizz"    
[25] "Buzz"     "26"       "Fizz"     "28"       "29"       "FizzBuzz"
[31] "31"       "32"       "Fizz"     "34"       "Buzz"     "Fizz"    
[37] "37"       "38"       "Fizz"     "Buzz"     "41"       "Fizz"
...

  expression      min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr> <bch:tm> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 f2(n)         131ms  139ms      6.71        NA    21.8      4    13      596ms
2 f4(n)         405ms  411ms      2.43        NA     8.52     2     7      822ms
# ℹ 4 more variables: result <list>, memory <list>, time <list>, gc <list>
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled.

f4() is 3 times slower.

What happened?

print() returns its argument, but it additionally prints it to the standard output. This is why the mark() function printed the output of f4() before printing the timings.

As you can see, printing takes a long time.

If you are evaluating f2() on its own (e.g. f2(20)), the returned result will also be printed to standard output and the two functions behave the same. However, as soon as the function is used in any other context, printing becomes an unnecessary and costly operation, which makes f4() a poor choice.

Here is an example in which f4() would perform a totally unnecessary operation that f2() avoids:

a <- f2(20)

No unnecessary printing.

a <- f4(20)
 [1] "1"        "2"        "Fizz"     "4"        "Buzz"     "Fizz"    
 [7] "7"        "8"        "Fizz"     "Buzz"     "11"       "Fizz"    
[13] "13"       "14"       "FizzBuzz" "16"       "17"       "Fizz"    
[19] "19"       "Buzz"    

Unnecessary printing.

Even worse would be to use:

f5 <- function(n) {
  for(i in 1:n) {
    if(i %% 3 == 0 && i %% 5 == 0) {
      print("FizzBuzz")
    } else if(i %% 3 == 0) {
      print("Fizz")
    } else if(i %% 5 == 0) {
      print("Buzz")
    } else {
      print(i)
    }
  }
}

mark(f2(n), f5(n), check = F)
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
1 f2(n)       132.8ms 141.69ms     6.77         NA     25.4     4    15
2 f5(n)         1.65s    1.65s     0.606        NA     12.7     1    21
# ℹ 5 more variables: total_time <bch:tm>, result <list>, memory <list>,
#   time <list>, gc <list>
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled.

We have to disable the check here because the results of the two functions are not technically the same (f2() returns the full vector while f5() only prints the values and returns NULL). We don’t mind, because in both cases the series gets produced and that’s what we want.

Here the difference in timing is a factor of 12!

Example 2

One modulo operation and equality test can be removed by replacing i %% 3 == 0 && i %% 5 == 0 with i %% 15 == 0. We now have at most three modulo operations and equality tests per iteration instead of four. This gives us a little speedup:

f6 <- function(n) {
  z <- integer(n)
  for(i in 1:n) {
    if(i %% 15 == 0) {
      z[i] <- "FizzBuzz"
    } else if(i %% 3 == 0) {
      z[i] <- "Fizz"
    } else if(i %% 5 == 0) {
      z[i] <- "Buzz"
    } else {
      z[i] <- i
    }
  }
  z
}

mark(f2(n), f6(n))
Warning: Some expressions had a GC in every iteration; so filtering is
disabled.
# A tibble: 2 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 f2(n)         196ms    205ms      4.75    1.15MB     17.4
2 f6(n)         169ms    186ms      5.39    1.22MB     18.0

But we can remove an additional modulo operation and equality test at each iteration by assigning i %% 3 == 0 and i %% 5 == 0 to variables:

f7 <- function(n) {
  z <- integer(n)
  for(i in 1:n) {
    div3 <- (i %% 3 == 0)
    div5 <- (i %% 5 == 0)
    if(div3 && div5) {
      z[i] <- "FizzBuzz"
    } else if(div3) {
      z[i] <- "Fizz"
    } else if(div5) {
      z[i] <- "Buzz"
    } else {
      z[i] <- i
    }
  }
  z
}

Now we only have two modulo operations and equality tests per iteration and we get another little speedup:

mark(f6(n), f7(n))
Warning: Some expressions had a GC in every iteration; so filtering is
disabled.
# A tibble: 2 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 f6(n)         178ms    179ms      5.53    1.15MB     18.4
2 f7(n)         162ms    163ms      5.96    1.22MB     15.9

Example 3

We can assign 1:n to z instead of initializing it with integer(n), thus rendering the assignment of i to z[i] in the last else statement unnecessary:

f8 <- function(n) {
  z <- 1:n
  for(i in z) {
    div3 <- (i %% 3 == 0)
    div5 <- (i %% 5 == 0)
    if(div3 && div5) {
      z[i] <- "FizzBuzz"
    } else if(div3) {
      z[i] <- "Fizz"
    } else if(div5) {
      z[i] <- "Buzz"
    } 
  }
  z
}

This function works:

f8(20)
 [1] "1"        "2"        "Fizz"     "4"        "Buzz"     "Fizz"    
 [7] "7"        "8"        "Fizz"     "Buzz"     "11"       "Fizz"    
[13] "13"       "14"       "FizzBuzz" "16"       "17"       "Fizz"    
[19] "19"       "Buzz"    

and we get a really good speedup here:

mark(f7(n), f8(n))
Warning: Some expressions had a GC in every iteration; so filtering is
disabled.
# A tibble: 2 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 f7(n)       127.9ms  146.2ms      6.46    1.15MB     19.4
2 f8(n)        76.4ms   87.7ms     11.3     1.15MB     22.6

Vectorize whenever possible

We can actually get rid of the loop and use a vectorized approach.

f9 <- function(n) {
  z <- 1:n
  div3 <- (z %% 3 == 0)
  div5 <- (z %% 5 == 0)
  z[div3] <- "Fizz"
  z[div5] <- "Buzz"
  z[(div3 & div5)] <- "FizzBuzz"
  z
}

This still gives us the same result:

f9(20)
 [1] "1"        "2"        "Fizz"     "4"        "Buzz"     "Fizz"    
 [7] "7"        "8"        "Fizz"     "Buzz"     "11"       "Fizz"    
[13] "13"       "14"       "FizzBuzz" "16"       "17"       "Fizz"    
[19] "19"       "Buzz"    
mark(f8(n), f9(n))
Warning: Some expressions had a GC in every iteration; so filtering is
disabled.
# A tibble: 2 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 f8(n)          73ms     83ms      11.7    1.15MB    25.3 
2 f9(n)        23.4ms     44ms      24.6    5.62MB     1.89

The further speedup (roughly a factor of 2 to 3 here) shows how important it is to use vectorization whenever possible.
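
The same principle applies well beyond this toy example. As a quick, made-up illustration, compare a loop calling sqrt() on each element with a single vectorized call (the function name below is ours):

# Loop: one sqrt() call per element (with pre-allocated output)
sqrt_loop <- function(x) {
  z <- numeric(length(x))
  for(i in seq_along(x)) z[i] <- sqrt(x[i])
  z
}

x <- runif(1e5)
mark(sqrt_loop(x), sqrt(x))   # the vectorized call is much faster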

Replace costly operations where possible

Sometimes, it isn’t obvious that one method will be faster than another. Benchmarking alternative expressions can teach you which ones are faster.

For instance, it is much faster to index a column from a dataframe by its name (e.g. dataframe$column1) than by using list indexing (e.g. dataframe[[1]]).
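You can verify this kind of claim yourself with mark(). For instance, with a made-up data frame:

df <- data.frame(column1 = runif(1e6))   # made-up data frame for the test
mark(df$column1, df[[1]])                # the same column extracted in two ways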

Sometimes, packages exist that bring much more efficiency than can be achieved with base R. In the case of data frames, for example, there is data.table.
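
As a sketch (the data and column names are invented, and this assumes the data.table package is installed), a grouped summary written with data.table looks like this:

library(data.table)

dt <- data.table(group = sample(letters[1:3], 1e6, replace = TRUE),
                 value = runif(1e6))

dt[, mean(value), by = group]   # mean of value within each group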

Conclusion

Starting from our first function f1(), we have gained a speedup of roughly 4 to 7 (depending on whether you compare medians or minimum times), simply by writing better code and without using parallelization or additional hardware:

mark(f1(n), f9(n))
Warning: Some expressions had a GC in every iteration; so filtering is
disabled.
# A tibble: 2 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 f1(n)       174.4ms  195.6ms      4.79   16.55MB    19.1 
2 f9(n)        26.7ms   48.2ms     21.6     5.57MB     1.96

If we had used a really inefficient function such as f5() as our starting point, the speedup would be far greater still, well over an order of magnitude.