Optimizations

Author

Marie-Hélène Burle

Throwing a lot of hardware at poorly written code is not the answer. Before considering parallelization, you should think about ways to optimize your code sequentially.

Why?

  • not all code can be parallelized,
  • parallelization is costly (the overhead of parallelization itself, plus, if you use a supercomputer, waiting time to access an Alliance cluster or money spent on a commercial cloud),
  • the optimization of the sequential code will also benefit the parallel code.

In many cases, writing better code will save you more computing time than parallelization.

In this section, we will cover several principles by playing with the programmatic implementation of the fizz buzz game.

Toy example

Fizz buzz is a children's game to practice division. Players take turns counting out loud while replacing:

  • any number divisible by 3 with the word “Fizz”,
  • any number divisible by 5 with the word “Buzz”,
  • any number divisible by both 3 and 5 with the word “FizzBuzz”.

Let’s write functions that output the series from 1 to n following these rules and time them to draw general principles about code efficiency.

Setup

First of all, we need to load the necessary modules:

module load StdEnv/2020 gcc/11.3.0 r/4.3.1

Then we need to launch a job.

Interactive job

If there are few of us, we will each use an interactive session with a single CPU:

salloc --time=2:00:00 --mem-per-cpu=3500M

We can then launch R and load the benchmarking package we will use throughout this section:

library(bench)
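
The main function we will use from this package is mark(): it runs each expression passed to it multiple times, checks by default that all the expressions return the same result, and reports timings and memory allocations. A quick illustration (the expressions compared here are arbitrary):

x <- runif(1e4)
mark(mean(x), sum(x) / length(x))   # two equivalent ways to compute a mean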

Batch jobs

If there are more of us than there are CPUs in the cluster, we will run batch jobs. In this case:

  • Create an R script called optim.R with the code to run (you can reuse the same script for all the sections on this page by editing it; a minimal sketch follows this list). Don’t forget to load the package bench in your script.
  • Create a bash script called optim.sh with the following:
optim.sh
#!/bin/bash
#SBATCH --account=def-<your_account>
#SBATCH --time=15
#SBATCH --mem-per-cpu=3500M
#SBATCH --cpus-per-task=4
#SBATCH --job-name="<your_job>"
module load StdEnv/2020 gcc/11.3.0 r/4.3.1
Rscript optim.R
  • Run the jobs with:
sbatch optim.sh
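
As an illustration of the first step above, optim.R could look something like the following minimal sketch (the data and benchmarked expressions are placeholders to replace with the code of the subsection you are working on):

## optim.R: minimal sketch; replace the benchmark with the code of the relevant subsection
library(bench)                      # don't forget to load bench in the script

x <- runif(1e5)                     # placeholder data
res <- mark(sqrt(x), x^0.5)         # placeholder expressions to compare
print(res)                          # make sure the results appear in the Slurm output file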

Optimizations

Pre-allocate memory

To store the results of a loop, we need to create an object and assign to it the result of each iteration. In this first function, we create an empty object z of class integer and of length 0 for that purpose:

f1 <- function(n) {
  z <- integer()
  for(i in 1:n) {
    if(i %% 3 == 0 && i %% 5 == 0) {
      z[i] <- "FizzBuzz"
    } else if(i %% 3 == 0) {
      z[i] <- "Fizz"
    } else if(i %% 5 == 0) {
      z[i] <- "Buzz"
    } else {
      z[i] <- i
    }
  }
  z
}

The second function is similar, but this time, we initialize z with its final length. This means that we are pre-allocating memory for the full vector before we run the loop instead of growing the vector at each iteration:

f2 <- function(n) {
  z <- integer(n)
  for(i in 1:n) {
    if(i %% 3 == 0 && i %% 5 == 0) {
      z[i] <- "FizzBuzz"
    } else if(i %% 3 == 0) {
      z[i] <- "Fizz"
    } else if(i %% 5 == 0) {
      z[i] <- "Buzz"
    } else {
      z[i] <- i
    }
  }
  z
}

Let’s make sure that our functions work by testing them on a small number:

f1(20)
 [1] "1"        "2"        "Fizz"     "4"        "Buzz"     "Fizz"    
 [7] "7"        "8"        "Fizz"     "Buzz"     "11"       "Fizz"    
[13] "13"       "14"       "FizzBuzz" "16"       "17"       "Fizz"    
[19] "19"       "Buzz"    
f2(20)
 [1] "1"        "2"        "Fizz"     "4"        "Buzz"     "Fizz"    
 [7] "7"        "8"        "Fizz"     "Buzz"     "11"       "Fizz"    
[13] "13"       "14"       "FizzBuzz" "16"       "17"       "Fizz"    
[19] "19"       "Buzz"    

Now, let’s benchmark them for a large number:

n <- 1e5
mark(f1(n), f2(n))
Warning: Some expressions had a GC in every iteration; so filtering is
disabled.
# A tibble: 2 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 f1(n)         155ms    219ms      4.79   16.55MB     16.0
2 f2(n)         126ms    150ms      6.30    1.15MB     20.5

f2() is consistently faster, although only slightly so here. In many cases, the difference will be much greater.

Note also the large difference in memory allocation.

No, loops are not a big ‘no no’

By now, you might be thinking: “Wait… aren’t loops a big ‘no no’ in R? I’ve always been told that they are slow and that one should always use functional programming! We are talking about optimization in this course and we are using loops?!?”

There are a lot of misconceptions around R loops. They can be very slow if you don’t pre-allocate memory. Otherwise, they are generally just as fast as the functional alternatives (the apply() family or its tidyverse equivalent, the purrr::map() family). You can choose a functional programming approach for style and readability, but not for speed.

Let’s test it.

First we create a function:

f3 <- function(n) {
  if(n %% 3 == 0 && n %% 5 == 0) {
    "FizzBuzz"
  } else if(n %% 3 == 0) {
    "Fizz"
  } else if(n %% 5 == 0) {
    "Buzz"
  } else {
    n
  }
}

Then we pass it through sapply(). We can test that it works on a small number:

sapply(1:20, f3)
 [1] "1"        "2"        "Fizz"     "4"        "Buzz"     "Fizz"    
 [7] "7"        "8"        "Fizz"     "Buzz"     "11"       "Fizz"    
[13] "13"       "14"       "FizzBuzz" "16"       "17"       "Fizz"    
[19] "19"       "Buzz"    

Finally, we compare the timing with that of f2():

mark(f2(n), sapply(1:n, f3))
Warning: Some expressions had a GC in every iteration; so filtering is
disabled.
# A tibble: 2 × 6
  expression           min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 f2(n)              259ms    261ms      3.83    1.15MB     15.3
2 sapply(1:n, f3)    188ms    199ms      4.88    3.29MB     21.2

As you can see, the pre-allocated loop holds its own: the two approaches are comparable in speed (here the sapply() version even came out slightly ahead), so the choice between them is a matter of style rather than performance.

Avoid unnecessary operations

Example 1

Calling z as the last command in our function is the same as calling return(z).

From the R documentation:

If the end of a function is reached without calling return, the value of the last evaluated expression is returned.
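
In other words, adding an explicit return() to f2() changes nothing to the result (the variant below, f2_return(), is introduced here purely for illustration):

f2_return <- function(n) {
  z <- integer(n)
  for(i in 1:n) {
    if(i %% 3 == 0 && i %% 5 == 0) {
      z[i] <- "FizzBuzz"
    } else if(i %% 3 == 0) {
      z[i] <- "Fizz"
    } else if(i %% 5 == 0) {
      z[i] <- "Buzz"
    } else {
      z[i] <- i
    }
  }
  return(z)   # explicit return: same value as simply ending the function with z
}

identical(f2(100), f2_return(100))   # should be TRUE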

Now, what about using print() instead?

f4 <- function(n) {
  z <- integer(n)
  for(i in 1:n) {
    if(i %% 3 == 0 && i %% 5 == 0) {
      z[i] <- "FizzBuzz"
    } else if(i %% 3 == 0) {
      z[i] <- "Fizz"
    } else if(i %% 5 == 0) {
      z[i] <- "Buzz"
    } else {
      z[i] <- i
    }
  }
  print(z)
}

Let’s benchmark it against f2():

mark(f2(n), f4(n))
 [1] "1"        "2"        "Fizz"     "4"        "Buzz"     "Fizz"    
 [7] "7"        "8"        "Fizz"     "Buzz"     "11"       "Fizz"    
[13] "13"       "14"       "FizzBuzz" "16"       "17"       "Fizz"    
[19] "19"       "Buzz"     "Fizz"     "22"       "23"       "Fizz"    
[25] "Buzz"     "26"       "Fizz"     "28"       "29"       "FizzBuzz"
[31] "31"       "32"       "Fizz"     "34"       "Buzz"     "Fizz"    
[37] "37"       "38"       "Fizz"     "Buzz"     "41"       "Fizz"
...

  expression      min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr> <bch:tm> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 f2(n)         131ms  139ms      6.71        NA    21.8      4    13      596ms
2 f4(n)         405ms  411ms      2.43        NA     8.52     2     7      822ms
# ℹ 4 more variables: result <list>, memory <list>, time <list>, gc <list>
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled.

f4() is 3 times slower.

What happened?

print() returns its argument, but it additionally prints it to the standard output. This is why the mark() function printed the output of f4() before printing the timings.

As you can see, printing takes a long time.

If you are evaluating f2() on its own (e.g. f2(20)), the returned result will also be printed to standard output and the two functions behave the same. However, as soon as the function is used in any other context, printing becomes an unnecessary and costly operation, which makes f4() a poor choice.

Here is an example in which f4() would perform a totally unnecessary operation that f2() avoids:

a <- f2(20)

No unnecessary printing.

a <- f4(20)
 [1] "1"        "2"        "Fizz"     "4"        "Buzz"     "Fizz"    
 [7] "7"        "8"        "Fizz"     "Buzz"     "11"       "Fizz"    
[13] "13"       "14"       "FizzBuzz" "16"       "17"       "Fizz"    
[19] "19"       "Buzz"    

Unnecessary printing.

Even worse would be to use:

f5 <- function(n) {
  for(i in 1:n) {
    if(i %% 3 == 0 && i %% 5 == 0) {
      print("FizzBuzz")
    } else if(i %% 3 == 0) {
      print("Fizz")
    } else if(i %% 5 == 0) {
      print("Buzz")
    } else {
      print(i)
    }
  }
}

mark(f2(n), f5(n), check = F)
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
1 f2(n)       132.8ms 141.69ms     6.77         NA     25.4     4    15
2 f5(n)         1.65s    1.65s     0.606        NA     12.7     1    21
# ℹ 5 more variables: total_time <bch:tm>, result <list>, memory <list>,
#   time <list>, gc <list>
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled.

We have to disable the check here because the results of the two functions are not technically the same (f2() returns the full vector while f5() only prints the values and returns NULL). We don’t mind, because in both cases the series gets produced and that’s what we want.

Here the difference in timing is a factor of 12!

Example 2

One modulo operation and equality test can be removed by replacing i %% 3 == 0 && i %% 5 == 0 with i %% 15 == 0. We now have at most three modulo operations and equality tests per iteration instead of four. This gives us a little speedup:

f6 <- function(n) {
  z <- integer(n)
  for(i in 1:n) {
    if(i %% 15 == 0) {
      z[i] <- "FizzBuzz"
    } else if(i %% 3 == 0) {
      z[i] <- "Fizz"
    } else if(i %% 5 == 0) {
      z[i] <- "Buzz"
    } else {
      z[i] <- i
    }
  }
  z
}

mark(f2(n), f6(n))
Warning: Some expressions had a GC in every iteration; so filtering is
disabled.
# A tibble: 2 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 f2(n)         196ms    205ms      4.75    1.15MB     17.4
2 f6(n)         169ms    186ms      5.39    1.22MB     18.0

But we can remove an additional modulo operation and equality test at each iteration by assigning i %% 3 == 0 and i %% 5 == 0 to variables:

f7 <- function(n) {
  z <- integer(n)
  for(i in 1:n) {
    div3 <- (i %% 3 == 0)
    div5 <- (i %% 5 == 0)
    if(div3 && div5) {
      z[i] <- "FizzBuzz"
    } else if(div3) {
      z[i] <- "Fizz"
    } else if(div5) {
      z[i] <- "Buzz"
    } else {
      z[i] <- i
    }
  }
  z
}

Now we only have two modulo operations and equality tests per iteration and we get another little speedup:

mark(f6(n), f7(n))
Warning: Some expressions had a GC in every iteration; so filtering is
disabled.
# A tibble: 2 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 f6(n)         178ms    179ms      5.53    1.15MB     18.4
2 f7(n)         162ms    163ms      5.96    1.22MB     15.9

Example 3

We can assign 1:n to z instead of initializing it with integer(n), thus rendering the assignment of i to z[i] in the last else statement unnecessary:

f8 <- function(n) {
  z <- 1:n
  for(i in z) {
    div3 <- (i %% 3 == 0)
    div5 <- (i %% 5 == 0)
    if(div3 && div5) {
      z[i] <- "FizzBuzz"
    } else if(div3) {
      z[i] <- "Fizz"
    } else if(div5) {
      z[i] <- "Buzz"
    } 
  }
  z
}

This function works:

f8(20)
 [1] "1"        "2"        "Fizz"     "4"        "Buzz"     "Fizz"    
 [7] "7"        "8"        "Fizz"     "Buzz"     "11"       "Fizz"    
[13] "13"       "14"       "FizzBuzz" "16"       "17"       "Fizz"    
[19] "19"       "Buzz"    

and we get a really good speedup here:

mark(f7(n), f8(n))
Warning: Some expressions had a GC in every iteration; so filtering is
disabled.
# A tibble: 2 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 f7(n)       127.9ms  146.2ms      6.46    1.15MB     19.4
2 f8(n)        76.4ms   87.7ms     11.3     1.15MB     22.6

Vectorize whenever possible

We can actually get rid of the loop and use a vectorized approach.

f9 <- function(n) {
  z <- 1:n
  div3 <- (z %% 3 == 0)
  div5 <- (z %% 5 == 0)
  z[div3] <- "Fizz"
  z[div5] <- "Buzz"
  z[(div3 & div5)] <- "FizzBuzz"
  z
}

This still gives us the same result:

f9(20)
 [1] "1"        "2"        "Fizz"     "4"        "Buzz"     "Fizz"    
 [7] "7"        "8"        "Fizz"     "Buzz"     "11"       "Fizz"    
[13] "13"       "14"       "FizzBuzz" "16"       "17"       "Fizz"    
[19] "19"       "Buzz"    
mark(f8(n), f9(n))
Warning: Some expressions had a GC in every iteration; so filtering is
disabled.
# A tibble: 2 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 f8(n)          73ms     83ms      11.7    1.15MB    25.3 
2 f9(n)        23.4ms     44ms      24.6    5.62MB     1.89

The further speedup (roughly a factor of 2 to 3 here) shows how important it is to use vectorization whenever possible.
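
The same principle applies well beyond this toy example. As a quick, made-up illustration, compare a loop calling sqrt() on each element with a single vectorized call (the function name below is ours):

# Loop: one sqrt() call per element (with pre-allocated output)
sqrt_loop <- function(x) {
  z <- numeric(length(x))
  for(i in seq_along(x)) z[i] <- sqrt(x[i])
  z
}

x <- runif(1e5)
mark(sqrt_loop(x), sqrt(x))   # the vectorized call is much faster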

Replace costly operations where possible

Sometimes, it isn’t obvious that one method will be faster than another. Benchmarking alternative expressions can teach you which ones are faster.

For instance, it is much faster to index a column from a dataframe by its name (e.g. dataframe$column1) than by using list indexing (e.g. dataframe[[1]]).
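You can verify this kind of claim yourself with mark(). For instance, with a made-up data frame:

df <- data.frame(column1 = runif(1e6))   # made-up data frame for the test
mark(df$column1, df[[1]])                # the same column extracted in two ways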

Sometimes, packages exist that bring much more efficiency than can be achieved with base R. In the case of data frames, for example, there is data.table.
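
As a sketch (the data and column names are invented, and this assumes the data.table package is installed), a grouped summary written with data.table looks like this:

library(data.table)

dt <- data.table(group = sample(letters[1:3], 1e6, replace = TRUE),
                 value = runif(1e6))

dt[, mean(value), by = group]   # mean of value within each group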

Conclusion

Starting from our first function f1(), we have gained a speedup of roughly 4 to 7 (depending on whether you compare medians or minimum times), simply by writing better code and without using parallelization or additional hardware:

mark(f1(n), f9(n))
Warning: Some expressions had a GC in every iteration; so filtering is
disabled.
# A tibble: 2 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 f1(n)       174.4ms  195.6ms      4.79   16.55MB    19.1 
2 f9(n)        26.7ms   48.2ms     21.6     5.57MB     1.96

If we had used a really inefficient function such as f5() as our starting point, the speedup would be far greater still, well over an order of magnitude.