Memory management

Author

Marie-Hélène Burle

Memory can be a limiting factor, and releasing it when it is no longer needed can be critical to avoid out-of-memory states. On the other hand, memoisation is an optimization technique which consists of caching the results of heavy computations for re-use.

Memory and speed are thus linked in a trade-off.

Releasing memory

It is best to avoid creating very large intermediate objects that take space in memory unnecessarily.

  • One option is to use nested functions or functions chained with pipes.

  • Another option is to create the intermediate objects within the local environment of a function: they go out of scope as soon as the function has finished running, after which their memory can be reclaimed automatically.

Let’s go over a basic example: let’s extract the sepal width variable from the iris dataset (one of the datasets that come packaged with R), take the natural logarithm of the values, and round them to one decimal place.

First, let’s delete all objects inside our environment to make our little test as clean as possible:

rm(list=ls())
ls()
character(0)

Now, we could perform our task this way:

sepalwidth <- iris$Sepal.Width
sepalwidth_ln <- log(sepalwidth)
round(sepalwidth_ln, 1)
  [1] 1.3 1.1 1.2 1.1 1.3 1.4 1.2 1.2 1.1 1.1 1.3 1.2 1.1 1.1 1.4 1.5 1.4 1.3
 [19] 1.3 1.3 1.2 1.3 1.3 1.2 1.2 1.1 1.2 1.3 1.2 1.2 1.1 1.2 1.4 1.4 1.1 1.2
 [37] 1.3 1.3 1.1 1.2 1.3 0.8 1.2 1.3 1.3 1.1 1.3 1.2 1.3 1.2 1.2 1.2 1.1 0.8
 [55] 1.0 1.0 1.2 0.9 1.1 1.0 0.7 1.1 0.8 1.1 1.1 1.1 1.1 1.0 0.8 0.9 1.2 1.0
 [73] 0.9 1.0 1.1 1.1 1.0 1.1 1.1 1.0 0.9 0.9 1.0 1.0 1.1 1.2 1.1 0.8 1.1 0.9
 [91] 1.0 1.1 1.0 0.8 1.0 1.1 1.1 1.1 0.9 1.0 1.2 1.0 1.1 1.1 1.1 1.1 0.9 1.1
[109] 0.9 1.3 1.2 1.0 1.1 0.9 1.0 1.2 1.1 1.3 1.0 0.8 1.2 1.0 1.0 1.0 1.2 1.2
[127] 1.0 1.1 1.0 1.1 1.0 1.3 1.0 1.0 1.0 1.1 1.2 1.1 1.1 1.1 1.1 1.1 1.0 1.2
[145] 1.2 1.1 0.9 1.1 1.2 1.1

But this creates the unnecessary intermediate variables sepalwidth and sepalwidth_ln which get stored in memory:

ls()
[1] "sepalwidth"    "sepalwidth_ln"

For very large objects, this is not ideal.

Let’s clear objects in our environment again:

rm(list=ls())
ls()
character(0)

A better option is to use nested functions:

round(log(iris$Sepal.Width), 1)
  [1] 1.3 1.1 1.2 1.1 1.3 1.4 1.2 1.2 1.1 1.1 1.3 1.2 1.1 1.1 1.4 1.5 1.4 1.3
 [19] 1.3 1.3 1.2 1.3 1.3 1.2 1.2 1.1 1.2 1.3 1.2 1.2 1.1 1.2 1.4 1.4 1.1 1.2
 [37] 1.3 1.3 1.1 1.2 1.3 0.8 1.2 1.3 1.3 1.1 1.3 1.2 1.3 1.2 1.2 1.2 1.1 0.8
 [55] 1.0 1.0 1.2 0.9 1.1 1.0 0.7 1.1 0.8 1.1 1.1 1.1 1.1 1.0 0.8 0.9 1.2 1.0
 [73] 0.9 1.0 1.1 1.1 1.0 1.1 1.1 1.0 0.9 0.9 1.0 1.0 1.1 1.2 1.1 0.8 1.1 0.9
 [91] 1.0 1.1 1.0 0.8 1.0 1.1 1.1 1.1 0.9 1.0 1.2 1.0 1.1 1.1 1.1 1.1 0.9 1.1
[109] 0.9 1.3 1.2 1.0 1.1 0.9 1.0 1.2 1.1 1.3 1.0 0.8 1.2 1.0 1.0 1.0 1.2 1.2
[127] 1.0 1.1 1.0 1.1 1.0 1.3 1.0 1.0 1.0 1.1 1.2 1.1 1.1 1.1 1.1 1.1 1.0 1.2
[145] 1.2 1.1 0.9 1.1 1.2 1.1

An equivalent option is to chain functions:

iris$Sepal.Width |> log() |> round(1)
  [1] 1.3 1.1 1.2 1.1 1.3 1.4 1.2 1.2 1.1 1.1 1.3 1.2 1.1 1.1 1.4 1.5 1.4 1.3
 [19] 1.3 1.3 1.2 1.3 1.3 1.2 1.2 1.1 1.2 1.3 1.2 1.2 1.1 1.2 1.4 1.4 1.1 1.2
 [37] 1.3 1.3 1.1 1.2 1.3 0.8 1.2 1.3 1.3 1.1 1.3 1.2 1.3 1.2 1.2 1.2 1.1 0.8
 [55] 1.0 1.0 1.2 0.9 1.1 1.0 0.7 1.1 0.8 1.1 1.1 1.1 1.1 1.0 0.8 0.9 1.2 1.0
 [73] 0.9 1.0 1.1 1.1 1.0 1.1 1.1 1.0 0.9 0.9 1.0 1.0 1.1 1.2 1.1 0.8 1.1 0.9
 [91] 1.0 1.1 1.0 0.8 1.0 1.1 1.1 1.1 0.9 1.0 1.2 1.0 1.1 1.1 1.1 1.1 0.9 1.1
[109] 0.9 1.3 1.2 1.0 1.1 0.9 1.0 1.2 1.1 1.3 1.0 0.8 1.2 1.0 1.0 1.0 1.2 1.2
[127] 1.0 1.1 1.0 1.1 1.0 1.3 1.0 1.0 1.0 1.1 1.2 1.1 1.1 1.1 1.1 1.1 1.0 1.2
[145] 1.2 1.1 0.9 1.1 1.2 1.1

Another option is to create the intermediate variables in the local environment of a function:

get_sepalwidth <- function(dataset) {
  sepalwidth <- dataset$Sepal.Width
  sepalwidth_ln <- log(sepalwidth)
  round(sepalwidth_ln, 1)
}

get_sepalwidth(iris)
  [1] 1.3 1.1 1.2 1.1 1.3 1.4 1.2 1.2 1.1 1.1 1.3 1.2 1.1 1.1 1.4 1.5 1.4 1.3
 [19] 1.3 1.3 1.2 1.3 1.3 1.2 1.2 1.1 1.2 1.3 1.2 1.2 1.1 1.2 1.4 1.4 1.1 1.2
 [37] 1.3 1.3 1.1 1.2 1.3 0.8 1.2 1.3 1.3 1.1 1.3 1.2 1.3 1.2 1.2 1.2 1.1 0.8
 [55] 1.0 1.0 1.2 0.9 1.1 1.0 0.7 1.1 0.8 1.1 1.1 1.1 1.1 1.0 0.8 0.9 1.2 1.0
 [73] 0.9 1.0 1.1 1.1 1.0 1.1 1.1 1.0 0.9 0.9 1.0 1.0 1.1 1.2 1.1 0.8 1.1 0.9
 [91] 1.0 1.1 1.0 0.8 1.0 1.1 1.1 1.1 0.9 1.0 1.2 1.0 1.1 1.1 1.1 1.1 0.9 1.1
[109] 0.9 1.3 1.2 1.0 1.1 0.9 1.0 1.2 1.1 1.3 1.0 0.8 1.2 1.0 1.0 1.0 1.2 1.2
[127] 1.0 1.1 1.0 1.1 1.0 1.3 1.0 1.0 1.0 1.1 1.2 1.1 1.1 1.1 1.1 1.1 1.0 1.2
[145] 1.2 1.1 0.9 1.1 1.2 1.1

None of these options left intermediate variables in our environment:

ls()
[1] "get_sepalwidth"

Note that in the case of a very large function, it might still be beneficial to run rm() inside the function to clear the memory for other processes coming next within that function. But this is a pretty rare case.
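As a rough sketch of that rare case, the function below removes a large intermediate vector before moving on to later steps. The name summarise_big and the vector size are assumptions chosen purely for illustration:

```r
# Hypothetical example: free a large intermediate inside a function
summarise_big <- function(n = 1e6) {
  big <- rnorm(n)          # large intermediate object
  big_mean <- mean(big)    # the only value we need from it
  rm(big)                  # drop the name so the memory can be reclaimed
  # ... further memory-hungry steps of the function would run here ...
  round(big_mean, 1)
}

summarise_big()
```

Without the rm() call, big would stay alive until the function returns, even though only big_mean is needed for the remaining steps.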

If you really have to create large intermediate objects in the global environment, make sure to delete them as soon as you don’t need them anymore (e.g. rm(sepalwidth, sepalwidth_ln)).

rm() deletes the names of variables (the pointers to objects in memory), not the objects themselves. Once no name points to an object anymore, that object becomes eligible for garbage collection, and the garbage collector releases the memory it used the next time it runs.
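This can be observed with a small experiment. The sketch below uses a throwaway vector x (a name picked for this illustration) and calls gc() by hand only to make the effect visible. R normally triggers garbage collection on its own, so a manual call is rarely needed:

```r
x <- numeric(1e7)                     # ten million doubles, roughly 80 million bytes
print(object.size(x), units = "MB")   # size of the object bound to `x`

rm(x)                                 # delete the name
exists("x", inherits = FALSE)         # FALSE: the name is gone

invisible(gc())                       # optional: run the garbage collector now
```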

Caching

Memoisation is a technique by which some results are cached to avoid re-calculating them. This is convenient in a variety of settings (e.g. to reduce calls to an API, to avoid repeating heavy computations). In particular, it improves the efficiency of recursive function calls dramatically.

Let’s consider the calculation of the Fibonacci numbers as an example. Those numbers form a sequence starting with 0 and 1¹, after which each number is the sum of the previous two.

¹ Alternative versions have the sequence start with 1, 1 or with 1, 2.

² There are more efficient ways to calculate the Fibonacci numbers, but this inefficient function is a great example to show the advantage of memoisation.

Here is a function that would return the nth Fibonacci number²:

fib <- function(n) {
  if(n == 0) {
    return(0)
  } else if(n == 1) {
    return(1)
  } else {
    Recall(n - 1) + Recall(n - 2)
  }
}

It can be written more tersely as:

fib <- function(n) {
  if(n == 0) return(0)
  if(n == 1) return(1)
  Recall(n - 1) + Recall(n - 2)
}

Recall() is a placeholder for the name of the recursive function. We could have used fib() instead, but Recall() is more robust as it allows for function renaming.
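A small experiment illustrates the difference. The function names below are hypothetical and exist only for this comparison: one version calls itself by name, the other uses Recall(). After both are renamed and the original names deleted, only the Recall() version still works:

```r
# Version that refers to itself by name
fib_by_name <- function(n) {
  if (n < 2) return(n)
  fib_by_name(n - 1) + fib_by_name(n - 2)
}

# Version that uses Recall()
fib_recall <- function(n) {
  if (n < 2) return(n)
  Recall(n - 1) + Recall(n - 2)
}

# Rename both functions and delete the original names
renamed_by_name <- fib_by_name
renamed_recall  <- fib_recall
rm(fib_by_name, fib_recall)

renamed_recall(10)                       # 55
try(renamed_by_name(10), silent = TRUE)  # fails: fib_by_name() no longer exists
```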

Memoisation is very useful here because, to calculate each Fibonacci number, we need the two preceding Fibonacci numbers, and to calculate each of those we need the two numbers preceding it, and so on. The number of calls grows exponentially with n. With a cache, each Fibonacci number only has to be calculated once, provided that the recursive calls themselves go through the memoised function.
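To make the mechanism concrete, here is a minimal hand-rolled sketch of memoisation that uses an environment as the cache. The names cache, calls and fib_cached are assumptions made for this illustration, and this is not how any particular package implements caching:

```r
cache <- new.env()   # stores results, keyed by the value of n
calls <- 0           # counts how many values are actually computed

fib_cached <- function(n) {
  key <- as.character(n)
  # Cache hit: return the stored result without recomputing
  if (exists(key, envir = cache, inherits = FALSE)) {
    return(get(key, envir = cache))
  }
  calls <<- calls + 1
  value <- if (n < 2) n else fib_cached(n - 1) + fib_cached(n - 2)
  assign(key, value, envir = cache)   # store the result for later calls
  value
}

fib_cached(30)   # 832040
calls            # 31: each of the values for n = 0, ..., 30 is computed once
```

The recursive calls go through fib_cached() itself, which is what lets them benefit from the cache.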

The packages R.cache and memoise both allow for memoisation with an incredibly simple syntax.

Applying the latter to our function gives us:

library(memoise)

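fibmem <- memoise(
  function(n) {
    if(n == 0) return(0)
    if(n == 1) return(1)
    # Call fibmem() by name here: Recall() would re-run the inner,
    # non-memoised function, so the recursive calls would bypass the cache
    fibmem(n - 1) + fibmem(n - 2)
  }
)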
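The effect of the cache is easiest to see with a deliberately slow function. In the sketch below, slow_square is a hypothetical stand-in for a heavy computation, and the example assumes that the memoise package is installed. is.memoised() and forget() are helper functions from that package:

```r
library(memoise)

# Hypothetical stand-in for a heavy computation
slow_square <- function(x) {
  Sys.sleep(0.5)
  x^2
}

fast_square <- memoise(slow_square)

is.memoised(fast_square)      # TRUE
system.time(fast_square(4))   # first call: runs the computation (about 0.5 s)
system.time(fast_square(4))   # second call: result served from the cache
forget(fast_square)           # empty the cache if the results become stale
```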

We can do some benchmarking to see the speedup for the 30th Fibonacci number:

library(bench)

n <- 30
mark(fib(n), fibmem(n))
Warning: Some expressions had a GC in every iteration; so filtering is
disabled.
# A tibble: 2 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 fib(n)        1.62s    1.62s     0.616    32.9KB     18.5
2 fibmem(n)   41.22µs  44.62µs 20807.       68.3KB     14.6

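The speedup is over 35,000-fold! Keep in mind that mark() runs each expression repeatedly, so the timings for fibmem(n) mostly reflect lookups of a result that is already in the cache.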
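The figure can be checked from the median timings reported in the table above. The variable names are chosen for this illustration:

```r
median_fib    <- 1.62        # seconds
median_fibmem <- 44.62e-6    # 44.62 microseconds, expressed in seconds

speedup <- median_fib / median_fibmem
speedup                      # about 36,300
```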