Parallelism: concepts
Once all sequential optimizations on the bottlenecks have been exhausted, it is time to consider whether parallelization makes sense.
This section covers important concepts that are necessary to understand before moving on to writing parallel code.
Embarrassingly parallel problems
Ideal cases for parallelization are embarrassingly parallel problems: problems that can be broken down into independent tasks with little or no effort, because the tasks do not need to communicate with each other.
Examples:
- Loops for which all iterations are independent of each other.
- Resampling (e.g. bootstrapping or cross-validation).
- Ensemble learning (e.g. random forests).
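To make this concrete, here is a minimal sketch (in base R) of an embarrassingly parallel computation: a simple bootstrap in which every replicate is computed independently of all the others.

```r
set.seed(123)
x <- rnorm(1000)

# 500 bootstrap estimates of the mean: no iteration depends on another,
# so sapply() could later be swapped for a parallel apply function
boot_means <- sapply(1:500, function(i) mean(sample(x, replace = TRUE)))
```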
Examples of problems which are not embarrassingly parallel:
- Loops for which the result of one iteration is needed for the next iteration.
- Recursive function calls.
- Problems that are inherently sequential.
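By contrast, here is a sketch of a loop that cannot be trivially parallelized: each iteration needs the result of the previous one (a simple autoregressive simulation).

```r
n <- 100
x <- numeric(n)
x[1] <- 1

for (i in 2:n) {
  # each value depends on x[i - 1], so the iterations must run in order
  x[i] <- 0.5 * x[i - 1] + rnorm(1)
}
```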
For problems that are not embarrassingly parallel, one solution is to improve speed by rewriting the critical sections in C++, as we will see at the end of this course.
Types of parallelism
There are various ways to run code in parallel and it is important to have a clear understanding of what each method entails.
Multi-threading
We talk about multi-threading when a single process (with its own memory) runs multiple threads.
The execution can happen in parallel (if each thread has access to its own CPU core) or concurrently, with several threads taking turns on the available cores.
Because all threads of a process share the same memory, multi-threading can lead to race conditions when several threads read and write the same data at the same time.
R was not built with multi-threading in mind. Many sites use the term “multi-threading” improperly when they actually mean “multi-processing”. Proper multi-threading cannot be achieved in R itself. A handful of packages (either very specialized or no longer under active development) bring multi-threading to R by using another language under the hood.
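As one illustration of the latter (our example, chosen here for concreteness), the data.table package parallelizes some of its internal operations with OpenMP threads in its compiled C code; from the R side you can only query or cap the number of threads it is allowed to use:

```r
library(data.table)

getDTthreads()   # number of threads data.table is currently allowed to use
setDTthreads(4)  # cap data.table at 4 threads for its internal operations
```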
Multi-processing in distributed memory
When processes involved in the execution of some code run on multiple nodes of a cluster, messages between them need to travel over the cluster interconnect. In that case, we talk about distributed memory.
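As a minimal sketch, the parallel package (shipped with R) can create such a set of processes across nodes: passing a vector of hostnames to makeCluster() launches one R worker per name and communicates with them over sockets. The hostnames below are hypothetical placeholders; on a real cluster they would come from the job scheduler.

```r
library(parallel)

# Hypothetical node names standing in for the nodes allocated to a job
hosts <- c("node1", "node1", "node2", "node2")

cl <- makeCluster(hosts)            # one R worker process per hostname
res <- parLapply(cl, 1:100, sqrt)   # tasks and results travel over the network
stopCluster(cl)                     # always shut the workers down
```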