Parallelism: concepts
Once all sequential optimizations have been exhausted, it is time to consider whether parallelization makes sense.
This section covers key concepts that you need to understand clearly before moving on to writing parallel code.
Embarrassingly parallel problems
Ideal cases for parallelization are embarrassingly parallel problems: problems that can be broken down into fully independent tasks with little or no effort.
Examples:
- Loops for which all iterations are independent of each other.
- Resampling (e.g. bootstrapping or cross-validation; see the sketch after this list).
- Ensemble learning (e.g. random forests).
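As an illustration, here is a minimal sketch of a bootstrap written as ordinary sequential R: every replicate depends only on the original sample, never on another replicate, so the loop could be handed to a parallel backend without any restructuring.

```r
set.seed(42)
x <- rnorm(100)   # some sample data

# Each replicate only needs x: no iteration depends on another,
# so all 1000 could in principle run at the same time
boot_means <- vapply(
  seq_len(1000),
  function(i) mean(sample(x, replace = TRUE)),
  numeric(1)
)
```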
Examples of problems which are not embarrassingly parallel:
- Loops for which the result of one iteration is needed for the next iteration (see the sketch after this list).
- Recursive function calls.
- Problems that are inherently sequential.
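For contrast, here is a minimal sketch of the first case above: each iteration reads the result of the previous one, so the iterations cannot run independently of each other.

```r
x <- numeric(10)
x[1] <- 1
for (i in 2:10) {
  # x[i] cannot be computed before x[i - 1] is known,
  # so the iterations have to run in order
  x[i] <- sqrt(x[i - 1] + i)
}
```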
For non-embarrassingly parallel problems, one solution is to use C++ to improve speed, as we will see at the end of this course.
Types of parallelism
There are various ways to run code in parallel, and it is important to have a clear understanding of what each method entails.
Multi-threading
We talk about multi-threading when a single process (with its own memory) runs multiple threads.
The execution can happen in parallel (if each thread has access to its own CPU core) or concurrently, with several threads taking turns on the available cores.
Because all threads of a process share the same memory, multi-threading can lead to race conditions when several threads read and write the same memory addresses.
Multi-threading does not seem to be a common approach to parallelizing R code.
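When multi-threading does appear in R, it is usually hidden inside packages implemented in compiled code rather than written in R itself. As a small illustration, data.table runs parts of its internals on multiple threads, and the thread count can be queried and capped:

```r
library(data.table)

getDTthreads()   # how many threads data.table will use
setDTthreads(2)  # cap it at 2 threads (an arbitrary choice)
```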
Multi-processing in distributed memory
When the processes involved in the execution of some code run on multiple nodes of a cluster, each process has its own memory, and messages between processes need to travel over the cluster interconnect. In that case, we talk about distributed memory.
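As a minimal sketch, the parallel package (shipped with base R) can launch worker processes on other machines; the hostnames below are hypothetical and would normally come from your cluster's scheduler.

```r
library(parallel)

# Hypothetical hostnames: replace with nodes you can actually reach
cl <- makeCluster(c("node1", "node2"), type = "PSOCK")

# Each worker squares its input; inputs and results
# travel over the interconnect as messages
clusterApply(cl, 1:4, function(i) i^2)

stopCluster(cl)
```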