The RAPIDS cuDF ecosystem
What is RAPIDS?
RAPIDS for dataframes
Comparison
| | cuDF | pandas cuDF | Polars cuDF |
|---|---|---|---|
| Documentation | | https://docs.rapids.ai/api/cudf/stable/cudf_pandas/ | |
| API | Similar to pandas, but some differences | Exactly the same as pandas | Exactly the same as Polars lazy API |
| Performance | Very good: the code fully runs on GPU | Good: some of the code runs on GPU, but there are costly transfers between CPU and GPU because the code that can’t run on GPU runs on CPU | The best: lazy execution + the code fully runs on GPU |
| Installation | Install cuDF | Install cuDF (it is a cuDF module) | Install Polars with GPU engine extra |
Which one to use?
On CPU:
Unless you are stuck with existing code, pipelines, and workflows that you cannot change, you should use Polars instead of pandas, and you should use the lazy API whenever you work with large dataframes.
On GPU:
Similarly, unless you have no other option, you should use Polars with the GPU engine when you move to GPUs. It gives you the best of both worlds: the advantages of lazy execution (running out-of-core, better performance, reduced memory footprint) and code that runs fully on GPU. If you are already using Polars with the lazy API on CPU, the code is virtually the same: you just pass engine="gpu" to collect, and you can customize the GPU engine for finer control if you want.
pandas cuDF will speed up your pandas code to some extent and at no coding cost, but it is the least performant option: the pandas code that can’t run on GPU runs on CPU, and copying the data back and forth between host and device is costly.
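To illustrate what "no coding cost" means, here is a minimal sketch in which the pandas code itself is unchanged and the accelerator is enabled before pandas is imported. The `try`/`except ImportError` guard and the `accelerated` flag are assumptions of this sketch, added so that the script still runs with plain pandas where cuDF is not installed. In a notebook, `%load_ext cudf.pandas` plays the same role, as does running a script with `python -m cudf.pandas script.py`.

```python
try:
    # Must happen before pandas is imported for the acceleration to apply.
    import cudf.pandas

    cudf.pandas.install()
    accelerated = True
except ImportError:
    # cuDF is not installed: the code below runs with regular pandas on CPU.
    accelerated = False

import pandas as pd

# Ordinary pandas code on hypothetical toy data, with no cuDF-specific calls.
df = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})
totals = df.groupby("key")["value"].sum()

print("accelerated:", accelerated)
print(totals.to_dict())
```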
cuDF runs the code fully on GPU and is a much better option, but it requires learning a new API: it supports only a subset of pandas functionality and differs from the pandas API in places.
In this course, we will focus on the best option: Polars with the GPU engine.