The RAPIDS cuDF ecosystem
What is RAPIDS?
CUDA needed
RAPIDS for dataframes
Comparison
| | cuDF | pandas cuDF | Polars cuDF |
|---|---|---|---|
| Documentation | Link | | |
| API | Similar to pandas, with some differences | Exactly the same as pandas | Exactly the same as the Polars lazy API |
| Performance | Very good: operations run fully on GPU | Good: automatic fallbacks to CPU for unsupported operations can lead to costly transfers between CPU and GPU | The best: lazy execution and operations run fully on GPU |
| Installation | Install cudf | Install cudf (cudf.pandas is a module) | Install polars with the GPU engine extra (polars[gpu]) |
Which one to use?
On CPU:
Unless you are stuck with existing code, pipelines, and workflows that you cannot change, you should use Polars instead of pandas and you should use the lazy API whenever you have large dataframes.
On GPU:
Similarly, unless you have no other option, when you turn to GPUs, you should use Polars with the GPU engine. It gives you the best of both worlds: lazy execution and its advantages (running out-of-core, better performance, reduced memory footprint) as well as code that runs fully on GPU. If you are already using Polars with the lazy API on CPU, the code is virtually the same: you just need to pass engine="gpu" to collect, and you can customize the GPU engine for finer control if you want.
pandas cuDF will speed up your pandas code to some extent and at no coding cost, but it is the least performant option since the pandas code that can’t be run on GPU will be run on CPU. Additionally, copying the data back and forth between host and device is costly.
cuDF runs the code fully on GPU and is a much better option, but it requires learning a new API since it only supports a subset of pandas commands and differs from the pandas API in places.
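To illustrate what "no coding cost" means, here is a minimal sketch of enabling the accelerator from within a script. The data is made up, and the `try`/`except` fallback is an addition so that the snippet still runs where cudf is not installed; it is not part of the usual usage.

```python
# The accelerator has to be enabled before pandas is imported.
try:
    import cudf.pandas

    cudf.pandas.install()
    accelerated = True
except Exception:
    # Assumption: cudf is not installed or no GPU is available,
    # so plain pandas on CPU is used instead.
    accelerated = False

import pandas as pd

# From here on, this is ordinary, unmodified pandas code.
df = pd.DataFrame(
    {
        "group": ["a", "b", "a", "b"],
        "value": [1, 2, 3, 4],
    }
)
totals = df.groupby("group")["value"].sum()

print(f"GPU accelerator enabled: {accelerated}")
print(totals)
```

The same effect can be obtained without touching the code at all, by running a script with `python -m cudf.pandas script.py` or, in a Jupyter notebook, by loading the extension with `%load_ext cudf.pandas` before importing pandas.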
In this course, we will focus on the best option: Polars with the GPU engine.