The RAPIDS cuDF ecosystem

Author

Marie-Hélène Burle

What is RAPIDS?

CUDA needed

RAPIDS for dataframes

RAPIDS cuDF

Comparison

| | cuDF | pandas cuDF | Polars cuDF |
|---|---|---|---|
| Documentation | Link | | |
| API | Similar to pandas, with some differences | Exactly the same as pandas | Exactly the same as Polars lazy API |
| Performance | Very good: operations fully run on GPU | Good: automatic fallbacks to CPU for unsupported operations can lead to costly transfers between CPU and GPU | The best: lazy execution + operations fully run on GPU |
| Installation | Install cudf | Install cudf (cudf.pandas is a module) | Install polars with GPU engine extra (polars[gpu]) |

Which one to use?

On CPU:

Unless you are tied to existing code, pipelines, or workflows that you cannot change, use Polars instead of pandas, and use its lazy API whenever you work with large dataframes.

On GPU:

Similarly, when you move to GPUs, use Polars with the GPU engine unless you have no choice in the matter. It gives you the best of both worlds: the advantages of lazy execution (out-of-core processing, better performance, lower memory use) and code that runs fully on GPU. If you already use the Polars lazy API on CPU, the code is virtually the same: you only need to pass engine="gpu" to collect, and you can customize the GPU engine for finer control if you want.

pandas cuDF (the cudf.pandas module) will speed up your pandas code to some extent at no coding cost, but it is the least performant option: pandas code that cannot run on GPU falls back to the CPU, and copying data back and forth between host and device is costly.
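The "no coding cost" claim can be illustrated with the following sketch: the pandas code itself is unchanged, and the accelerator is switched on before pandas is imported. The data and column names are hypothetical, and the try/except is an assumption added so that the snippet still runs as plain pandas where cudf is not installed.

```python
try:
    # Must run before pandas is imported so that pandas calls are proxied to cuDF.
    import cudf.pandas
    cudf.pandas.install()
    accelerated = True
except ImportError:
    # Assumption for this sketch: without cudf, continue with regular pandas on CPU.
    accelerated = False

import pandas as pd

# Ordinary pandas code, unchanged whether or not the accelerator is active.
df = pd.DataFrame(
    {"species": ["a", "b", "a", "b"], "mass": [1.0, 2.0, 3.0, 4.0]}
)
totals = df.groupby("species")["mass"].sum()

print("cudf.pandas active:", accelerated)
print(totals)
```

The same effect can be obtained without touching the script by running it with python -m cudf.pandas, or with %load_ext cudf.pandas in a notebook.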

cuDF runs the code fully on GPU and is a much better option, but it requires learning a new API: it only supports a subset of pandas commands and differs from the pandas API in places.
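To give a feel for what "similar, with some differences" means, here is a small sketch that uses cuDF directly. The data and names are hypothetical, and the fallback to pandas under the alias xdf is an assumption made only so that the snippet runs without cuDF installed; it works here because this particular code uses calls that exist in both libraries. One documented difference is visible: cuDF does not guarantee the order of groupby results by default, so the result is sorted explicitly.

```python
try:
    import cudf as xdf  # GPU dataframes; requires a CUDA-capable GPU
except ImportError:
    import pandas as xdf  # CPU stand-in for this sketch only

df = xdf.DataFrame(
    {"species": ["a", "b", "a", "b"], "mass": [1.0, 2.0, 3.0, 4.0]}
)

# Unlike pandas, cuDF does not guarantee sorted group keys by default,
# so sort explicitly when the order matters.
totals = df.groupby("species")["mass"].sum().sort_index()

# to_numpy copies the values back to host memory when running on GPU.
totals_on_host = totals.to_numpy().tolist()
print(totals_on_host)
```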

In this course, we will focus on the best option: Polars with the GPU engine.