The RAPIDS cuDF ecosystem

Author

Marie-Hélène Burle

What is RAPIDS?

RAPIDS for dataframes

Comparison

| | cuDF | pandas cuDF | Polars cuDF |
|---|---|---|---|
| Documentation | | https://docs.rapids.ai/api/cudf/stable/cudf_pandas/ | |
| API | Similar to pandas, but some differences | Exactly the same as pandas | Exactly the same as Polars lazy API |
| Performance | Very good: the code fully runs on GPU | Good: some of the code runs on GPU, but there are costly transfers between CPU and GPU because the code that can’t run on GPU runs on CPU | The best: lazy execution + the code fully runs on GPU |
| Installation | Install cuDF | Install cuDF (it is a cuDF module) | Install Polars with GPU engine extra |

Which one to use?

On CPU:

Unless you are stuck with existing code, pipelines, and workflows that you cannot change, you should use Polars instead of pandas and you should use the lazy API whenever you have large dataframes.

On GPU:

Similarly, unless you have no other option, you should use Polars with the GPU engine when you move to GPUs. It gives you the best of both worlds: lazy execution and its advantages (running out-of-core, better performance, reduced memory impact) as well as code that fully runs on GPU. If you are already using Polars with the lazy API on CPU, the code is virtually the same: you just need to pass engine="gpu" to collect, and you can customize the GPU engine if you want finer control.

pandas cuDF will speed up your pandas code to some extent and at no coding cost, but it is the least performant option since the pandas code that can’t run on GPU will run on CPU. Additionally, copying the data back and forth between host and device is costly.
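To illustrate what "no coding cost" means, here is a minimal sketch with made-up data. The pandas code itself is unchanged and the accelerator is enabled before pandas is imported. The try/except fallback and the accelerated flag are assumptions added so that the snippet still runs as plain pandas where cuDF is not installed.

```python
# Enable the accelerator before importing pandas.
try:
    import cudf.pandas

    cudf.pandas.install()
    accelerated = True
except ImportError:
    # cuDF is not installed: the code below runs as regular pandas on CPU.
    accelerated = False

import pandas as pd

# Ordinary pandas code, invented for this illustration.
df = pd.DataFrame({"key": ["a", "b", "a"], "value": [1, 2, 3]})
totals = df.groupby("key")["value"].sum()

print(f"GPU acceleration enabled: {accelerated}")
print(totals)
```

The same effect can be obtained without touching the script at all by running it with python -m cudf.pandas script.py, or with %load_ext cudf.pandas in a Jupyter notebook.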

cuDF runs the code fully on GPU and is a much better option, but it requires learning a new API since it only supports a subset of pandas commands and differs from the pandas API in places.

In this course, we will focus on the best option: Polars with the GPU engine.