Marie-Hélène Burle
May 14, 2024
Many fields of machine learning and data science rely on tabular data, where variables are stored in columns and observations in rows
Early computer options to manipulate such data were limited to spreadsheets
Dataframes (also written "data frames" or "DataFrames") are two-dimensional objects that brought tabular data to programming
The world was simple … but slow. Another problem: high memory usage
Wes McKinney (the creator of pandas) himself has voiced complaints about it:
• Internals too far from “the metal”
• No support for memory-mapped datasets
• Poor performance in database and file ingest / export
• Warty missing data support
• Lack of transparency into memory use, RAM management
• Weak support for categorical data
• Complex groupby operations awkward and slow
• Appending data to a DataFrame tedious and very costly
• Limited, non-extensible type metadata
• Eager evaluation model, no query planning
• “Slow”, limited multicore algorithms for large datasets
Python's global interpreter lock (GIL) gets in the way of multi-threading
Libraries such as Ray, Dask, and Apache Spark allow the use of multiple cores and bring dataframes to clusters
Dask and Spark have pandas APIs, and Modin makes this even more trivial by providing a drop-in replacement for pandas on Dask, Spark, and Ray
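For instance, switching a pandas script to Modin usually only requires changing the import (a minimal sketch; the CSV path is a made-up placeholder, and an engine such as Ray or Dask must be installed):

```python
# Drop-in replacement: swap "import pandas as pd" for Modin
import modin.pandas as pd

# The rest of the script is unchanged pandas code,
# but operations are now distributed across cores
df = pd.read_csv("data.csv")  # hypothetical file
print(df.describe())
```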
fugue provides a unified interface for distributed computing that works on Spark, Dask, and Ray
RAPIDS brings dataframes to the GPU with the cuDF library
Integration with pandas is easy
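For example, moving data between pandas and cuDF is a single call each way (a minimal sketch; assumes an NVIDIA GPU and a RAPIDS installation):

```python
import cudf
import pandas as pd

pdf = pd.DataFrame({"a": [1, 1, 2], "b": [4.0, 5.0, 6.0]})

gdf = cudf.from_pandas(pdf)      # copy the data to GPU memory
result = gdf.groupby("a").sum()  # this aggregation runs on the GPU
back = result.to_pandas()        # copy the result back to host memory
```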
Vaex exists as an alternative to pandas (no integration)
Structured Query Language (SQL) handles relational databases, but the distinction between SQL engines and dataframe software is getting increasingly blurry, with most libraries now able to handle both
DuckDB is a very fast and popular option with good integration with pandas
Many additional options such as dbt and the Snowflake Snowpark Python API exist, although integration with pandas is not always as good
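As an illustration, DuckDB can query a pandas DataFrame in place simply by referencing the variable name in SQL (a minimal sketch with made-up data):

```python
import duckdb
import pandas as pd

df = pd.DataFrame({"name": ["x", "y", "z"], "value": [1, 2, 3]})

# DuckDB finds the local variable "df" and queries it directly,
# then .df() converts the result back to a pandas DataFrame
result = duckdb.sql("SELECT name, value * 2 AS doubled FROM df").df()
print(result)
```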
| | Pandas | Polars |
|---|---|---|
| Available for | Python | Rust, Python, R, NodeJS |
| Written in | Cython | Rust |
| Multithreading | Some operations | Yes (GIL released) |
| Index | Rows are indexed | Integer positions are used |
| Evaluation | Eager only | Lazy and eager |
| Query optimizer | No | Yes |
| Out-of-core | No | Yes |
| SIMD vectorization | Yes | Yes |
| Data in memory | With NumPy arrays | With Apache Arrow arrays |
| Memory efficiency | Poor | Excellent |
| Handling of missing data | Inconsistent | Consistent, promotes type stability |
Polars' integration with other tools is as good as Pandas' (except for cuDF, still in development):
• With NumPy: see the documentation, the from_numpy and to_numpy functions, the development progress of this integration, and performance advice (a sketch follows this list)
• Parallel computing: with Ray thanks to this setting; with Spark, Dask, and Ray thanks to fugue
• GPUs: with the cuDF library from RAPIDS (in development)
• SQL: with DuckDB
The list is growing fast
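As mentioned in the list above, converting between Polars and NumPy is direct (a minimal sketch with made-up data):

```python
import numpy as np
import polars as pl

arr = np.array([[1, 2], [3, 4], [5, 6]])

# NumPy -> Polars ("row" orientation: each inner array is one row)
df = pl.from_numpy(arr, schema=["a", "b"], orient="row")

# Polars -> NumPy
back = df.to_numpy()
print(df, back, sep="\n")
```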
Comparisons between Polars and distributed (Dask, Ray, Spark) or GPU (RAPIDS) libraries aren't the most pertinent since those libraries can be used in combination with Polars, so their benefits add up
It makes most sense to compare Polars with another library occupying the same “niche” such as Pandas or Vaex
The web is full of benchmarks with consistent results: Polars is 3 to 150 times faster than Pandas
Pandas is trying to fight back: v2.0 came with optional Arrow support as an alternative to NumPy (with plans to make it the default), but performance remains far below that of Polars (e.g. in DataCamp benchmarks, official benchmarks, and many blog posts covering whole scripts or individual tasks)
As for Vaex, it seems about twice as slow, and its development has stalled over the past 10 months
The only framework that performs better than Polars in some benchmarks is datatable (derived from the R package data.table), but it hasn't seen development in 6 months, a sharp contrast with the fast pace of Polars
Personal computer:

```sh
python -m venv ~/env          # Create virtual env
source ~/env/bin/activate     # Activate virtual env
pip install --upgrade pip     # Update pip
pip install polars            # Install Polars
```
Alliance clusters (Polars wheels are available; always prefer wheels when possible):
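A minimal sketch following the usual Alliance workflow (the exact Python module version is an assumption; check with `module avail python`):

```sh
module load python               # Load a Python module
virtualenv --no-download ~/env   # Create virtual env without downloads
source ~/env/bin/activate        # Activate virtual env
pip install --no-index --upgrade pip   # Update pip from local wheels
pip install --no-index polars    # Install the Polars wheel provided on the clusters
```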
The package is well documented
Kevin Heavey wrote Modern Polars following the model of the Modern Pandas book. It is a great resource, although the scaling chapter is getting a little outdated since Polars evolves so fast
Overall, the syntax feels somewhat similar to R’s dplyr from the tidyverse
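To give a feel for the style, here is a typical chained query (a minimal sketch with made-up data):

```python
import polars as pl

df = pl.DataFrame({
    "species": ["a", "a", "b", "b"],
    "mass": [5.2, 6.1, 3.3, 3.9],
})

# Chained "verbs", much like dplyr's filter/group_by/summarise
result = (
    df.filter(pl.col("mass") > 3.5)
      .group_by("species")
      .agg(pl.col("mass").mean().alias("mean_mass"))
)
print(result)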
While Pandas comes with internal capabilities to make publication-ready tables, Polars integrates very well with great-tables
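For instance, great-tables accepts a Polars DataFrame directly (a minimal sketch; the data and header text are made up):

```python
import polars as pl
from great_tables import GT

df = pl.DataFrame({"model": ["x", "y"], "score": [0.91, 0.87]})

# Build a publication-ready table from the Polars DataFrame;
# in a notebook, the GT object renders as an HTML table
GT(df).tab_header(title="Model scores")
```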
After years with Pandas as the only Python option, there is currently an exuberant explosion of faster alternatives for dataframes
It might seem confusing and overwhelming, but in fact, the picture seems quite simple
For now, the new memory standard seems to be Apache Arrow and the most efficient library making use of it is Polars
The best strategy at the moment thus seems to be:
• Single machine: Polars
• Cluster: Polars + fugue (example benchmark, documentation of integration)
• GPUs available: Polars + the RAPIDS library cuDF (integration coming soon)
• SQL: Polars + DuckDB (documentation of integration)
• Or a combination of the above (e.g. a cluster with GPUs)
As so many libraries are developing an integration with Polars, it is becoming hard to find reasons to still use Pandas
Read the migration guide: it will help you write Polars code rather than "literally translated" Pandas code that runs but doesn't make use of Polars' strengths. The differences in style mostly come from the fact that Polars runs in parallel
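A typical illustration (a minimal sketch with made-up columns): sequential pandas-style assignments force one-at-a-time execution, while putting independent expressions in a single context lets Polars run them in parallel:

```python
import polars as pl

df = pl.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Pandas style, translated literally: one context at a time, sequential
df_seq = (
    df.with_columns((pl.col("a") * 2).alias("a2"))
      .with_columns((pl.col("b") * 3).alias("b3"))
)

# Idiomatic Polars: independent expressions in one context run in parallel
df_par = df.with_columns(
    (pl.col("a") * 2).alias("a2"),
    (pl.col("b") * 3).alias("b3"),
)
```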
• Execution: lazy where possible
• File format: Apache Parquet
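Both recommendations combine naturally (a minimal sketch; data.parquet and its columns are hypothetical):

```python
import polars as pl

# scan_parquet is lazy: nothing is read yet
query = (
    pl.scan_parquet("data.parquet")   # hypothetical file
      .filter(pl.col("value") > 0)
      .group_by("category")
      .agg(pl.col("value").mean())
)

# The query optimizer plans the whole pipeline before executing it,
# so only the needed columns and row groups are read from the file
result = query.collect()
```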
In fall 2024, I plan to offer an introductory course on Polars covering: