DataFrames on steroids with Polars
Content from the webinar slides for easier browsing.
Background
Tabular data
Many fields of machine learning and data science rely on tabular data where
- columns hold variables and are homogeneous (same data type)
- rows contain observations and can be heterogeneous
Early computer options to manipulate such data were limited to spreadsheets
Dataframes (data frames or DataFrames) are two-dimensional objects that brought tabular data to programming languages
Early history of dataframes
The world was simple … but slow. Another problem: high memory usage
Issues with Pandas
Wes McKinney (pandas creator) himself has complaints about it:
• Internals too far from “the metal”
• No support for memory-mapped datasets
• Poor performance in database and file ingest / export
• Warty missing data support
• Lack of transparency into memory use, RAM management
• Weak support for categorical data
• Complex groupby operations awkward and slow
• Appending data to a DataFrame tedious and very costly
• Limited, non-extensible type metadata
• Eager evaluation model, no query planning
• “Slow”, limited multicore algorithms for large datasets
Improving performance
Parallel computing
Python global interpreter lock (GIL) gets in the way of multi-threading
Libraries such as Ray, Dask, and Apache Spark allow use of multiple cores and bring dataframes on clusters
Dask and Spark have pandas APIs, and Modin makes this even easier by providing a drop-in replacement for Pandas on Dask, Spark, and Ray
fugue provides a unified interface for distributed computing that works on Spark, Dask, and Ray
Accelerators
RAPIDS brings dataframes on the GPUs with the cuDF library
Integration with pandas is easy
Lazy out-of-core
Vaex exists as an alternative to pandas (no integration)
SQL
Structured Query Language (SQL) handles relational databases, but the distinction between SQL and dataframe software is getting increasingly blurry, with most libraries now able to handle both
DuckDB is a very fast and popular option with good integration with pandas
Many additional options such as dbt and the snowflake snowpark Python API exist, although integration with pandas is not always as good
Enter Polars
Comparison with Pandas
| | Pandas | Polars |
|---|---|---|
| Available for | Python | Rust, Python, R, NodeJS |
| Written in | Cython | Rust |
| Multithreading | Some operations | Yes (GIL released) |
| Index | Rows are indexed | Integer positions are used |
| Evaluation | Eager only | Lazy and eager |
| Query optimizer | No | Yes |
| Out-of-core | No | Yes |
| SIMD vectorization | Yes | Yes |
| Data in memory | With NumPy arrays | With Apache Arrow arrays |
| Memory efficiency | Poor | Excellent |
| Handling of missing data | Inconsistent | Consistent, promotes type stability |
Polars integration with other tools
As good as Pandas’ (except for cuDF, which is still in development)
With NumPy: see the documentation, the from_numpy and to_numpy functions, the development progress of this integration, and performance advice
Parallel computing: with Ray thanks to this setting; with Spark, Dask, and Ray thanks to fugue
GPUs: with the cuDF library from RAPIDS (in development)
SQL: with DuckDB
The list is growing fast
Benchmarks
Comparisons between Polars and distributed (Dask, Ray, Spark) or GPU (RAPIDS) libraries aren’t the most pertinent since those tools can be used in combination with Polars, compounding the benefits
It makes most sense to compare Polars with another library occupying the same “niche” such as Pandas or Vaex
The net is full of benchmarks with consistent results: Polars is 3 to 150 times faster than Pandas
Pandas is trying to fight back: v2.0 added optional Arrow support in place of NumPy, and Arrow later became the default engine, but performance remains far below that of Polars (e.g. in DataCamp benchmarks, official benchmarks, and many blog posts covering whole scripts or individual tasks)
As for Vaex, it appears to be about twice as slow, and its development has stalled over the past 10 months
The only framework performing better than Polars in some benchmarks is datatable (derived from the R package data.table), but it hasn’t seen development in 6 months, a sharp contrast with Polars’ rapid pace
Getting started
Installation
Personal computer:
python -m venv ~/env # Create virtual env
source ~/env/bin/activate # Activate virtual env
pip install --upgrade pip # Update pip
pip install polars # Install Polars
Alliance clusters (polars wheels are available, always prefer wheels when possible):
python -m venv ~/env # Create virtual env
source ~/env/bin/activate # Activate virtual env
pip install --upgrade pip --no-index # Update pip from wheel
pip install polars --no-index # Install Polars from wheel
Syntax
The package is well documented
Kevin Heavey wrote Modern Polars following the model of the Modern Pandas book. This is a great resource, although the scaling chapter is getting a little outdated since Polars evolves so fast
Overall, the syntax feels somewhat similar to R’s dplyr from the tidyverse
Table visualization
While Pandas comes with internal capabilities to make publication-ready tables, Polars integrates very well with great-tables
The bottom line
A rich new field
After years with a single Python option (Pandas), there is now an exuberant explosion of faster alternatives for dataframes
It might seem confusing and overwhelming, but in fact, the picture seems quite simple
For now, the new memory standard seems to be Apache Arrow and the most efficient library making use of it is Polars
Best performance strategy for software
At the moment, the best strategy thus seems to be:
Single machine: Polars
Cluster: Polars + fugue (example benchmark, documentation of integration)
GPUs available: Polars + RAPIDS library cuDF (integration coming soon)
SQL: Polars + DuckDB (documentation of integration)
Or combination of the above (if cluster with GPUs, etc.)
As so many libraries are developing an integration with Polars, it is becoming hard to find reasons to keep using Pandas
Performance tips
Read the migration guide: it will help you write Polars code rather than “literally translated” Pandas code that runs, but doesn’t make use of Polars’ strengths. The differences in style mostly come from the fact that Polars runs in parallel
Execution: lazy where possible
File format: Apache Parquet