DataFrames on steroids with Polars


Marie-Hélène Burle

May 14, 2024


Background

Tabular data

Many fields of machine learning and data science rely on tabular data where

  • columns hold variables and are homogeneous (same data type)
  • rows contain observations and can be heterogeneous

Early computer options to manipulate such data were limited to spreadsheets

Dataframes (data frames or DataFrames) are two-dimensional objects that brought tabular data to programming

Early history of dataframes

  • 1990: S programming language
  • 2000: R
  • 2008: Pandas (Python)

The world was simple … but slow. Another problem: high memory usage

Issues with Pandas

Wes McKinney (pandas creator) himself has complaints about it:

• Internals too far from “the metal”
• No support for memory-mapped datasets
• Poor performance in database and file ingest / export
• Warty missing data support
• Lack of transparency into memory use, RAM management
• Weak support for categorical data
• Complex groupby operations awkward and slow
• Appending data to a DataFrame tedious and very costly
• Limited, non-extensible type metadata
• Eager evaluation model, no query planning
• “Slow”, limited multicore algorithms for large datasets

Improving performance

Parallel computing

Python's global interpreter lock (GIL) gets in the way of multithreading

Libraries such as Ray, Dask, and Apache Spark allow the use of multiple cores and bring dataframes to clusters

Dask and Spark have APIs for Pandas, and Modin makes this even easier by providing a drop-in replacement for Pandas on Dask, Spark, and Ray

fugue provides a unified interface for distributed computing that works on Spark, Dask, and Ray

Accelerators

RAPIDS brings dataframes to GPUs with the cuDF library

Integration with pandas is easy

Lazy out-of-core

Vaex exists as an alternative to pandas (no integration)

SQL

Structured query language (SQL) handles relational databases, but the distinction between SQL and dataframe software is getting increasingly blurry with most libraries now able to handle both

DuckDB is a very fast and popular option with good integration with pandas

Many additional options such as dbt and the Snowflake Snowpark Python API exist, although integration with pandas is not always as good

Enter Polars

Comparison with Pandas

                          Pandas             Polars
Available for             Python             Rust, Python, R, NodeJS
Written in                Cython             Rust
Multithreading            Some operations    Yes (GIL released)
Index                     Rows are indexed   Integer positions are used
Evaluation                Eager only         Lazy and eager
Query optimizer           No                 Yes
Out-of-core               No                 Yes
SIMD vectorization        Yes                Yes
Data in memory            NumPy arrays       Apache Arrow arrays
Memory efficiency         Poor               Excellent
Handling of missing data  Inconsistent       Consistent, promotes type stability

Polars integration with other tools

Polars integrates with other tools about as well as Pandas does (except for cuDF, for which support is still in development)

The list is growing fast

Benchmarks

Comparisons between Polars and distributed (Dask, Ray, Spark) or GPU (RAPIDS) libraries aren’t the most pertinent since these libraries can be used together with Polars, combining their benefits

It makes most sense to compare Polars with another library occupying the same “niche” such as Pandas or Vaex

Benchmarks

The web is full of benchmarks with consistent results: they report Polars to be 3 to 150 times faster than Pandas

Pandas is trying to fight back: v 2.0 came with optional Arrow support as an alternative to NumPy (NumPy remains the default backend), but performance remains well below that of Polars (e.g. in DataCamp benchmarks, official benchmarks, and many blog posts covering whole scripts or individual tasks)

As for Vaex, it appears to be about twice as slow as Polars and its development has stalled over the past 10 months

The only framework performing better than Polars in some benchmarks is datatable (derived from the R package data.table), but it hasn’t been developed for 6 months, a sharp contrast with the fast development of Polars

Getting started

Installation

Personal computer:

python -m venv ~/env                  # Create virtual env
source ~/env/bin/activate             # Activate virtual env
pip install --upgrade pip             # Update pip
pip install polars                    # Install Polars


Alliance clusters (Polars wheels are available; always prefer wheels when possible):

python -m venv ~/env                  # Create virtual env
source ~/env/bin/activate             # Activate virtual env
pip install --upgrade pip --no-index  # Update pip from wheel
pip install polars --no-index         # Install Polars from wheel

Syntax

The package is well documented

Kevin Heavey wrote Modern Polars following the model of the Modern Pandas book. This is a great resource, although the scaling chapter is getting a little outdated since Polars is evolving so fast

Overall, the syntax feels somewhat similar to R’s dplyr from the tidyverse

Table visualization

While Pandas comes with built-in capabilities to make publication-ready tables, Polars relies on its very good integration with great-tables

The bottom line

A rich new field

After years with a single Python option (Pandas), there is now an exuberant explosion of faster alternatives for dataframes

It might seem confusing and overwhelming, but in fact, the picture seems quite simple

For now, the new memory standard seems to be Apache Arrow and the most efficient library making use of it is Polars

Best performance strategy for software

The best strategy thus seems to be at the moment:

As so many libraries are developing an integration with Polars, it is becoming hard to find reasons to still use Pandas

Performance tips

  • Read the migration guide: it will help you write Polars code rather than “literally translated” Pandas code that runs, but doesn’t make use of Polars’ strengths. The differences in style mostly come from the fact that Polars runs in parallel

  • Execution: lazy where possible

  • File format: Apache Parquet

Course on Polars coming this fall

In fall 2024, I plan to offer an introductory course on Polars covering:

  • basic syntax
  • how to use Polars in a Ray cluster on Alliance supercomputers thanks to fugue
  • how to run Polars on GPUs thanks to cuDF if the project is available by then