DataFrames on steroids with Polars
Content from the webinar slides for easier browsing.
Background
Tabular data
Many fields of machine learning and data science rely on tabular data where
- columns hold variables and are homogeneous (same data type)
- rows contain observations and can be heterogeneous
Early computer options to manipulate such data were limited to spreadsheets
Dataframes (data frames or DataFrames) are two-dimensional objects that brought tabular data to programming languages
Early history of dataframes
The world was simple … but slow. Another problem: high memory usage
Issues with Pandas
Wes McKinney (pandas creator) himself has complaints about it:
• Internals too far from “the metal”
• No support for memory-mapped datasets
• Poor performance in database and file ingest / export
• Warty missing data support
• Lack of transparency into memory use, RAM management
• Weak support for categorical data
• Complex groupby operations awkward and slow
• Appending data to a DataFrame tedious and very costly
• Limited, non-extensible type metadata
• Eager evaluation model, no query planning
• “Slow”, limited multicore algorithms for large datasets
Improving performance
Parallel computing
Python global interpreter lock (GIL) gets in the way of multi-threading
Libraries such as Ray, Dask, and Apache Spark allow use of multiple cores and bring dataframes on clusters
Dask and Spark have pandas APIs, and Modin makes this even easier by providing a drop-in replacement for Pandas on Dask, Spark, and Ray
fugue provides a unified interface for distributed computing that works on Spark, Dask, and Ray
Accelerators
RAPIDS brings dataframes on the GPUs with the cuDF library
Integration with pandas is easy
Lazy out-of-core
Vaex exists as an alternative to pandas (no integration)
SQL
Structured Query Language (SQL) handles relational databases, but the distinction between SQL and dataframe software is getting increasingly blurry, with most libraries now able to handle both
DuckDB is a very fast and popular option with good integration with pandas
Many additional options such as dbt and the snowflake snowpark Python API exist, although integration with pandas is not always as good
Enter Polars
Comparison with Pandas
| | Pandas | Polars |
|---|---|---|
| Available for | Python | Rust, Python, R, NodeJS |
| Written in | Cython | Rust |
| Multithreading | Some operations | Yes (GIL released) |
| Index | Rows are indexed | Integer positions are used |
| Evaluation | Eager only | Lazy and eager |
| Query optimizer | No | Yes |
| Out-of-core | No | Yes |
| SIMD vectorization | Yes | Yes |
| Data in memory | With NumPy arrays | With Apache Arrow arrays |
| Memory efficiency | Poor | Excellent |
| Handling of missing data | Inconsistent | Consistent, promotes type stability |
Polars integration with other tools
As good as Pandas’ (except for cuDF, which is still in development)
With NumPy: see the documentation, the from_numpy and to_numpy functions, the development progress of this integration, and performance advice
Parallel computing: with Ray thanks to this setting; with Spark, Dask, and Ray thanks to fugue
GPUs: with the cuDF library from RAPIDS (in development)
SQL: with DuckDB
The list is growing fast
Benchmarks
Comparisons between Polars and distributed (Dask, Ray, Spark) or GPU (RAPIDS) libraries aren’t the most pertinent since those tools can be used in combination with Polars, compounding the benefits
It makes most sense to compare Polars with another library occupying the same “niche” such as Pandas or Vaex
The net is full of benchmarks with consistent results: Polars is 3 to 150 times faster than Pandas
Pandas is trying to fight back: v2.0 added optional Arrow support in place of NumPy, and Arrow later became the default engine, but performance remains far below that of Polars (e.g. in DataCamp benchmarks, official benchmarks, and many blog posts covering whole scripts or individual tasks)
As for Vaex, it appears to be about twice as slow, and its development has stalled over the past 10 months
The only framework performing better than Polars in some benchmarks is datatable (derived from the R package data.table), but it hasn’t seen development in 6 months, a sharp contrast with Polars’ rapid pace
Getting started
Installation
Personal computer:
python -m venv ~/env # Create virtual env
source ~/env/bin/activate # Activate virtual env
pip install --upgrade pip # Update pip
pip install polars # Install Polars
Alliance clusters (polars wheels are available, always prefer wheels when possible):
python -m venv ~/env # Create virtual env
source ~/env/bin/activate # Activate virtual env
pip install --upgrade pip --no-index # Update pip from wheel
pip install polars --no-index # Install Polars from wheel
Syntax
The package is well documented
Kevin Heavey wrote Modern Polars following the model of the Modern Pandas book. This is a great resource, although the scaling chapter is getting a little outdated since Polars evolves so fast
Overall, the syntax feels somewhat similar to R’s dplyr from the tidyverse
Table visualization
While Pandas comes with internal capabilities to make publication-ready tables, Polars integrates very well with great-tables
The bottom line
A rich new field
After years with a single Python option (Pandas), there is now an exuberant explosion of faster alternatives for dataframes
It might seem confusing and overwhelming, but in fact, the picture seems quite simple
For now, the new memory standard seems to be Apache Arrow and the most efficient library making use of it is Polars
Best performance strategy for software
At the moment, the best strategy thus seems to be:
Single machine: Polars
Cluster: Polars + fugue (example benchmark, documentation of integration)
GPUs available: Polars + RAPIDS library cuDF (integration coming soon)
SQL: Polars + DuckDB (documentation of integration)
Or combination of the above (if cluster with GPUs, etc.)
As so many libraries are developing an integration with Polars, it is becoming hard to find reasons to keep using Pandas
Performance tips
Read the migration guide: it will help you write Polars code rather than “literally translated” Pandas code that runs, but doesn’t make use of Polars’ strengths. The differences in style mostly come from the fact that Polars runs in parallel
Execution: lazy where possible
File format: Apache Parquet