Marie-Hélène Burle
May 14, 2024
Many fields of machine learning and data science rely on tabular data, where variables are stored in columns and observations in rows
Early computer options to manipulate such data were limited to spreadsheets
Dataframes (also written "data frames" or "DataFrames") are two-dimensional objects that brought tabular data to programming
The world was simple … but slow. Another problem: high memory usage
Wes McKinney (the creator of pandas) himself has voiced complaints about it:
• Internals too far from “the metal”
• No support for memory-mapped datasets
• Poor performance in database and file ingest / export
• Warty missing data support
• Lack of transparency into memory use, RAM management
• Weak support for categorical data
• Complex groupby operations awkward and slow
• Appending data to a DataFrame tedious and very costly
• Limited, non-extensible type metadata
• Eager evaluation model, no query planning
• “Slow”, limited multicore algorithms for large datasets
Python's global interpreter lock (GIL) gets in the way of multi-threading
Libraries such as Ray, Dask, and Apache Spark allow the use of multiple cores and bring dataframes to clusters
Dask and Spark have pandas APIs, and Modin makes this even more trivial by providing a drop-in replacement for pandas on Dask, Spark, and Ray
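For instance, switching a pandas script to Modin usually only requires changing the import (a minimal sketch; the CSV path is a made-up placeholder, and an engine such as Ray or Dask must be installed):

```python
# Drop-in replacement: swap "import pandas as pd" for Modin
import modin.pandas as pd

# The rest of the script is unchanged pandas code,
# but operations are now distributed across cores
df = pd.read_csv("data.csv")  # hypothetical file
print(df.describe())
```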
fugue provides a unified interface for distributed computing that works on Spark, Dask, and Ray
RAPIDS brings dataframes to the GPU with the cuDF library
Integration with pandas is easy
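For example, moving data between pandas and cuDF is a single call each way (a minimal sketch; assumes an NVIDIA GPU and a RAPIDS installation):

```python
import cudf
import pandas as pd

pdf = pd.DataFrame({"a": [1, 1, 2], "b": [4.0, 5.0, 6.0]})

gdf = cudf.from_pandas(pdf)      # copy the data to GPU memory
result = gdf.groupby("a").sum()  # this aggregation runs on the GPU
back = result.to_pandas()        # copy the result back to host memory
```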
Vaex exists as an alternative to pandas (no integration)
Structured Query Language (SQL) handles relational databases, but the distinction between SQL engines and dataframe software is getting increasingly blurry, with most libraries now able to handle both
DuckDB is a very fast and popular option with good integration with pandas
Many additional options such as dbt and the Snowflake Snowpark Python API exist, although integration with pandas is not always as good
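As an illustration, DuckDB can query a pandas DataFrame in place simply by referencing the variable name in SQL (a minimal sketch with made-up data):

```python
import duckdb
import pandas as pd

df = pd.DataFrame({"name": ["x", "y", "z"], "value": [1, 2, 3]})

# DuckDB finds the local variable "df" and queries it directly,
# then .df() converts the result back to a pandas DataFrame
result = duckdb.sql("SELECT name, value * 2 AS doubled FROM df").df()
print(result)
```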
| | Pandas | Polars |
|---|---|---|
| Available for | Python | Rust, Python, R, NodeJS |
| Written in | Cython | Rust |
| Multithreading | Some operations | Yes (GIL released) |
| Index | Rows are indexed | Integer positions are used |
| Evaluation | Eager only | Lazy and eager |
| Query optimizer | No | Yes |
| Out-of-core | No | Yes |
| SIMD vectorization | Yes | Yes |
| Data in memory | With NumPy arrays | With Apache Arrow arrays |
| Memory efficiency | Poor | Excellent |
| Handling of missing data | Inconsistent | Consistent, promotes type stability |
Polars' integration with other tools is as good as Pandas' (except for cuDF, still in development):
• With NumPy: see the documentation, the from_numpy and to_numpy functions, the development progress of this integration, and performance advice (a sketch follows this list)
• Parallel computing: with Ray thanks to this setting; with Spark, Dask, and Ray thanks to fugue
• GPUs: with the cuDF library from RAPIDS (in development)
• SQL: with DuckDB
The list is growing fast
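As mentioned in the list above, converting between Polars and NumPy is direct (a minimal sketch with made-up data):

```python
import numpy as np
import polars as pl

arr = np.array([[1, 2], [3, 4], [5, 6]])

# NumPy -> Polars ("row" orientation: each inner array is one row)
df = pl.from_numpy(arr, schema=["a", "b"], orient="row")

# Polars -> NumPy
back = df.to_numpy()
print(df, back, sep="\n")
```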
Comparisons between Polars and distributed (Dask, Ray, Spark) or GPU (RAPIDS) libraries aren't the most pertinent since those libraries can be used in combination with Polars, so their benefits add up
It makes most sense to compare Polars with another library occupying the same “niche” such as Pandas or Vaex
The web is full of benchmarks with consistent results: Polars is 3 to 150 times faster than Pandas
Pandas is trying to fight back: v2.0 came with optional Arrow support as an alternative to NumPy (with plans to make it the default), but performance remains far below that of Polars (e.g. in DataCamp benchmarks, official benchmarks, and many blog posts covering whole scripts or individual tasks)
As for Vaex, it seems about twice as slow, and its development has stalled over the past 10 months
The only framework that performs better than Polars in some benchmarks is datatable (derived from the R package data.table), but it hasn't seen development in 6 months, a sharp contrast with the fast pace of Polars
Personal computer:

```sh
python -m venv ~/env          # Create virtual env
source ~/env/bin/activate     # Activate virtual env
pip install --upgrade pip     # Update pip
pip install polars            # Install Polars
```
Alliance clusters (Polars wheels are available; always prefer wheels when possible):
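A minimal sketch following the usual Alliance workflow (the exact Python module version is an assumption; check with `module avail python`):

```sh
module load python               # Load a Python module
virtualenv --no-download ~/env   # Create virtual env without downloads
source ~/env/bin/activate        # Activate virtual env
pip install --no-index --upgrade pip   # Update pip from local wheels
pip install --no-index polars    # Install the Polars wheel provided on the clusters
```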
The package is well documented
Kevin Heavey wrote Modern Polars following the model of the Modern Pandas book. It is a great resource, although the scaling chapter is getting a little outdated since Polars evolves so fast
Overall, the syntax feels somewhat similar to R’s dplyr from the tidyverse
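To give a feel for the style, here is a typical chained query (a minimal sketch with made-up data):

```python
import polars as pl

df = pl.DataFrame({
    "species": ["a", "a", "b", "b"],
    "mass": [5.2, 6.1, 3.3, 3.9],
})

# Chained "verbs", much like dplyr's filter/group_by/summarise
result = (
    df.filter(pl.col("mass") > 3.5)
      .group_by("species")
      .agg(pl.col("mass").mean().alias("mean_mass"))
)
print(result)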
While Pandas comes with internal capabilities to make publication-ready tables, Polars integrates very well with great-tables
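For instance, great-tables accepts a Polars DataFrame directly (a minimal sketch; the data and header text are made up):

```python
import polars as pl
from great_tables import GT

df = pl.DataFrame({"model": ["x", "y"], "score": [0.91, 0.87]})

# Build a publication-ready table from the Polars DataFrame;
# in a notebook, the GT object renders as an HTML table
GT(df).tab_header(title="Model scores")
```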
After years with Pandas as the only Python option, there is currently an exuberant explosion of faster alternatives for dataframes
It might seem confusing and overwhelming, but in fact, the picture seems quite simple
For now, the new memory standard seems to be Apache Arrow and the most efficient library making use of it is Polars
The best strategy at the moment thus seems to be:
• Single machine: Polars
• Cluster: Polars + fugue (example benchmark, documentation of integration)
• GPUs available: Polars + the RAPIDS library cuDF (integration coming soon)
• SQL: Polars + DuckDB (documentation of integration)
• Or a combination of the above (e.g. a cluster with GPUs)
As so many libraries are developing an integration with Polars, it is becoming hard to find reasons to still use Pandas
Read the migration guide: it will help you write Polars code rather than "literally translated" Pandas code that runs but doesn't make use of Polars' strengths. The differences in style mostly come from the fact that Polars runs in parallel
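A typical illustration (a minimal sketch with made-up columns): sequential pandas-style assignments force one-at-a-time execution, while putting independent expressions in a single context lets Polars run them in parallel:

```python
import polars as pl

df = pl.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Pandas style, translated literally: one context at a time, sequential
df_seq = (
    df.with_columns((pl.col("a") * 2).alias("a2"))
      .with_columns((pl.col("b") * 3).alias("b3"))
)

# Idiomatic Polars: independent expressions in one context run in parallel
df_par = df.with_columns(
    (pl.col("a") * 2).alias("a2"),
    (pl.col("b") * 3).alias("b3"),
)
```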
• Execution: lazy where possible
• File format: Apache Parquet
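Both recommendations combine naturally (a minimal sketch; data.parquet and its columns are hypothetical):

```python
import polars as pl

# scan_parquet is lazy: nothing is read yet
query = (
    pl.scan_parquet("data.parquet")   # hypothetical file
      .filter(pl.col("value") > 0)
      .group_by("category")
      .agg(pl.col("value").mean())
)

# The query optimizer plans the whole pipeline before executing it,
# so only the needed columns and row groups are read from the file
result = query.collect()
```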
In fall 2024, I plan to offer an introductory course on Polars covering: