The world of data frames
Let’s talk about data frames: how they came to the world of programming, how Pandas held a monopoly in Python for many years, and how quickly things are changing at the moment.
Tabular data
Many fields of machine learning, data science, and the humanities rely on tabular data, where
- columns hold variables and are homogeneous (same data type),
- rows contain observations and can be heterogeneous.
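As a tiny illustration in plain Python (with made-up data, purely hypothetical), each column holds a single type while a single row mixes types:
# A tiny illustrative table with made-up data: each column holds one type,
# while a single row (observation) mixes types.
table = {
    "name":   ["Alice", "Bob", "Carol"],   # column of strings
    "age":    [34, 27, 45],                # column of integers
    "height": [1.62, 1.78, 1.70],          # column of floats
}
row = {key: values[0] for key, values in table.items()}  # one heterogeneous observation
print(row)   # {'name': 'Alice', 'age': 34, 'height': 1.62}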
Early computer options to manipulate such data were limited to spreadsheets (e.g. Microsoft Excel).
Dataframes (also written data frames or DataFrames) are two-dimensional objects that brought tabular data to programming.
Early history of dataframes
After data frames emerged in S, then R, they were added to Python with the library Pandas in 2008.
The world was simple … but slow, with high memory usage, and it remained that way for a long time.
Issues with Pandas
While Pandas was the Python data frame library and is now at the core of many other libraries, Wes McKinney (the creator of pandas) himself has voiced complaints about it, mostly:
- the internals are too far from the hardware,
- no support for memory-mapped datasets,
- poor performance in database and file ingest / export,
- lack of proper support for missing data,
- lack of memory use and RAM management transparency,
- weak support for categorical data,
- groupby operations are complex, awkward, and slow,
- appending data to a DataFrame is tedious and very costly,
- limited, non-extensible type metadata,
- eager evaluation model with no query planning,
- slow and limited multicore algorithms for large datasets.
A rich new field
After years of Pandas as the Python dataframe library, there is currently an exuberant explosion of faster alternatives.
Parallel computing
The Python global interpreter lock (GIL) gets in the way of multi-threading.
Libraries such as Ray, Dask, and Apache Spark allow the use of multiple cores and bring dataframes to clusters.
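As a minimal sketch of that idea (using Dask, with a hypothetical set of CSV files and hypothetical column names), a partitioned dataframe is built lazily and computed in parallel:
# Sketch: a Dask dataframe partitions the data and runs operations across cores or a cluster.
# Assumes dask[dataframe] is installed; "data-*.csv", "key", and "value" are hypothetical.
import dask.dataframe as dd

ddf = dd.read_csv("data-*.csv")                # lazily builds a partitioned dataframe
result = ddf.groupby("key")["value"].mean()    # still lazy at this point
print(result.compute())                        # triggers the parallel computation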
Dask and Spark have Pandas APIs, and Modin makes this even simpler by providing a drop-in replacement for Pandas on Dask, Spark, and Ray.
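As a minimal sketch of the drop-in idea (assuming Modin is installed with one of its engines, for example modin[ray]), the only change to existing Pandas code is the import:
# Sketch: Modin as a drop-in replacement for Pandas (assumes modin[ray] is installed).
import modin.pandas as pd   # the only change from "import pandas as pd"

df = pd.DataFrame({"x": range(1_000_000)})
print(df["x"].mean())       # the same Pandas API, now running across all cores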
fugue provides a unified interface for distributed computing that works on Spark, Dask, and Ray.
Accelerators
RAPIDS brings dataframes to GPUs with the cuDF library.
Integration with pandas is easy.
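As a hedged sketch (assuming a CUDA-capable GPU and an installed cuDF), moving a pandas dataframe to the GPU and back looks like this:
# Sketch: moving data between pandas (CPU) and cuDF (GPU); assumes cudf is installed.
import pandas as pd
import cudf

df = pd.DataFrame({"x": range(1_000_000)})
gdf = cudf.from_pandas(df)      # copy the data to GPU memory
print(gdf["x"].mean())          # computed on the GPU
df_back = gdf.to_pandas()       # copy back to the CPU when needed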
Lazy out-of-core
Vaex is a lazy, out-of-core alternative to pandas.
SQL
Structured query language (SQL) handles relational databases, but the distinction between SQL and dataframe software is getting increasingly blurry with most libraries now able to handle both.
DuckDB is a very fast and popular option with good integration with pandas.
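As a minimal sketch (assuming the duckdb Python package is installed), DuckDB can run SQL directly against a pandas dataframe defined in the current session:
# Sketch: querying a pandas DataFrame with SQL through DuckDB (assumes duckdb is installed).
import pandas as pd
import duckdb

df = pd.DataFrame({"category": ["HISTORY", "SCIENCE", "HISTORY"], "value": [1, 2, 3]})
result = duckdb.sql("SELECT category, SUM(value) AS total FROM df GROUP BY category").df()
print(result)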
Many additional options exist, such as dbt and the Snowflake Snowpark Python API, although their integration with pandas is not always as good.
Polars
And then came Polars.
The new standard for in-memory data is Apache Arrow, and the most efficient library making use of it is Polars.
In addition, most libraries are developing an integration with Polars, anchoring it firmly in the Python ecosystem.
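As a minimal sketch of that interoperability (assuming pandas, polars, and pyarrow are all installed), converting between the two libraries is a one-liner in each direction:
# Sketch: converting between pandas and Polars (assumes pandas, polars, and pyarrow are installed).
import pandas as pd
import polars as pl

df_pd = pd.DataFrame({"x": [1, 2, 3]})
df_pl = pl.from_pandas(df_pd)   # pandas -> Polars
df_back = df_pl.to_pandas()     # Polars -> pandas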
For maximum dataframe efficiency, the best strategy currently seems to be:
- single machine ➔ use Polars,
- cluster ➔ use Polars + fugue (example benchmark, documentation of Polars integration),
- GPUs available ➔ use Polars + RAPIDS library cuDF (Polars integration coming soon),
- SQL ➔ use Polars + DuckDB (documentation of Polars integration),
- combination of the above (e.g. cluster with GPUs) ➔ use a combination of the above tools (e.g. Polars + fugue + RAPIDS).
Comparison with Pandas
Overview
| | Pandas | Polars |
|---|---|---|
| Available for | Python | Rust, Python, R, NodeJS |
| Written in | Cython | Rust |
| Multithreading | Some operations | Yes (GIL released) |
| Index | Rows are indexed | Integer positions are used |
| Evaluation | Eager only | Lazy and eager |
| Query optimizer | No | Yes |
| Out-of-core | No | Yes |
| SIMD vectorization | Yes | Yes |
| Data in memory | With NumPy arrays | With Apache Arrow arrays |
| Memory efficiency | Poor | Excellent |
| Handling of missing data | Inconsistent | Consistent, promotes type stability |
Performance
You can run these on our training cluster by either of two methods; one of them is to log in to the cluster through SSH and run the following in your command line:
module load ipykernel/2023b
source /project/def-sponsor00/shared/scientificpython-env/bin/activate
salloc --time=2:00:0 --mem-per-cpu=3600
ipython
Let’s go back to the FizzBuzz problem.
The best method with Pandas used masks. Let’s see how Polars fares in comparison.
First, let’s load the packages we will need:
import pandas as pd
import numpy as np
import polars as pl
And let’s make sure that the code works.
With Pandas:
size = 10_000
df_pd = pd.DataFrame()
df_pd["number"] = np.arange(1, size+1)
df_pd["response"] = df_pd["number"].astype(str)
df_pd.loc[df_pd["number"] % 3 == 0, "response"] = "Fizz"
df_pd.loc[df_pd["number"] % 5 == 0, "response"] = "Buzz"
df_pd.loc[df_pd["number"] % 15 == 0, "response"] = "FizzBuzz"
df_pd
number response
0 1 1
1 2 2
2 3 Fizz
3 4 4
4 5 Buzz
... ... ...
9995 9996 Fizz
9996 9997 9997
9997 9998 9998
9998 9999 Fizz
9999 10000 Buzz
[10000 rows x 2 columns]
With Polars:
size = 10_000
df_pl = pl.DataFrame({"number": np.arange(1, size+1)})
df_pl = df_pl.with_columns(pl.col("number").cast(pl.String).alias("response"))
df_pl.with_columns(
    # check the %15 case first: in a when/then chain, the first matching condition wins
    pl.when(pl.col("number") % 15 == 0)
    .then(pl.lit("FizzBuzz"))
    .when(pl.col("number") % 3 == 0)
    .then(pl.lit("Fizz"))
    .when(pl.col("number") % 5 == 0)
    .then(pl.lit("Buzz"))
    .otherwise(pl.col("response"))
    .alias("response")
)
shape: (10_000, 2)
┌────────┬──────────┐
│ number ┆ response │
│ --- ┆ --- │
│ i64 ┆ str │
╞════════╪══════════╡
│ 1 ┆ 1 │
│ 2 ┆ 2 │
│ 3 ┆ Fizz │
│ 4 ┆ 4 │
│ 5 ┆ Buzz │
│ … ┆ … │
│ 9996 ┆ Fizz │
│ 9997 ┆ 9997 │
│ 9998 ┆ 9998 │
│ 9999 ┆ Fizz │
│ 10000 ┆ Buzz │
└────────┴──────────┘
Now, let’s time them.
Pandas:
%%timeit
size = 10_000
df_pd = pd.DataFrame()
df_pd["number"] = np.arange(1, size+1)
df_pd["response"] = df_pd["number"].astype(str)
df_pd.loc[df_pd["number"] % 3 == 0, "response"] = "Fizz"
df_pd.loc[df_pd["number"] % 5 == 0, "response"] = "Buzz"
df_pd.loc[df_pd["number"] % 15 == 0, "response"] = "FizzBuzz"
4.75 ms ± 9.76 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Polars:
%%timeit
size = 10_000
df_pl = pl.DataFrame({"number": np.arange(1, size+1)})
df_pl = df_pl.with_columns(pl.col("number").cast(pl.String).alias("response"))
df_pl.with_columns(
    pl.when(pl.col("number") % 15 == 0)
    .then(pl.lit("FizzBuzz"))
    .when(pl.col("number") % 3 == 0)
    .then(pl.lit("Fizz"))
    .when(pl.col("number") % 5 == 0)
    .then(pl.lit("Buzz"))
    .otherwise(pl.col("response"))
    .alias("response")
)
518 μs ± 580 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
That’s a speedup of almost 10 (the longer the series, the larger this speedup will be).
Polars: 1, Pandas: 0 🙂
For a second example, let’s go back to the jeopardy example with a large file and compare the timing of Pandas and Polars.
Pandas:
%%timeit
df_pd = pd.read_csv("https://raw.githubusercontent.com/razoumov/publish/master/jeopardy.csv")
df_pd.loc[df_pd["Category"] == "HISTORY"].shape
887 ms ± 164 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Polars:
%%timeit
df_pl = pl.read_csv("https://raw.githubusercontent.com/razoumov/publish/master/jeopardy.csv")
df_pl.filter(pl.col("Category") == "HISTORY").shape
446 ms ± 89.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
That’s a speedup of 2.
But it gets even better: Polars supports lazy evaluation.
Lazy evaluation is not yet implemented when reading files from the cloud (Polars is a very new tool, but its functionality is expanding very fast). This means that we cannot test the benefit of lazy evaluation in our example by reading the CSV file from its current location (https://github.com/pola-rs/polars/issues/13115).
However, I downloaded it to our training cluster so that we can run the test.
First, let’s make sure that the code works.
Pandas:
df_pd = pd.read_csv("/project/def-sponsor00/data/jeopardy.csv")
df_pd.loc[df_pd["Category"] == "HISTORY"].shape
(349, 7)
Polars:
df_pl = pl.scan_csv("/project/def-sponsor00/data/jeopardy.csv")
df_pl.filter(pl.col("Category") == "HISTORY").collect().shape
(349, 7)
And now for the timing.
Pandas:
%%timeit
df_pd = pd.read_csv("/project/def-sponsor00/data/jeopardy.csv")
df_pd.loc[df_pd["Category"] == "HISTORY"].shape
331 ms ± 2.29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Polars:
%%timeit
df_pl = pl.scan_csv("/project/def-sponsor00/data/jeopardy.csv")
df_pl.filter(pl.col("Category") == "HISTORY").collect().shape
13.1 ms ± 175 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
That’s a speedup of 25 (the larger the file, the larger this speedup will be). This is because pl.scan_csv doesn’t read the file. Instead, it builds a lazy query. When the query is evaluated, only the part of the file that is actually necessary gets read in. This potentially saves a lot of time for very large files, and it even makes it possible to work with files too large to fit in memory.
Lazy evaluation also allows the query optimizer to combine operations where possible, very much the way compiled languages work.
To evaluate the lazy query and get a result, we use the collect method.
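To see what the optimizer actually does, here is a small sketch (reusing the same cluster file path as above); the explain method of a Polars LazyFrame prints the optimized query plan before anything is read:
# Sketch: inspecting a lazy query before running it (same cluster file path as above).
import polars as pl

lazy = (
    pl.scan_csv("/project/def-sponsor00/data/jeopardy.csv")   # builds a query plan; reads nothing yet
      .filter(pl.col("Category") == "HISTORY")
)
print(lazy.explain())   # optimized plan: the filter is pushed into the CSV scan
df = lazy.collect()     # only now is (the needed part of) the file read
print(df.shape)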
Note that Polars also has a pl.read_csv function if you want to use eager evaluation.
Polars: 2, Pandas: 0 🙂