Comparison with pandas
As pandas was the only data frame library for Python for a long time, many Python users are familiar with it and a comparison with Polars might be useful.
Overview
| | pandas | Polars |
|---|---|---|
| Available for | Python | Rust, Python, R, NodeJS |
| Written in | Cython | Rust |
| Multithreading | Some operations | Yes (GIL released) |
| Index | Rows are indexed | Integer positions are used |
| Evaluation | Eager only | Lazy and eager |
| Query optimizer | No | Yes |
| Out-of-core | No | Yes |
| SIMD vectorization | Yes | Yes |
| Data in memory | With NumPy arrays | With Apache Arrow arrays |
| Memory efficiency | Poor | Excellent |
| Handling of missing data | Inconsistent | Consistent, promotes type stability |
Performance
Let’s use the FizzBuzz problem.
In his pandas course, Alex compares multiple methods and shows that the best method uses masks. Let’s see how Polars fares in comparison to pandas’ best method.
First, let’s load the packages we will need:
import pandas as pd
import numpy as np
import polars as pl
And let’s make sure that the code works.
With pandas:
df_pd = pd.DataFrame()
size = 10_000
df_pd["number"] = np.arange(1, size+1)
df_pd["response"] = df_pd["number"].astype(str)
df_pd.loc[df_pd["number"] % 3 == 0, "response"] = "Fizz"
df_pd.loc[df_pd["number"] % 5 == 0, "response"] = "Buzz"
df_pd.loc[df_pd["number"] % 15 == 0, "response"] = "FizzBuzz"
df_pd
number response
0 1 1
1 2 2
2 3 Fizz
3 4 4
4 5 Buzz
... ... ...
9995 9996 Fizz
9996 9997 9997
9997 9998 9998
9998 9999 Fizz
9999 10000 Buzz
[10000 rows x 2 columns]
With Polars:
size = 10_000
df_pl = pl.DataFrame({"number": np.arange(1, size+1)})
df_pl = df_pl.with_columns(pl.col("number").cast(pl.String).alias("response"))
# In a when/then chain the first matching condition wins,
# so the multiple-of-15 test must come first
df_pl.with_columns(
    pl.when(pl.col("number") % 15 == 0)
    .then(pl.lit("FizzBuzz"))
    .when(pl.col("number") % 3 == 0)
    .then(pl.lit("Fizz"))
    .when(pl.col("number") % 5 == 0)
    .then(pl.lit("Buzz"))
    .otherwise(pl.col("response"))
    .alias("response")
)
shape: (10_000, 2)
┌────────┬──────────┐
│ number ┆ response │
│ --- ┆ --- │
│ i64 ┆ str │
╞════════╪══════════╡
│ 1 ┆ 1 │
│ 2 ┆ 2 │
│ 3 ┆ Fizz │
│ 4 ┆ 4 │
│ 5 ┆ Buzz │
│ … ┆ … │
│ 9996 ┆ Fizz │
│ 9997 ┆ 9997 │
│ 9998 ┆ 9998 │
│ 9999 ┆ Fizz │
│ 10000 ┆ Buzz │
└────────┴──────────┘
Now, let’s time them.
pandas:
%%timeit
df_pd = pd.DataFrame()
size = 10_000
df_pd["number"] = np.arange(1, size+1)
df_pd["response"] = df_pd["number"].astype(str)
df_pd.loc[df_pd["number"] % 3 == 0, "response"] = "Fizz"
df_pd.loc[df_pd["number"] % 5 == 0, "response"] = "Buzz"
df_pd.loc[df_pd["number"] % 15 == 0, "response"] = "FizzBuzz"
4.75 ms ± 9.76 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Polars:
%%timeit
size = 10_000
df_pl = pl.DataFrame({"number": np.arange(1, size+1)})
df_pl = df_pl.with_columns(pl.col("number").cast(pl.String).alias("response"))
# In a when/then chain the first matching condition wins,
# so the multiple-of-15 test must come first
df_pl.with_columns(
    pl.when(pl.col("number") % 15 == 0)
    .then(pl.lit("FizzBuzz"))
    .when(pl.col("number") % 3 == 0)
    .then(pl.lit("Fizz"))
    .when(pl.col("number") % 5 == 0)
    .then(pl.lit("Buzz"))
    .otherwise(pl.col("response"))
    .alias("response")
)
518 μs ± 580 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
That’s a speedup of about 9 (the longer the series, the larger this speedup tends to be).
Polars: 1, pandas: 0
For a second example, let’s go back to the Jeopardy example with a large file and compare the timing of pandas and Polars.
pandas:
%%timeit
df_pd = pd.read_csv("https://raw.githubusercontent.com/razoumov/publish/master/jeopardy.csv")
df_pd.loc[df_pd["Category"] == "HISTORY"].shape
887 ms ± 164 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Polars:
%%timeit
df_pl = pl.read_csv("https://raw.githubusercontent.com/razoumov/publish/master/jeopardy.csv")
df_pl.filter(pl.col("Category") == "HISTORY").shape
446 ms ± 89.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
That’s a speedup of 2.
But it gets even better: Polars supports lazy evaluation.
Lazy evaluation is not yet implemented when reading files from the cloud (Polars is a very new tool, but its functionalities are expanding very fast). This means that we cannot test the benefit of lazy evaluation in our example by using the CSV file in its current location (https://github.com/pola-rs/polars/issues/13115).
However, I downloaded it to our training cluster so that we can run the test.
First, let’s make sure that the code works.
pandas:
df_pd = pd.read_csv("/project/def-sponsor00/data/jeopardy.csv")
df_pd.loc[df_pd["Category"] == "HISTORY"].shape
(349, 7)
Polars:
df_pl = pl.scan_csv("/project/def-sponsor00/data/jeopardy.csv")
df_pl.filter(pl.col("Category") == "HISTORY").collect().shape
(349, 7)
And now for the timing.
pandas:
%%timeit
df_pd = pd.read_csv("/project/def-sponsor00/data/jeopardy.csv")
df_pd.loc[df_pd["Category"] == "HISTORY"].shape
331 ms ± 2.29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Polars:
%%timeit
df_pl = pl.scan_csv("/project/def-sponsor00/data/jeopardy.csv")
df_pl.filter(pl.col("Category") == "HISTORY").collect().shape
13.1 ms ± 175 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
That’s a speedup of about 25 (the larger the file, the larger this speedup tends to be). This is because pl.scan_csv doesn’t read the file right away. Instead, it creates a future (a LazyFrame that only describes the query). With a lazy query, the filter is applied while the file is scanned, so only the data the query actually needs is kept in memory. This can save a lot of time for very large files, and it even makes it possible to work with files too large to fit in memory.
Lazy evaluation also allows the query optimizer to combine operations where possible, much as a compiler optimizes code.
To evaluate the future and get a result, we use the collect method.
Note that Polars also has a pl.read_csv function if you want to use eager evaluation.
Polars: 2, pandas: 0
pandas is trying to fight back: v 2.0 came with optional Arrow support as an alternative to NumPy, and Arrow has since taken on a larger role (notably for strings), but performance remains well below that of Polars (e.g. in DataCamp benchmarks, official benchmarks, and many blog posts covering whole scripts or individual tasks).
Comparison with other frameworks
Comparisons between Polars and distributed (Dask, Ray, Spark) or GPU (RAPIDS) libraries aren’t the most pertinent since they can be used in combination with Polars and the benefits can thus be combined.
It only makes sense to compare Polars with other libraries occupying the same “niche” such as pandas or Vaex.
For Vaex, one benchmark found it to be twice as slow, but this could have changed with recent developments.
One framework performing better than Polars in some benchmarks is datatable (derived from the R package data.table), but it hasn’t been developed for a year, a sharp contrast with the fast development of Polars.
Table visualization
While pandas comes with internal capabilities to make publication-ready tables, Polars integrates very well with great-tables to achieve the same goal.