Comparison with pandas
As pandas was for a long time the only data frame library for Python, many Python users are familiar with it, and a comparison with Polars might be useful.
Overview
| | pandas | Polars |
|---|---|---|
| Available for | Python | Rust, Python, R, NodeJS |
| Written in | Cython | Rust |
| Multithreading | Some operations | Yes (GIL released) |
| Index | Rows are indexed | Integer positions are used |
| Evaluation | Eager | Eager and lazy |
| Query optimizer | No | Yes |
| Out-of-core | No | Yes |
| SIMD vectorization | Yes | Yes |
| Data in memory | With NumPy arrays | With Apache Arrow arrays |
| Memory efficiency | Poor | Excellent |
| Handling of missing data | Inconsistent | Consistent, promotes type stability |
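
The last row of the table is easy to demonstrate. A minimal sketch: a single missing value forces pandas to silently promote an integer column to float64, while Polars keeps the integer dtype and stores a proper null:

import pandas as pd
import polars as pl

# pandas: None becomes NaN and the whole column is promoted to float64
print(pd.Series([1, 2, None]).dtype)   # float64

# Polars: the column stays Int64 and the missing value is a true null
print(pl.Series([1, 2, None]).dtype)   # Int64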
Performance
Example 1
Let’s use the FizzBuzz problem.
In his pandas course, Alex compares several methods and shows that the fastest one uses masks. Let’s see how Polars fares against that method.
First, let’s load the packages we will need:

import pandas as pd
import numpy as np
import polars as pl

And let’s make sure that the code works.
With pandas:
df_pd = pd.DataFrame()
size = 10_000
df_pd["number"] = np.arange(1, size+1)
df_pd["response"] = df_pd["number"].astype(str)
df_pd.loc[df_pd["number"] % 3 == 0, "response"] = "Fizz"
df_pd.loc[df_pd["number"] % 5 == 0, "response"] = "Buzz"
# the % 15 mask comes last so that it overwrites "Fizz" and "Buzz"
df_pd.loc[df_pd["number"] % 15 == 0, "response"] = "FizzBuzz"
print(df_pd)
number response
0 1 1
1 2 2
2 3 Fizz
3 4 4
4 5 Buzz
... ... ...
9995 9996 Fizz
9996 9997 9997
9997 9998 9998
9998 9999 Fizz
9999 10000 Buzz
[10000 rows x 2 columns]
With Polars:
size = 10_000
df_pl = pl.DataFrame({"number": np.arange(1, size+1)})
df_pl = df_pl.with_columns(pl.col("number").cast(pl.String).alias("response"))
df_pl = df_pl.with_columns(
    # a chained when/then stops at the first match, so the most
    # specific condition (% 15) must be tested before % 3 and % 5
    pl.when(pl.col("number") % 15 == 0)
    .then(pl.lit("FizzBuzz"))
    .when(pl.col("number") % 3 == 0)
    .then(pl.lit("Fizz"))
    .when(pl.col("number") % 5 == 0)
    .then(pl.lit("Buzz"))
    .otherwise(pl.col("response"))
    .alias("response")
)
print(df_pl)
shape: (10_000, 2)
┌────────┬──────────┐
│ number ┆ response │
│ --- ┆ --- │
│ i64 ┆ str │
╞════════╪══════════╡
│ 1 ┆ 1 │
│ 2 ┆ 2 │
│ 3 ┆ Fizz │
│ 4 ┆ 4 │
│ 5 ┆ Buzz │
│ … ┆ … │
│ 9996 ┆ Fizz │
│ 9997 ┆ 9997 │
│ 9998 ┆ 9998 │
│ 9999 ┆ Fizz │
│ 10000 ┆ Buzz │
└────────┴──────────┘
Now, let’s time them.
pandas:
%%timeit
df_pd = pd.DataFrame()
size = 10_000
df_pd["number"] = np.arange(1, size+1)
df_pd["response"] = df_pd["number"].astype(str)
df_pd.loc[df_pd["number"] % 3 == 0, "response"] = "Fizz"
df_pd.loc[df_pd["number"] % 5 == 0, "response"] = "Buzz"
df_pd.loc[df_pd["number"] % 15 == 0, "response"] = "FizzBuzz"
9.11 ms ± 145 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Polars:
%%timeit
size = 10_000
df_pl = pl.DataFrame({"number": np.arange(1, size+1)})
df_pl = df_pl.with_columns(pl.col("number").cast(pl.String).alias("response"))
df_pl = df_pl.with_columns(
    pl.when(pl.col("number") % 15 == 0)
    .then(pl.lit("FizzBuzz"))
    .when(pl.col("number") % 3 == 0)
    .then(pl.lit("Fizz"))
    .when(pl.col("number") % 5 == 0)
    .then(pl.lit("Buzz"))
    .otherwise(pl.col("response"))
    .alias("response")
)
930 μs ± 16.9 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
That’s nearly a tenfold speedup (and the longer the series, the larger this speedup gets).
Example 2
For a second example, let’s go back to the jeopardy example with its large file and compare the timings of pandas and Polars.
First, let’s make sure that the code works.
pandas:
df_pd = pd.read_csv("https://raw.githubusercontent.com/razoumov/publish/master/jeopardy.csv")
df_pd.loc[df_pd["Category"] == "HISTORY"].shape
(349, 7)
Polars:
df_pl = pl.read_csv("https://raw.githubusercontent.com/razoumov/publish/master/jeopardy.csv")
df_pl.filter(pl.col("Category") == "HISTORY").shape
(349, 7)
And now for timings.
pandas:
%%timeit
df_pd = pd.read_csv("https://raw.githubusercontent.com/razoumov/publish/master/jeopardy.csv")
df_pd.loc[df_pd["Category"] == "HISTORY"].shape
1.49 s ± 50.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Polars:
%%timeit
df_pl = pl.read_csv("https://raw.githubusercontent.com/razoumov/publish/master/jeopardy.csv")
df_pl.filter(pl.col("Category") == "HISTORY").shape
817 ms ± 26.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
That’s roughly a twofold speedup.
But it gets much better with lazy evaluation. First, we create a LazyFrame instead of a DataFrame by using scan_csv instead of read_csv. The query is not evaluated right away; instead, Polars builds a query graph. This allows the query optimizer to combine operations and apply optimizations where possible, much the way compilers do. To evaluate the query and get the result, we call the collect method.
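
You can inspect what the optimizer plans to do by calling explain on a LazyFrame (a quick illustration; the exact plan text depends on your Polars version):

query = pl.scan_csv(
    "https://raw.githubusercontent.com/razoumov/publish/master/jeopardy.csv"
).filter(pl.col("Category") == "HISTORY")

# print the optimized query plan; the filter should appear as a selection
# pushed into the CSV scan itself (predicate pushdown)
print(query.explain())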
Let’s make sure that the lazy Polars code gives us the same result:
df_pl = pl.scan_csv("https://raw.githubusercontent.com/razoumov/publish/master/jeopardy.csv")
df_pl.filter(pl.col("Category") == "HISTORY").collect().shape
(349, 7)
Lazy timing:
%%timeit
df_pl = pl.scan_csv("https://raw.githubusercontent.com/razoumov/publish/master/jeopardy.csv")
df_pl.filter(pl.col("Category") == "HISTORY").collect().shape
72.2 ms ± 14.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
That’s a speedup of 20 (the larger the file, the larger this speedup will be).
pandas is trying to fight back: version 2.0 introduced optional PyArrow-backed data types (and Arrow-backed strings are slated to become the default in pandas 3.0), but its performance remains well below that of Polars (e.g. in DataCamp benchmarks, official benchmarks, and many blog posts, whether for whole scripts or individual tasks).
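
If you want to try the Arrow-backed mode yourself, you can opt in when reading a file (available since pandas 2.0; a minimal sketch reusing the jeopardy file):

# read with PyArrow-backed dtypes instead of the default NumPy-backed ones
df_arrow = pd.read_csv(
    "https://raw.githubusercontent.com/razoumov/publish/master/jeopardy.csv",
    dtype_backend="pyarrow",
)
print(df_arrow.dtypes)  # columns now use ArrowDtype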
Comparison with other frameworks
Comparisons between Polars and distributed (Dask, Ray, Spark) or GPU (RAPIDS) libraries aren’t very pertinent: those libraries can be used in combination with Polars, so their benefits stack rather than compete.
It only makes sense to compare Polars with other libraries occupying the same “niche” such as pandas or Vaex.
For Vaex, one benchmark found it about twice as slow as Polars, though this may have changed with more recent releases.
One framework that performs better than Polars in some benchmarks is datatable (derived from the R package data.table), but it hasn’t been developed for a year, in sharp contrast with the fast development of Polars.
Migrating from pandas
Read the migration guide: it will help you write idiomatic Polars code rather than “literally translated” pandas code that runs but doesn’t make use of Polars’ strengths. The differences in style mostly come from the fact that Polars runs in parallel, as the sketch below illustrates.
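
As one flavour of what the guide recommends: independent expressions passed to a single with_columns context can be executed in parallel, whereas one with_columns call per column, the literal translation of pandas assignments, forces them to run sequentially (a minimal sketch with made-up columns a and b):

df = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# “literally translated” pandas style: one column assignment at a time,
# each in its own pass over the data
df_slow = df.with_columns((pl.col("a") * 2).alias("a2"))
df_slow = df_slow.with_columns((pl.col("b") * 3).alias("b3"))

# idiomatic Polars: independent expressions in a single context,
# which Polars can run in parallel
df_fast = df.with_columns(
    (pl.col("a") * 2).alias("a2"),
    (pl.col("b") * 3).alias("b3"),
)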