Comparison with pandas

Author

Marie-Hélène Burle

For a long time, pandas was the only mainstream data frame library for Python, so many Python users are familiar with it and a comparison with Polars is useful.

Overview

| Feature                  | pandas            | Polars                               |
|--------------------------|-------------------|--------------------------------------|
| Available for            | Python            | Rust, Python, R, NodeJS              |
| Written in               | Cython            | Rust                                 |
| Multithreading           | Some operations   | Yes (GIL released)                   |
| Index                    | Rows are indexed  | Integer positions are used           |
| Evaluation               | Eager             | Eager and lazy                       |
| Query optimizer          | No                | Yes                                  |
| Out-of-core              | No                | Yes                                  |
| SIMD vectorization       | Yes               | Yes                                  |
| Data in memory           | With NumPy arrays | With Apache Arrow arrays             |
| Memory efficiency        | Poor              | Excellent                            |
| Handling of missing data | Inconsistent      | Consistent, promotes type stability  |

Performance

Example 1

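Let’s use the FizzBuzz problem: for each integer in a sequence, return “Fizz” if it is a multiple of 3, “Buzz” if it is a multiple of 5, “FizzBuzz” if it is a multiple of both, and the number itself otherwise.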

In his pandas course, Alex compares multiple methods and shows that the best method uses masks. Let’s see how Polars fares in comparison to pandas’ best method.

First, let’s load the packages we will need:

import pandas as pd
import numpy as np
import polars as pl

And let’s make sure that the code works.

With pandas:

df_pd = pd.DataFrame()
size = 10_000
df_pd["number"] = np.arange(1, size+1)
df_pd["response"] = df_pd["number"].astype(str)
df_pd.loc[df_pd["number"] % 3 == 0, "response"] = "Fizz"
df_pd.loc[df_pd["number"] % 5 == 0, "response"] = "Buzz"
df_pd.loc[df_pd["number"] % 15 == 0, "response"] = "FizzBuzz"

print(df_pd)
      number response
0          1        1
1          2        2
2          3     Fizz
3          4        4
4          5     Buzz
...      ...      ...
9995    9996     Fizz
9996    9997     9997
9997    9998     9998
9998    9999     Fizz
9999   10000     Buzz

[10000 rows x 2 columns]

With Polars:

size = 10_000
df_pl = pl.DataFrame({"number": np.arange(1, size+1)})
# when/then clauses are evaluated in order and the first match wins,
# so the most specific condition (multiples of 15) must come first
df_pl = df_pl.with_columns(
    pl.when(pl.col("number") % 15 == 0)
    .then(pl.lit("FizzBuzz"))
    .when(pl.col("number") % 3 == 0)
    .then(pl.lit("Fizz"))
    .when(pl.col("number") % 5 == 0)
    .then(pl.lit("Buzz"))
    # cast explicitly so that every branch returns a string
    .otherwise(pl.col("number").cast(pl.String))
    .alias("response")
)

print(df_pl)
shape: (10_000, 2)
┌────────┬──────────┐
│ number ┆ response │
│ ---    ┆ ---      │
│ i64    ┆ str      │
╞════════╪══════════╡
│ 1      ┆ 1        │
│ 2      ┆ 2        │
│ 3      ┆ Fizz     │
│ 4      ┆ 4        │
│ 5      ┆ Buzz     │
│ …      ┆ …        │
│ 9996   ┆ Fizz     │
│ 9997   ┆ 9997     │
│ 9998   ┆ 9998     │
│ 9999   ┆ Fizz     │
│ 10000  ┆ Buzz     │
└────────┴──────────┘

Now, let’s time them.

pandas:

%%timeit

df_pd = pd.DataFrame()
size = 10_000
df_pd["number"] = np.arange(1, size+1)
df_pd["response"] = df_pd["number"].astype(str)
df_pd.loc[df_pd["number"] % 3 == 0, "response"] = "Fizz"
df_pd.loc[df_pd["number"] % 5 == 0, "response"] = "Buzz"
df_pd.loc[df_pd["number"] % 15 == 0, "response"] = "FizzBuzz"
9.11 ms ± 145 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Polars:

%%timeit

size = 10_000
df_pl = pl.DataFrame({"number": np.arange(1, size+1)})
# the first matching clause wins, so multiples of 15 are tested first
df_pl.with_columns(
    pl.when(pl.col("number") % 15 == 0)
    .then(pl.lit("FizzBuzz"))
    .when(pl.col("number") % 3 == 0)
    .then(pl.lit("Fizz"))
    .when(pl.col("number") % 5 == 0)
    .then(pl.lit("Buzz"))
    .otherwise(pl.col("number").cast(pl.String))
    .alias("response")
)
930 μs ± 16.9 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

That’s a speedup of almost 10 (9.11 ms vs. 930 μs), and the gap tends to widen as the series gets longer.

Example 2

For a second example, let’s go back to the jeopardy example with a large file and compare the timing of pandas and Polars.

First, let’s make sure that the code works.

pandas:

df_pd = pd.read_csv("https://raw.githubusercontent.com/razoumov/publish/master/jeopardy.csv")
df_pd.loc[df_pd["Category"] == "HISTORY"].shape
(349, 7)

Polars:

df_pl = pl.read_csv("https://raw.githubusercontent.com/razoumov/publish/master/jeopardy.csv")
df_pl.filter(pl.col("Category") == "HISTORY").shape
(349, 7)

And now for timings.

pandas:

%%timeit

df_pd = pd.read_csv("https://raw.githubusercontent.com/razoumov/publish/master/jeopardy.csv")
df_pd.loc[df_pd["Category"] == "HISTORY"].shape
1.49 s ± 50.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Polars:

%%timeit

df_pl = pl.read_csv("https://raw.githubusercontent.com/razoumov/publish/master/jeopardy.csv")
df_pl.filter(pl.col("Category") == "HISTORY").shape
817 ms ± 26.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

That’s a speedup of almost 2 (1.49 s vs. 817 ms).

But it gets much better with lazy evaluation. First, we create a LazyFrame instead of a DataFrame by using scan_csv instead of read_csv. The query is not evaluated but a graph is created. This allows the query optimizer to combine operations and perform optimizations where possible, very much the way compilers work. To evaluate the query and get a result, we use the collect method.

Let’s make sure that the lazy Polars code gives us the same result:

df_pl = pl.scan_csv("https://raw.githubusercontent.com/razoumov/publish/master/jeopardy.csv")
df_pl.filter(pl.col("Category") == "HISTORY").collect().shape
(349, 7)

Lazy timing:

%%timeit

df_pl = pl.scan_csv("https://raw.githubusercontent.com/razoumov/publish/master/jeopardy.csv")
df_pl.filter(pl.col("Category") == "HISTORY").collect().shape
72.2 ms ± 14.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

That’s a speedup of about 20 relative to pandas and about 11 relative to eager Polars, and the gain tends to grow with the size of the file.
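
For reference, the speedups in this example follow directly from the timings reported above. The short computation below simply divides the mean times (it ignores the reported standard deviations, so the ratios are approximate).

```python
# Mean timings reported above, converted to seconds
pandas_eager_s = 1.49
polars_eager_s = 817e-3
polars_lazy_s = 72.2e-3

eager_vs_pandas = pandas_eager_s / polars_eager_s
lazy_vs_pandas = pandas_eager_s / polars_lazy_s
lazy_vs_eager = polars_eager_s / polars_lazy_s

print(round(eager_vs_pandas, 1))  # 1.8
print(round(lazy_vs_pandas, 1))   # 20.6
print(round(lazy_vs_eager, 1))    # 11.3
```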

The pandas project is trying to fight back: version 2.0 added optional support for Apache Arrow (through PyArrow) as an alternative to NumPy for storing data, and later releases have kept extending that support. Even so, its performance remains well below that of Polars (e.g. in DataCamp benchmarks, official benchmarks, and many blog posts covering whole scripts or individual tasks).

Comparison with other frameworks

Comparisons between Polars and distributed (Dask, Ray, Spark) or GPU (RAPIDS) libraries aren’t the most pertinent since they can be used in combination with Polars and the benefits can thus be combined.

It only makes sense to compare Polars with other libraries occupying the same “niche” such as pandas or Vaex.

For Vaex, one benchmark found it to be about twice as slow as Polars, but this could have changed with recent developments.

One framework performing better than Polars in some benchmarks is datatable (derived from the R package data.table), but it hasn’t been developed for a year, in sharp contrast with the fast development of Polars.

Migrating from pandas

Read the migration guide: it will help you write Polars code rather than “literally translated” pandas code that runs, but doesn’t make use of Polars’ strengths. The differences in style mostly come from the fact that Polars runs in parallel.