Comparison with pandas
As pandas was for a long time the only DataFrame library for Python, many Python users are familiar with it, and a comparison with Polars might be useful.
Overview
| | pandas | Polars |
|---|---|---|
| Available for | Python | Rust, Python, R, NodeJS |
| Written in | Cython | Rust |
| Multithreading | Some operations | Yes (GIL released) |
| Index | Rows are indexed | Integer positions are used |
| Evaluation | Eager | Eager and lazy |
| Query optimizer | No | Yes |
| Out-of-core | No | Yes |
| SIMD vectorization | Yes | Yes |
| Data in memory | With NumPy arrays | With Apache Arrow arrays |
| Memory efficiency | Poor | Excellent |
| Handling of missing data | Inconsistent | Consistent, promotes type stability |
Missing data
In pandas, for historical reasons, missing values are, depending on dtype, one of:

- NaN (numpy.nan, dtype float),
- None,
- NaT (Not-a-Time, dtype datetime),
- pandas.NA (nullable scalar).

This leads to inconsistent behaviour and unexpected dtype changes. There has been an ongoing open issue about this since 2020.

In Polars, there is one single missing value: null. This gives simple and consistent behaviour across all data types, and better performance. NaN exists and is a float: it is not missing data, but the result of mathematical operations that don't return numbers. Here is a blog post that goes into this in detail.
Group by
Here is a great blog post by Marco Gorelli comparing how much easier it is to pass expressions to groups in Polars than in pandas. I am reusing his code here and timing it to compare efficiency.
Create DataFrames
Import libraries:

import pandas as pd
import polars as pl
The pandas DataFrame:
df_pd = pd.DataFrame(
{
"id": [1, 1, 1, 2, 2, 2],
"sales": [4, 1, 2, 7, 6, 7],
"views": [3, 1, 2, 8, 6, 7]
}
)

The Polars equivalent:
df_pl = pl.DataFrame(
{
"id": [1, 1, 1, 2, 2, 2],
"sales": [4, 1, 2, 7, 6, 7],
"views": [3, 1, 2, 8, 6, 7]
}
)

The pandas DataFrame:

print(df_pd)
   id  sales  views
0   1      4      3
1   1      1      1
2   1      2      2
3   2      7      8
4   2      6      6
5   2      7      7
The Polars equivalent:
print(df_pl)
shape: (6, 3)
┌─────┬───────┬───────┐
│ id ┆ sales ┆ views │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═══════╪═══════╡
│ 1 ┆ 4 ┆ 3 │
│ 1 ┆ 1 ┆ 1 │
│ 1 ┆ 2 ┆ 2 │
│ 2 ┆ 7 ┆ 8 │
│ 2 ┆ 6 ┆ 6 │
│ 2 ┆ 7 ┆ 7 │
└─────┴───────┴───────┘
Marco Gorelli’s example
Find the maximum value of views, where sales is greater than its mean, per id.
Solutions in pandas
The straightforward method most people will use:
df_pd.groupby('id').apply(
lambda df_pd: df_pd[df_pd['sales'] > df_pd['sales'].mean()]['views'].max(),
include_groups=False
)

id
1    3
2    8
dtype: int64
Another option, but it requires 2 groupby expressions:
df_pd[
df_pd['sales'] > df_pd.groupby('id')['sales'].transform('mean')
].groupby('id')['views'].max()

id
1    3
2    8
Name: views, dtype: int64
A solution people are unlikely to come up with:
gb = df_pd.groupby("id")
mask = df_pd["sales"] > gb["sales"].transform("mean")
df_pd["result"] = df_pd["views"].where(mask)
gb["result"].max()

id
1    3.0
2    8.0
Name: result, dtype: float64
The solution in Polars
df_pl.group_by('id').agg(
pl.col('views').filter(pl.col('sales') > pl.mean('sales')).max()
)
shape: (2, 2)
┌─────┬───────┐
│ id ┆ views │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═══════╡
│ 2 ┆ 8 │
│ 1 ┆ 3 │
└─────┴───────┘
Much simpler …
Timing of pandas solutions
%%timeit
df_pd.groupby('id').apply(
lambda df_pd: df_pd[df_pd['sales'] > df_pd['sales'].mean()]['views'].max(),
include_groups=False
)
1.04 ms ± 1.33 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
The solution most people are likely to use is by far the slowest.
%%timeit
df_pd[
df_pd['sales'] > df_pd.groupby('id')['sales'].transform('mean')
].groupby('id')['views'].max()
678 μs ± 4.04 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
The best solution uses 2 groupby expressions.
%%timeit
gb = df_pd.groupby("id")
mask = df_pd["sales"] > gb["sales"].transform("mean")
df_pd["result"] = df_pd["views"].where(mask)
gb["result"].max()
711 μs ± 952 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
This is almost as fast as the best solution.
Timing of the Polars solution
%%timeit
df_pl.group_by('id').agg(
pl.col('views').filter(pl.col('sales') > pl.mean('sales')).max()
)
110 μs ± 3.6 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Polars is more straightforward and much faster.
Conclusion
Speedup of Polars compared to the best pandas method: 6.
Speedup compared to the method most people are likely to use: 9.
To pass expressions to groups, Polars has a more straightforward and efficient syntax.
General performance comparisons
The performance of tools can only be evaluated on optimized code: poorly written pandas code is a lot slower than efficient pandas code.
To ensure fairness towards pandas, I use the best pandas code from Alex Razoumov's pandas course, in which he benchmarks various syntaxes.
Example 1: FizzBuzz
FizzBuzz is a programming exercise based on a children's game which consists of counting to n, replacing:

- any number divisible by 3 with Fizz,
- any number divisible by 5 with Buzz,
- any number divisible by both 3 and 5 with FizzBuzz.
In his pandas course, Alex compares multiple methods and shows that the best method uses masks. Let’s see how Polars fares in comparison to pandas’ best method.
First, let’s load the packages we need and set the length of the series:
import pandas as pd
import numpy as np
import polars as pl
n = 10_000

And let's make sure that the code works.
With pandas:
df_pd = pd.DataFrame()
df_pd["Count"] = np.arange(1, n+1)
df_pd["FizzBuzz"] = df_pd["Count"].astype(str)
df_pd.loc[df_pd["Count"] % 3 == 0, "FizzBuzz"] = "Fizz"
df_pd.loc[df_pd["Count"] % 5 == 0, "FizzBuzz"] = "Buzz"
df_pd.loc[df_pd["Count"] % 15 == 0, "FizzBuzz"] = "FizzBuzz"
print(df_pd)
      Count FizzBuzz
0         1        1
1         2        2
2         3     Fizz
3         4        4
4         5     Buzz
...     ...      ...
9995   9996     Fizz
9996   9997     9997
9997   9998     9998
9998   9999     Fizz
9999  10000     Buzz

[10000 rows x 2 columns]
With Polars:
df_pl = pl.DataFrame({"Count": np.arange(1, n+1)})
df_pl = df_pl.with_columns(
    # test the most specific condition (divisible by 15) first,
    # since the first matching branch of a when/then chain wins
    pl.when(pl.col("Count") % 15 == 0).then(pl.lit("FizzBuzz"))
    .when(pl.col("Count") % 3 == 0).then(pl.lit("Fizz"))
    .when(pl.col("Count") % 5 == 0).then(pl.lit("Buzz"))
    .otherwise(pl.col("Count").cast(pl.String)).alias("FizzBuzz")
)
print(df_pl)
shape: (10_000, 2)
┌───────┬──────────┐
│ Count ┆ FizzBuzz │
│ --- ┆ --- │
│ i64 ┆ str │
╞═══════╪══════════╡
│ 1 ┆ 1 │
│ 2 ┆ 2 │
│ 3 ┆ Fizz │
│ 4 ┆ 4 │
│ 5 ┆ Buzz │
│ … ┆ … │
│ 9996 ┆ Fizz │
│ 9997 ┆ 9997 │
│ 9998 ┆ 9998 │
│ 9999 ┆ Fizz │
│ 10000 ┆ Buzz │
└───────┴──────────┘
But it gets much better with lazy evaluation. First, we create a LazyFrame instead of a DataFrame. The query is not evaluated but a graph is created. This allows the query optimizer to combine operations and perform optimizations where possible, very much the way compilers work. To evaluate the query and get a result, we use the collect method.
Let’s make sure that the lazy Polars code gives us the same result:
With the Polars lazy API:
df_pl_lazy = pl.LazyFrame({"Count": np.arange(1, n+1)})
df_pl_lazy = df_pl_lazy.with_columns(
    pl.when(pl.col("Count") % 15 == 0).then(pl.lit("FizzBuzz"))
    .when(pl.col("Count") % 3 == 0).then(pl.lit("Fizz"))
    .when(pl.col("Count") % 5 == 0).then(pl.lit("Buzz"))
    .otherwise(pl.col("Count").cast(pl.String)).alias("FizzBuzz")
)
df_pl_lazy_collected = df_pl_lazy.collect()
print(df_pl_lazy_collected)
shape: (10_000, 2)
┌───────┬──────────┐
│ Count ┆ FizzBuzz │
│ --- ┆ --- │
│ i64 ┆ str │
╞═══════╪══════════╡
│ 1 ┆ 1 │
│ 2 ┆ 2 │
│ 3 ┆ Fizz │
│ 4 ┆ 4 │
│ 5 ┆ Buzz │
│ … ┆ … │
│ 9996 ┆ Fizz │
│ 9997 ┆ 9997 │
│ 9998 ┆ 9998 │
│ 9999 ┆ Fizz │
│ 10000 ┆ Buzz │
└───────┴──────────┘
Timing
pandas:
%%timeit
df_pd = pd.DataFrame()
df_pd["Count"] = np.arange(1, n+1)
df_pd["FizzBuzz"] = df_pd["Count"].astype(str)
df_pd.loc[df_pd["Count"] % 3 == 0, "FizzBuzz"] = "Fizz"
df_pd.loc[df_pd["Count"] % 5 == 0, "FizzBuzz"] = "Buzz"
df_pd.loc[df_pd["Count"] % 15 == 0, "FizzBuzz"] = "FizzBuzz"
3.4 ms ± 65.4 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Polars:
%%timeit
df_pl = pl.DataFrame({"Count": np.arange(1, n+1)})
df_pl = df_pl.with_columns(
    pl.when(pl.col("Count") % 15 == 0).then(pl.lit("FizzBuzz"))
    .when(pl.col("Count") % 3 == 0).then(pl.lit("Fizz"))
    .when(pl.col("Count") % 5 == 0).then(pl.lit("Buzz"))
    .otherwise(pl.col("Count").cast(pl.String)).alias("FizzBuzz")
)
648 μs ± 2.21 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Polars lazy API:
%%timeit
df_pl_lazy = pl.LazyFrame({"Count": np.arange(1, n+1)})
df_pl_lazy = df_pl_lazy.with_columns(
    pl.when(pl.col("Count") % 15 == 0).then(pl.lit("FizzBuzz"))
    .when(pl.col("Count") % 3 == 0).then(pl.lit("Fizz"))
    .when(pl.col("Count") % 5 == 0).then(pl.lit("Buzz"))
    .otherwise(pl.col("Count").cast(pl.String)).alias("FizzBuzz")
)
df_pl_lazy_collected = df_pl_lazy.collect()
460 μs ± 1.79 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
That's a speedup of 5 for the eager API and a speedup of 7 for the lazy API.
These speedups depend a lot on the type and size of the problem: how much faster Polars is varies, but it is consistently much faster.
Memory usage
pandas
df_pd.memory_usage()

Index         132
Count       80000
FizzBuzz    80000
dtype: int64

132 + 80,000 × 2 = 160,132 bytes ≈ 160 KB
Polars
df_pl.estimated_size()
119409

119,409 bytes ≈ 119 KB

160,132 / 119,409 ≈ 1.3: the footprint is 1.3 times smaller.
Polars lazy API
Same as Polars eager: if you collect the entire DataFrame, as we did here, the lazy API does not reduce the memory footprint.
Example 2: medium dataset
Here I am using a jeopardy dataset that Alex uses in his pandas course.
Shape: 216,930 rows, 7 columns.
The goal is to subset this DataFrame for the history category and get the shape of the resulting DataFrame.
First, let’s make sure that the code works.
pandas:
df_pd = pd.read_csv("https://raw.githubusercontent.com/razoumov/publish/master/jeopardy.csv")
df_pd.loc[df_pd["Category"] == "HISTORY"].shape
(349, 7)
Polars:
df_pl = pl.read_csv("https://raw.githubusercontent.com/razoumov/publish/master/jeopardy.csv")
df_pl.filter(pl.col("Category") == "HISTORY").shape
(349, 7)
Polars lazy API:
df_pl_lazy = pl.scan_csv("https://raw.githubusercontent.com/razoumov/publish/master/jeopardy.csv")
df_pl_lazy_subset_collected = df_pl_lazy.filter(pl.col("Category") == "HISTORY").collect()
df_pl_lazy_subset_collected.shape
(349, 7)
To create a LazyFrame from a file you use one of the scan_* methods instead of the read_* methods.
Timing
pandas:
%%timeit
df_pd = pd.read_csv("https://raw.githubusercontent.com/razoumov/publish/master/jeopardy.csv")
df_pd.loc[df_pd["Category"] == "HISTORY"].shape
1.17 s ± 5.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Polars:
%%timeit
df_pl = pl.read_csv("https://raw.githubusercontent.com/razoumov/publish/master/jeopardy.csv")
df_pl.filter(pl.col("Category") == "HISTORY").shape
778 ms ± 30.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
That’s a speedup of 1.5.
Polars lazy API timing:
%%timeit
df_pl_lazy = pl.scan_csv("https://raw.githubusercontent.com/razoumov/publish/master/jeopardy.csv")
df_pl_lazy.filter(pl.col("Category") == "HISTORY").collect().shape
67.7 ms ± 3.87 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
With the lazy API, the speedup ramps up to 17.
Memory usage
df_pd.memory_usage()

Index              132
Show Number    1735440
Air Date       1735440
Round          1735440
Category       1735440
Value          1735440
Question       1735440
Answer         1735440
dtype: int64

132 + 1,735,440 × 7 = 12,148,212 bytes ≈ 12 MB
df_pl.estimated_size()
31606250

31,606,250 bytes ≈ 31 MB

31,606,250 / 12,148,212 ≈ 2.6: here the Polars footprint is 2.6 times bigger.
df_pl_lazy_subset_collected.estimated_size()
48774

48,774 bytes ≈ 49 KB

12,148,212 / 48,774 ≈ 249: the footprint is 249 times smaller.
If you only collect a subset of the data, the memory footprint drops dramatically.
Data too big to fit in memory
Example of large dataset
Let’s play with data from the GBIF website for free and open access to biodiversity data.
The Southern African Bird Atlas Project 2 [1] contains a raw CSV file of 42 GB and an interpreted CSV file of 12.3 GB. Let’s look at the latter:
Shape: 25,687,526 rows, 50 columns.
From that dataset, I want a list of species from the genus Passer (Old World sparrows).
The importance of file format
Text-based formats (CSV, JSON) are not suitable for large files.
Apache Parquet is a binary, machine-optimized, column-oriented file format with efficient encoding and compression and it is considered the industry-standard file format for large tabular data.
| | File type | Size |
|---|---|---|
| File downloaded from GBIF | Zipped CSV | 2.2 GB |
| Uncompressed file | CSV | 12.3 GB |
| Ideal file format | Parquet | 0.5 GB |
In addition to being roughly 25 times (!!) smaller than the uncompressed CSV, the Parquet file is a lot faster to read and write.
Format conversion
Here is a way to convert this file if it fits in memory:
# Read the CSV file into a Polars DataFrame
df = pl.read_csv("sa_birds.csv", separator="\t")
# Export the DataFrame to a Parquet file
df.write_parquet("sa_birds.parquet")

If the file doesn't fit in memory, you can use Dask (which also uses Apache Arrow) on a distributed system.
Native Apache Arrow support
With its native support for the columnar Apache Arrow format, Polars is ideal to work with Parquet files.
pandas can work with Parquet files but this can lead to issues due to the inconsistent way it handles missing data (e.g. see this SO question, this issue, this post).
%%timeit
# pandas
sa_birds_pd = pd.read_parquet(
    "sa_birds.parquet"
)
1min 10s ± 7.76 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
# Polars
sa_birds_pl = pl.read_parquet(
    "sa_birds.parquet"
)
2.47 s ± 756 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Speedup: 70 / 2.47 ≈ 28.
Reading Parquet files in pandas is also A LOT slower!
RAM required
pandas
sa_birds_pd.memory_usage()

Index                 132
gbifID          205500208
datasetKey      205500208
occurrenceID    205500208
kingdom         205500208
phylum          205500208
class           205500208
order           205500208
...
dtype: int64

132 + 205,500,208 × 50 = 10,275,010,532 bytes ≈ 10 GB
Polars
sa_birds_pl.estimated_size()
12291023880

12,291,023,880 bytes ≈ 12 GB
Let's imagine that you have only 16 GB of RAM on your machine.
Running a browser and a few small applications takes about 6 GB of that. This dataset is already too big to process on your laptop.
If you need to play with the raw data or a larger dataset such as the eBird dataset of 1,775,781,186 rows and 50 columns (CSV file of 3 TB for the raw data), you will need A LOT of memory.
What are your options?
In pandas or Polars:
- you can select only the columns you are interested in,
- you can read and process the file in chunks and combine the results.
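As a sketch of the second option in pandas, read_csv accepts a chunksize and returns an iterator of DataFrames (the temporary file below is a stand-in for a dataset too big for memory):

```python
import os
import tempfile
import pandas as pd

# Tiny stand-in for a CSV too big to fit in memory.
tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".csv", delete=False)
tmp.write("genus,species\nPasser,domesticus\nTurdus,merula\nPasser,griseus\n")
tmp.close()

# Read the file in chunks of 2 rows and combine the partial results.
species = set()
for chunk in pd.read_csv(tmp.name, chunksize=2):
    species.update(chunk.loc[chunk["genus"] == "Passer", "species"])
print(sorted(species))  # ['domesticus', 'griseus']

os.remove(tmp.name)
```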
If you only need to run queries on the data however, the Polars lazy API is the answer.
Create a LazyFrame
You can create a LazyFrame (no impact on memory, LazyFrame returned instantly):
df_pl_lazy = pl.scan_parquet("sa_birds.parquet")

Run a query on it (no impact on memory, LazyFrame returned instantly):
passer_df_lazy = df_pl_lazy.filter(
pl.col("genus") == "Passer"
).select(pl.col("species")).unique()

Collect result
Now, you collect the result (this uses memory and takes time to compute) and turn the DataFrame into a Python list:
passer_ls = passer_df_lazy.collect().get_column("species").to_list()

Depending on the subset returned by your query, the memory usage will vary. Here, it is minuscule (79 bytes):
passer_df_lazy.collect().estimated_size()
79
Et voilà:

print(passer_ls)
['Passer diffusus', 'Passer motitensis', 'Passer griseus', 'Passer domesticus', 'Passer melanurus']
With this method, as long as the data itself fits on a drive and the queries don’t return huge subsets of data, you can run queries on giant datasets such as the 3 TB eBird dataset with a small amount of RAM.
Note that the queries can contain complex expressions. What matters is the size of the subset you need to collect at the end.
pandas v2
pandas is trying to fight back: v2.0 introduced optional Apache Arrow support as an alternative to the NumPy backend, but performance remains way below that of Polars (e.g. in DataCamp benchmarks, official benchmarks, many blog posts for whole scripts or individual tasks).
And the problems with missing data and syntax remain. Not to mention the lack of a lazy API.
Comparison with other frameworks
Comparisons between Polars and distributed (Dask, Ray, Spark) or GPU (RAPIDS) libraries aren’t the most pertinent since they can be used in combination with Polars and the benefits can thus be combined.
It only makes sense to compare Polars with other libraries occupying the same “niche” such as pandas or Vaex.
For Vaex, some benchmarks found it twice as slow as Polars, but this could have changed with recent developments.
One framework performing better than Polars in some benchmarks is datatable (derived from the R package data.table), but it hasn't seen development for a year, a sharp contrast with the fast development of Polars.
Migrating from pandas
Read the migration guide: it will help you write Polars code rather than "literally translated" pandas code that runs, but doesn't make use of Polars' strengths. The differences in style mostly come from the fact that Polars runs in parallel.
Conclusion
Polars:
- handles missing data in a consistent and sensible way,
- has a clear syntax,
- allows passing expressions to group by in a convenient way,
- is very fast,
- uses multithreading automatically,
- is extremely memory efficient with the lazy API,
- uses the industry standard Apache Arrow columnar format,
- works perfectly with Parquet files,
- works out-of-core with the lazy API.
Start using Polars now!
And use the lazy API.
