Comparison with pandas

Author

Marie-Hélène Burle

As pandas was for a long time the only data frame library for Python, many Python users are familiar with it, and a comparison with Polars might therefore be useful.

Overview

                          pandas             Polars
Available for             Python             Rust, Python, R, NodeJS
Written in                Cython             Rust
Multithreading            Some operations    Yes (GIL released)
Index                     Rows are indexed   Integer positions are used
Evaluation                Eager only         Lazy and eager
Query optimizer           No                 Yes
Out-of-core               No                 Yes
SIMD vectorization        Yes                Yes
Data in memory            With NumPy arrays  With Apache Arrow arrays
Memory efficiency         Poor               Excellent
Handling of missing data  Inconsistent       Consistent, promotes type stability
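
To make the last two rows of this table concrete, here is a minimal sketch (the values are arbitrary):

import pandas as pd
import polars as pl

# pandas: a missing value silently turns an integer column into float64,
# because NaN is a float
pd.Series([1, 2, None])            # dtype: float64

# Polars: null is a first-class missing value for every dtype,
# so the column keeps its integer type
pl.Series([1, 2, None])            # dtype: i64, with one null

# pandas rows carry an index whose labels survive filtering...
s = pd.Series([10, 20, 30])
s[s > 10].index.tolist()           # [1, 2] -- original labels, not positions

# ...whereas Polars has no index: rows are plain integer positions
t = pl.Series([10, 20, 30])
t.filter(t > 10)                   # just the values 20 and 30, no labels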

Performance

Let’s use the FizzBuzz problem.

In his pandas course, Alex compares multiple methods and shows that the best method uses masks. Let’s see how Polars fares in comparison to pandas’ best method.

First, let’s load the packages we will need:

import pandas as pd
import numpy as np
import polars as pl

And let’s make sure that the code works.

With pandas:

df_pd = pd.DataFrame()
size = 10_000
df_pd["number"] = np.arange(1, size+1)
df_pd["response"] = df_pd["number"].astype(str)
df_pd.loc[df_pd["number"] % 3 == 0, "response"] = "Fizz"
df_pd.loc[df_pd["number"] % 5 == 0, "response"] = "Buzz"
df_pd.loc[df_pd["number"] % 15 == 0, "response"] = "FizzBuzz"

df_pd
      number response
0          1        1
1          2        2
2          3     Fizz
3          4        4
4          5     Buzz
...      ...      ...
9995    9996     Fizz
9996    9997     9997
9997    9998     9998
9998    9999     Fizz
9999   10000     Buzz

[10000 rows x 2 columns]

With Polars:

size = 10_000
df_pl = pl.DataFrame({"number": np.arange(1, size+1)})
df_pl.with_columns(
    pl.when(pl.col("number") % 15 == 0)   # test 15 first: a multiple of 15 also matches 3 and 5
    .then(pl.lit("FizzBuzz"))
    .when(pl.col("number") % 3 == 0)
    .then(pl.lit("Fizz"))
    .when(pl.col("number") % 5 == 0)
    .then(pl.lit("Buzz"))
    .otherwise(pl.col("number").cast(pl.String))  # cast so all branches share the str dtype
    .alias("response")
)
shape: (10_000, 2)
┌────────┬──────────┐
│ number ┆ response │
│ ---    ┆ ---      │
│ i64    ┆ str      │
╞════════╪══════════╡
│ 1      ┆ 1        │
│ 2      ┆ 2        │
│ 3      ┆ Fizz     │
│ 4      ┆ 4        │
│ 5      ┆ Buzz     │
│ …      ┆ …        │
│ 9996   ┆ Fizz     │
│ 9997   ┆ 9997     │
│ 9998   ┆ 9998     │
│ 9999   ┆ Fizz     │
│ 10000  ┆ Buzz     │
└────────┴──────────┘

Now, let’s time them.

pandas:

%%timeit

df_pd = pd.DataFrame()
size = 10_000
df_pd["number"] = np.arange(1, size+1)
df_pd["response"] = df_pd["number"].astype(str)
df_pd.loc[df_pd["number"] % 3 == 0, "response"] = "Fizz"
df_pd.loc[df_pd["number"] % 5 == 0, "response"] = "Buzz"
df_pd.loc[df_pd["number"] % 15 == 0, "response"] = "FizzBuzz"
4.75 ms ± 9.76 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Polars:

%%timeit

size = 10_000
df_pl = pl.DataFrame({"number": np.arange(1, size+1)})
df_pl.with_columns(
    pl.when(pl.col("number") % 15 == 0)
    .then(pl.lit("FizzBuzz"))
    .when(pl.col("number") % 3 == 0)
    .then(pl.lit("Fizz"))
    .when(pl.col("number") % 5 == 0)
    .then(pl.lit("Buzz"))
    .otherwise(pl.col("number").cast(pl.String))
    .alias("response")
)
518 μs ± 580 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

That’s a speedup of almost 10 (the longer the series, the larger this speedup will be).

Polars: 1, pandas: 0

For a second example, let’s go back to the jeopardy example with a large file and compare the timings of pandas and Polars.

pandas:

%%timeit

df_pd = pd.read_csv("https://raw.githubusercontent.com/razoumov/publish/master/jeopardy.csv")
df_pd.loc[df_pd["Category"] == "HISTORY"].shape
887 ms ± 164 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Polars:

%%timeit

df_pl = pl.read_csv("https://raw.githubusercontent.com/razoumov/publish/master/jeopardy.csv")
df_pl.filter(pl.col("Category") == "HISTORY").shape
446 ms ± 89.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

That’s a speedup of about 2.

But it gets even better: Polars supports lazy evaluation.

Lazy evaluation is not yet implemented when reading files from the cloud (Polars is a very young tool, but its functionality is expanding very fast). This means that we cannot test the benefit of lazy evaluation in our example while reading the CSV file from its current location (see https://github.com/pola-rs/polars/issues/13115).

However, I downloaded it to our training cluster so that we can run the test.

First, let’s make sure that the code works.

pandas:

df_pd = pd.read_csv("/project/def-sponsor00/data/jeopardy.csv")
df_pd.loc[df_pd["Category"] == "HISTORY"].shape
(349, 7)

Polars:

df_pl = pl.scan_csv("/project/def-sponsor00/data/jeopardy.csv")
df_pl.filter(pl.col("Category") == "HISTORY").collect().shape
(349, 7)

And now for the timing.

pandas:

%%timeit

df_pd = pd.read_csv("/project/def-sponsor00/data/jeopardy.csv")
df_pd.loc[df_pd["Category"] == "HISTORY"].shape
331 ms ± 2.29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Polars:

%%timeit

df_pl = pl.scan_csv("/project/def-sponsor00/data/jeopardy.csv")
df_pl.filter(pl.col("Category") == "HISTORY").collect().shape
13.1 ms ± 175 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

That’s a speedup of 25 (the larger the file, the larger this speedup will be). This is because pl.scan_csv doesn’t read the file: instead, it returns a lazy query (a LazyFrame) describing the computation. When the query runs, only the part of the file that is actually needed gets read. This can save a lot of time with very large files, and it even makes it possible to work with files too large to fit in memory.

Lazy evaluation also allows the query optimizer to combine operations where possible, much as optimizing compilers do for compiled languages.

To evaluate the lazy query and get a result, we use the collect method.
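
We can also ask Polars to show the optimized query plan before collecting. A minimal sketch, reusing the file path from above:

lazy = (
    pl.scan_csv("/project/def-sponsor00/data/jeopardy.csv")
    .filter(pl.col("Category") == "HISTORY")
)
print(lazy.explain())    # optimized plan: the filter is pushed into the scan itself
lazy.collect().shape     # only now is any data actually read
# Depending on your Polars version, larger-than-memory data can also be
# processed in streaming fashion, e.g. lazy.collect(streaming=True).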

Note that Polars also has a pl.read_csv function if you want to use eager evaluation.

Polars: 2, pandas: 0

pandas is trying to fight back: version 2.0 introduced optional Apache Arrow-backed data types as an alternative to NumPy, and Arrow-backed strings are slated to become the default in pandas 3.0, but performance remains well below that of Polars (e.g. in DataCamp benchmarks, official benchmarks, and many blog posts, for whole scripts as well as individual tasks).
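
For instance, you can opt in to the Arrow engine and Arrow-backed data types when reading a CSV. A minimal sketch with pandas ≥ 2.0 (the file name is hypothetical):

df = pd.read_csv(
    "data.csv",                 # hypothetical file
    engine="pyarrow",           # multithreaded Arrow CSV reader
    dtype_backend="pyarrow",    # store columns as Arrow arrays instead of NumPy
)
df.dtypes                       # e.g. int64[pyarrow], string[pyarrow]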

Comparison with other frameworks

Comparisons between Polars and distributed (Dask, Ray, Spark) or GPU (RAPIDS) libraries aren’t the most pertinent, since those libraries can be used in combination with Polars, so their benefits stack.

It only makes sense to compare Polars with other libraries occupying the same “niche” such as pandas or Vaex.

For Vaex, one benchmark found it about twice as slow as Polars, but this could have changed with recent developments.

One framework performing better than Polars in some benchmarks is datatable (derived from the R package data.table), but it hasn’t seen any development in a year, in sharp contrast with the fast development of Polars.

Table visualization

While pandas comes with built-in capabilities to produce publication-ready tables, Polars integrates very well with great-tables to achieve the same goal.
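
For example, a minimal great-tables sketch (assuming the great_tables package is installed; the data are arbitrary):

from great_tables import GT

tbl = pl.DataFrame({"number": [3, 5, 15], "response": ["Fizz", "Buzz", "FizzBuzz"]})
GT(tbl).tab_header(title="FizzBuzz results")    # renders a styled, display-ready table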