RIP pandas, welcome Polars

Marie-Hélène Burle

April 7, 2026


Goal of this webinar

There is a new and much better library for Python DataFrames

There are no downsides to using it besides the effort of changing habits. For new users who don’t have habits yet, there are no downsides at all

Yet, all Python intro courses still teach the old tool

In this webinar, I will not teach Polars and its syntax. Instead I will demo why it is better

My goal is to help shift the culture towards a wider adoption of Polars instead of pandas

A brief history of DataFrames

Overview

                        pandas              Polars
Available for           Python              Rust, Python, R, NodeJS
Written in              Cython              Rust
Multithreading          Some operations     Yes (GIL released)
Index                   Rows are indexed    Integer positions are used
Evaluation              Eager               Eager and lazy
Query optimizer         No                  Yes
Out-of-core             No                  Yes
SIMD vectorization      Yes                 Yes
Data in memory          NumPy arrays        Apache Arrow arrays
Memory efficiency       Poor                Excellent
Missing data handling   Inconsistent        Consistent; type stability

Missing data

For historical reasons, a missing value in pandas is, depending on the dtype, one of:

  • NaN (numpy.nan, dtype float)
  • None
  • NaT (Not-a-Time, dtype datetime)
  • pandas.NA (nullable scalar)

This leads to inconsistent behaviour and unexpected dtype changes

Ongoing open issue since 2020


One single missing value: null

Simple and consistent behaviour across all data types and better performance

NaN exists and is a float. It is not missing data, but the result of mathematical operations that don’t return numbers

Here is a blog post that goes into this in detail

Group by

Here is a great blog post by Marco Gorelli comparing how easily expressions can be passed to groups in Polars versus pandas

I am using his code here and doing some timing on it to look at efficiency

Create DataFrames

Import libraries:

import pandas as pd
import polars as pl

The pandas DataFrame:

df_pd = pd.DataFrame(
    {
        "id": [1, 1, 1, 2, 2, 2],
        "sales": [4, 1, 2, 7, 6, 7],
        "views": [3, 1, 2, 8, 6, 7]
    }
)

The Polars equivalent:

df_pl = pl.DataFrame(
    {
        "id": [1, 1, 1, 2, 2, 2],
        "sales": [4, 1, 2, 7, 6, 7],
        "views": [3, 1, 2, 8, 6, 7]
    }
)

The pandas DataFrame:

df_pd
id sales views
0 1 4 3
1 1 1 1
2 1 2 2
3 2 7 8
4 2 6 6
5 2 7 7

The Polars equivalent:

df_pl
shape: (6, 3)
id sales views
i64 i64 i64
1 4 3
1 1 1
1 2 2
2 7 8
2 6 6
2 7 7

Marco Gorelli’s example

Find the maximum value of views, where sales is greater than its mean, per id

Solutions in pandas

The straightforward method most people will use:

df_pd.groupby('id').apply(
    lambda df_pd: df_pd[df_pd['sales'] > df_pd['sales'].mean()]['views'].max()
)
id
1    3
2    8
dtype: int64

Solutions in pandas

Another option, but it requires 2 groupby expressions:

df_pd[
    df_pd['sales'] > df_pd.groupby('id')['sales'].transform('mean')
].groupby('id')['views'].max()
id
1    3
2    8
Name: views, dtype: int64

Solutions in pandas

A solution people are unlikely to come up with:

gb = df_pd.groupby("id")
mask = df_pd["sales"] > gb["sales"].transform("mean")
df_pd["result"] = df_pd["views"].where(mask)
gb["result"].max()
id
1    3.0
2    8.0
Name: result, dtype: float64

The solution in Polars

df_pl.group_by('id').agg(
    pl.col('views').filter(pl.col('sales') > pl.mean('sales')).max()
)
shape: (2, 2)
id views
i64 i64
2 8
1 3

Much simpler …

Timing of pandas solutions

%%timeit

df_pd.groupby('id').apply(
    lambda df_pd: df_pd[df_pd['sales'] > df_pd['sales'].mean()]['views'].max()
)
1.04 ms ± 1.33 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

The solution most people are likely to use is by far the slowest

Timing of pandas solutions

%%timeit

df_pd[
    df_pd['sales'] > df_pd.groupby('id')['sales'].transform('mean')
].groupby('id')['views'].max()
678 μs ± 4.04 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

The best solution uses 2 groupby expressions

Timing of pandas solutions

%%timeit

gb = df_pd.groupby("id")
mask = df_pd["sales"] > gb["sales"].transform("mean")
df_pd["result"] = df_pd["views"].where(mask)
gb["result"].max()
711 μs ± 952 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

This is almost as fast as the best solution

Timing of the Polars solution

%%timeit

df_pl.group_by('id').agg(
    pl.col('views').filter(pl.col('sales') > pl.mean('sales')).max()
)
110 μs ± 3.6 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Polars is more straightforward and much faster

Conclusion

Speedup of Polars compared to the best pandas method: 6
Compared to the method most people are likely to use: 9

To pass expressions to groups, Polars has a more straightforward and efficient syntax

General performance comparisons

A note on my comparisons

The performance of tools can only be evaluated on optimized code: poorly written pandas code is a lot slower than efficient pandas code

To ensure fairness towards pandas, I use the best pandas code from Alex Razoumov’s pandas course, in which he benchmarks various syntaxes

Example 1: FizzBuzz

FizzBuzz is a programming exercise based on a children’s game that consists of counting to n, replacing:

  • any number divisible by 3 with Fizz
  • any number divisible by 5 with Buzz
  • any number divisible by both 3 and 5 with FizzBuzz

Let’s do it:

# Load additional library
import numpy as np

# Set series length
n = 10_000

Code

df_pd = pd.DataFrame()

df_pd["Count"] = np.arange(1, n+1)

df_pd["FizzBuzz"] = df_pd["Count"].astype(str)

df_pd.loc[df_pd["Count"] % 3 == 0, "FizzBuzz"] = "Fizz"
df_pd.loc[df_pd["Count"] % 5 == 0, "FizzBuzz"] = "Buzz"
df_pd.loc[df_pd["Count"] % 15 == 0, "FizzBuzz"] = "FizzBuzz"

df_pl = pl.DataFrame({"Count": np.arange(1, n+1)})

df_pl.with_columns(pl.col("Count").cast(pl.String).alias("FizzBuzz"))

# In a when/then chain the first matching branch wins, so the % 15
# test must come first (unlike the sequential pandas .loc writes above,
# where the last write wins)
df_pl = df_pl.with_columns(
    pl.when(pl.col("Count") % 15 == 0).then(pl.lit("FizzBuzz"))
    .when(pl.col("Count") % 3 == 0).then(pl.lit("Fizz"))
    .when(pl.col("Count") % 5 == 0).then(pl.lit("Buzz"))
    .otherwise(pl.col("Count").cast(pl.String)).alias("FizzBuzz")
)

Lazy evaluation

But it gets much better with lazy evaluation

First, we create a LazyFrame instead of a DataFrame. The query is not evaluated right away; instead, a computation graph is built. This lets the query optimizer combine operations and apply optimizations where possible, much the way compilers do

To evaluate the query and get a result, we use the collect method

# Lazy API

df_pl_lazy = pl.LazyFrame({"Count": np.arange(1, n+1)})

df_pl_lazy.with_columns(pl.col("Count").cast(pl.String).alias("FizzBuzz"))

df_pl_lazy = df_pl_lazy.with_columns(
    pl.when(pl.col("Count") % 15 == 0).then(pl.lit("FizzBuzz"))
    .when(pl.col("Count") % 3 == 0).then(pl.lit("Fizz"))
    .when(pl.col("Count") % 5 == 0).then(pl.lit("Buzz"))
    .otherwise(pl.col("Count").cast(pl.String)).alias("FizzBuzz")
)

df_pl_lazy_collected = df_pl_lazy.collect()

Results

df_pd   # pandas
Count FizzBuzz
0 1 1
1 2 2
2 3 Fizz
3 4 4
4 5 Buzz
... ... ...
9995 9996 Fizz
9996 9997 9997
9997 9998 9998
9998 9999 Fizz
9999 10000 Buzz

10000 rows × 2 columns

df_pl   # and df_pl_lazy_collected
shape: (10_000, 2)
Count FizzBuzz
i64 str
1 "1"
2 "2"
3 "Fizz"
4 "4"
5 "Buzz"
9996 "Fizz"
9997 "9997"
9998 "9998"
9999 "Fizz"
10000 "Buzz"

Timing

%%timeit

df_pd = pd.DataFrame()

df_pd["Count"] = np.arange(1, n+1)

df_pd["FizzBuzz"] = df_pd["Count"].astype(str)

df_pd.loc[df_pd["Count"] % 3 == 0, "FizzBuzz"] = "Fizz"
df_pd.loc[df_pd["Count"] % 5 == 0, "FizzBuzz"] = "Buzz"
df_pd.loc[df_pd["Count"] % 15 == 0, "FizzBuzz"] = "FizzBuzz"
3.4 ms ± 65.4 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit

df_pl = pl.DataFrame({"Count": np.arange(1, n+1)})

df_pl.with_columns(pl.col("Count").cast(pl.String).alias("FizzBuzz"))

df_pl.with_columns(
    pl.when(pl.col("Count") % 15 == 0).then(pl.lit("FizzBuzz"))
    .when(pl.col("Count") % 3 == 0).then(pl.lit("Fizz"))
    .when(pl.col("Count") % 5 == 0).then(pl.lit("Buzz"))
    .otherwise(pl.col("Count").cast(pl.String)).alias("FizzBuzz")
)
648 μs ± 2.21 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%%timeit

# Lazy API

df_pl_lazy = pl.LazyFrame({"Count": np.arange(1, n+1)})

df_pl_lazy.with_columns(pl.col("Count").cast(pl.String).alias("FizzBuzz"))

df_pl_lazy = df_pl_lazy.with_columns(
    pl.when(pl.col("Count") % 15 == 0).then(pl.lit("FizzBuzz"))
    .when(pl.col("Count") % 3 == 0).then(pl.lit("Fizz"))
    .when(pl.col("Count") % 5 == 0).then(pl.lit("Buzz"))
    .otherwise(pl.col("Count").cast(pl.String)).alias("FizzBuzz")
)

df_pl = df_pl_lazy.collect()
462 μs ± 4.05 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Speedup

Speedup with Polars DataFrame: 5
With Polars LazyFrame: 7

These speedups highly depend on the types and sizes of the problems

How much faster Polars is varies; that it is always much faster does not

Memory usage

pandas

df_pd.memory_usage()
Index         132
Count       80000
FizzBuzz    80000
dtype: int64

132 + 80,000 * 2 =
160,132 bytes
≈ 160 KB

Polars

df_pl.estimated_size()
119409

119,409 bytes
≈ 119 KB

160,132 / 119,409 ≈ 1.3
Footprint 1.3 times smaller

Polars lazy API

Same as Polars eager:
if you collect the entire DataFrame as we did here, the lazy API does not reduce the memory footprint

Example 2: medium dataset

Here I am using a jeopardy dataset that Alex uses in his pandas course

Shape: 216,930 rows, 7 columns

The goal is to subset this DataFrame for the history category and get the shape of the resulting DataFrame

Code

df_pd = pd.read_csv(
    "https://raw.githubusercontent.com/razoumov/publish/master/jeopardy.csv"
)

df_pd_subset = df_pd.loc[df_pd["Category"] == "HISTORY"]

df_pd_subset.shape
(349, 7)

df_pl = pl.read_csv(
    "https://raw.githubusercontent.com/razoumov/publish/master/jeopardy.csv"
)

df_pl_subset = df_pl.filter(pl.col("Category") == "HISTORY")

df_pl_subset.shape
(349, 7)

# Lazy API

df_pl_lazy = pl.scan_csv(
    "https://raw.githubusercontent.com/razoumov/publish/master/jeopardy.csv"
)

df_pl_lazy_subset_collected = df_pl_lazy.filter(
    pl.col("Category") == "HISTORY"
).collect()

df_pl_lazy_subset_collected.shape
(349, 7)

To create a LazyFrame from a file, use one of the scan_* methods instead of the read_* methods

Timing

%%timeit

df_pd = pd.read_csv(
    "https://raw.githubusercontent.com/razoumov/publish/master/jeopardy.csv"
)

df_pd.loc[df_pd["Category"] == "HISTORY"].shape
1.17 s ± 5.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit

df_pl = pl.read_csv(
    "https://raw.githubusercontent.com/razoumov/publish/master/jeopardy.csv"
)

df_pl.filter(pl.col("Category") == "HISTORY").shape
778 ms ± 30.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit

# Lazy API

df_pl_lazy = pl.scan_csv(
    "https://raw.githubusercontent.com/razoumov/publish/master/jeopardy.csv"
)

df_pl_lazy.filter(pl.col("Category") == "HISTORY").collect().shape
67.7 ms ± 3.87 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Speedup

Polars here brings a speedup of 1.5

The lazy API ramps that up to a speedup of 17

Memory usage

df_pd.memory_usage()
Index              132
Show Number    1735440
Air Date       1735440
Round          1735440
Category       1735440
Value          1735440
Question       1735440
Answer         1735440
dtype: int64

132 + 1,735,440 * 7 =
12,148,212 bytes
≈ 12 MB

df_pl.estimated_size()
31606250

31,606,250 bytes
≈ 31 MB

31,606,250 / 12,148,212 ≈ 2.6
Footprint 2.6 times bigger

(Caveat: for string columns, pandas’ memory_usage() counts only the 8-byte object pointers; memory_usage(deep=True) would also count the string data itself, so the pandas figure above is a large underestimate)

df_pl_lazy_subset_collected.estimated_size()
48774

48,774 bytes
≈ 49 KB

12,148,212 / 48,774 ≈ 249
Footprint 249 times smaller

If you only collect a subset of the data, the memory footprint drops dramatically

Data too big to fit in memory

Example of large dataset

Let’s play with data from the GBIF website for free and open access to biodiversity data

The Southern African Bird Atlas Project 2 [1] contains a raw CSV file of 42 GB and an interpreted CSV file of 12.3 GB. Let’s look at the latter:

Shape: 25,687,526 rows, 50 columns

From that dataset, I want a list of species from the genus Passer (Old World sparrows)

Side note: the importance of file format

Text-based formats (CSV, JSON) are not suitable for large files

Apache Parquet is a binary, machine-optimized, column-oriented file format with efficient encoding and compression and it is considered the industry-standard file format for large tabular data

                            File type     Size
File downloaded from GBIF   Zipped CSV    2.2 GB
Uncompressed file           CSV           12.3 GB
Ideal file format           Parquet       0.5 GB


In addition to being roughly 25 times (!!) smaller than the uncompressed CSV, the Parquet file is a lot faster to read & write

Format conversion


Here is a way to convert this file if it fits in memory:

# Read in the CSV file in a Polars DataFrame
df = pl.read_csv("sa_birds.csv", separator="\t")

# Export the DataFrame in a Parquet file
df.write_parquet("sa_birds.parquet")


If the file doesn’t fit in memory, you can use Dask (which also uses Apache Arrow) on a distributed system

Native Apache Arrow support

With its native support for the columnar Apache Arrow format, Polars is ideal to work with Parquet files

pandas can work with Parquet files but this can lead to issues due to the inconsistent way it handles missing data (e.g. see this SO question, this issue, this post)

%%timeit

# pandas
sa_birds_pd = pd.read_parquet(
    "sa_birds.parquet"
)
1min 10s ± 7.76 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit

# Polars
sa_birds_pl = pl.read_parquet(
    "sa_birds.parquet"
)
2.47 s ± 756 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Speedup: 70 / 2.47 ≈ 28    Reading Parquet files in pandas is also A LOT slower!

RAM required

pandas

sa_birds_pd.memory_usage()
Index                          132
gbifID                   205500208
datasetKey               205500208
occurrenceID             205500208
kingdom                  205500208
phylum                   205500208
class                    205500208
order                    205500208
...
dtype: int64

132 + 205,500,208 * 50 =
10,275,010,532 bytes
≈ 10 GB

Polars

sa_birds_pl.estimated_size()
12291023880

12,291,023,880 bytes
≈ 12 GB

RAM required

Let’s imagine that you have only 16 GB of RAM on your machine

Running a browser and a few small applications takes about 6 GB of that. This dataset is already too big to process on your laptop

If you need to play with the raw data or a larger dataset such as the eBird dataset of 1,775,781,186 rows and 50 columns (CSV file of 3 TB for the raw data), you will need A LOT of memory

What are your options?

In pandas or Polars:

  • You can select only the columns you are interested in
  • You can read and process the file in chunks and try to combine the results
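The chunked approach can be sketched in pandas (toy in-memory CSV and made-up column names):

```python
import io
import pandas as pd

# Toy CSV standing in for a file too large to load at once
csv = io.StringIO("genus,count\nPasser,3\nCorvus,7\nPasser,1\n")

# Read 2 rows at a time and combine the partial results
total = 0
for chunk in pd.read_csv(csv, chunksize=2):
    total += chunk.loc[chunk["genus"] == "Passer", "count"].sum()

print(total)  # 4
```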

If you only need to run queries on the data however, the Polars lazy API is the answer

Create a LazyFrame

You can create a LazyFrame (no impact on memory, LazyFrame returned instantly):

df_pl_lazy = pl.scan_parquet("sa_birds.parquet")


Run a query on it (no impact on memory, LazyFrame returned instantly):

passer_df_lazy = df_pl_lazy.filter(
    pl.col("genus") == "Passer"
).select(pl.col("species")).unique()

Collect result

Now, you collect the result (this uses memory and takes time to compute) and turn the DataFrame into a Python list:

passer_ls = passer_df_lazy.collect().get_column("species").to_list()


Depending on the subset returned by your query, the memory usage will vary. Here, it is minuscule (79 bytes):

passer_df_lazy.collect().estimated_size()
79

Et voilà

print(passer_ls)
['Passer griseus', 'Passer diffusus', 'Passer melanurus', 'Passer domesticus', 'Passer motitensis']

With this method, as long as the data itself fits on a drive and the queries don’t return huge subsets of data, you can run queries on giant datasets such as the 3 TB eBird dataset with a small amount of RAM

Note that the queries can contain complex expressions. What matters is the size of the subset you need to collect at the end

Conclusion

Polars:

  • Handles missing data in a consistent and sensible way
  • Has a clear syntax
  • Allows passing expressions to group by in a convenient way
  • Is very fast
  • Uses multithreading automatically
  • Is extremely memory efficient with the lazy API
  • Uses the industry standard Apache Arrow columnar format
  • Works perfectly with Parquet files
  • Works out-of-core with the lazy API

Conclusion

Start using Polars now!

And use the lazy API!


Reference

1.
GBIF.org (2026) Occurrence download