Marie-Hélène Burle
April 7, 2026
There is a new and much better library for Python DataFrames
There are no downsides to using it besides the effort of changing habits. For new users who don’t have habits yet, there are no downsides at all
Yet most Python intro courses still teach the old tool
In this webinar, I will not teach Polars and its syntax. Instead, I will demo why it is better
My goal is to help shift the culture towards a wider adoption of Polars instead of pandas
| | pandas | Polars |
|---|---|---|
| Available for | Python | Rust, Python, R, NodeJS |
| Written in | Cython | Rust |
| Multithreading | Some operations | Yes (GIL released) |
| Index | Rows are indexed | Integer positions are used |
| Evaluation | Eager | Eager and lazy |
| Query optimizer | No | Yes |
| Out-of-core | No | Yes |
| SIMD vectorization | Yes | Yes |
| Data in memory | NumPy arrays | Apache Arrow arrays |
| Memory efficiency | Poor | Excellent |
| Missing data handling | Inconsistent | Consistent; type stability |
For historical reasons, depending on dtype, missing values are one of:

- NaN (numpy.nan, dtype float)
- None
- NaT (Not-a-Time, dtype datetime)
- pandas.NA (nullable scalar)

This leads to inconsistent behaviour and unexpected dtype changes
One single missing value: null
Simple and consistent behaviour across all data types and better performance
NaN exists and is a float. It is not missing data, but the result of mathematical operations that don’t return numbers
Here is a blog post that goes into this in detail
Here is a great blog post by Marco Gorelli comparing the ease of passing expressions to groups in Polars and in pandas
I am using his code here, with some timings added to compare efficiency
Import libraries:
Find the maximum value of `views`, where `sales` is greater than its mean, per `id`
The straightforward method most people will use:
Another option, but it requires 2 groupby expressions:
A solution people are unlikely to come up with:
| id | views |
|---|---|
| i64 | i64 |
| 2 | 8 |
| 1 | 3 |
Much simpler …
1.04 ms ± 1.33 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
The solution most people are likely to use is by far the slowest
678 μs ± 4.04 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
The best solution uses 2 groupby expressions
711 μs ± 952 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
This is almost as fast as the best solution
110 μs ± 3.6 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Polars is more straightforward and much faster
Speedup of Polars compared to the best pandas method: 6
Compared to the method most people are likely to use: 9
To pass expressions to groups, Polars has a more straightforward and efficient syntax
The performance of tools can only be evaluated on optimized code: poorly written pandas code is a lot slower than efficient pandas code
To ensure fairness towards pandas, I use the best pandas code from Alex Razoumov’s pandas course, in which he benchmarks various syntaxes
FizzBuzz is a programming exercise based on a children’s game that consists of counting to n, replacing:

- multiples of 3 with Fizz
- multiples of 5 with Buzz
- multiples of both 3 and 5 with FizzBuzz

Let’s do it:
df_pl = pl.DataFrame({"Count": np.arange(1, n+1)})
df_pl = df_pl.with_columns(
    # divisibility by 15 must be tested first, or Fizz/Buzz would shadow FizzBuzz
    pl.when(pl.col("Count") % 15 == 0).then(pl.lit("FizzBuzz"))
    .when(pl.col("Count") % 3 == 0).then(pl.lit("Fizz"))
    .when(pl.col("Count") % 5 == 0).then(pl.lit("Buzz"))
    .otherwise(pl.col("Count").cast(pl.String)).alias("FizzBuzz")
)

But it gets much better with lazy evaluation
First, we create a LazyFrame instead of a DataFrame. The query is not evaluated but a graph is created. This allows the query optimizer to combine operations and perform optimizations where possible, very much the way compilers work
To evaluate the query and get a result, we use the collect method
# Lazy API
df_pl_lazy = pl.LazyFrame({"Count": np.arange(1, n+1)})
df_pl_lazy = df_pl_lazy.with_columns(
    pl.when(pl.col("Count") % 15 == 0).then(pl.lit("FizzBuzz"))
    .when(pl.col("Count") % 3 == 0).then(pl.lit("Fizz"))
    .when(pl.col("Count") % 5 == 0).then(pl.lit("Buzz"))
    .otherwise(pl.col("Count").cast(pl.String)).alias("FizzBuzz")
)
df_pl_lazy_collected = df_pl_lazy.collect()

pandas:
3.4 ms ± 65.4 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df_pl = pl.DataFrame({"Count": np.arange(1, n+1)})
df_pl.with_columns(
    pl.when(pl.col("Count") % 15 == 0).then(pl.lit("FizzBuzz"))
    .when(pl.col("Count") % 3 == 0).then(pl.lit("Fizz"))
    .when(pl.col("Count") % 5 == 0).then(pl.lit("Buzz"))
    .otherwise(pl.col("Count").cast(pl.String)).alias("FizzBuzz")
)

Polars (eager):
648 μs ± 2.21 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%%timeit
# Lazy API
df_pl_lazy = pl.LazyFrame({"Count": np.arange(1, n+1)})
df_pl_lazy = df_pl_lazy.with_columns(
    pl.when(pl.col("Count") % 15 == 0).then(pl.lit("FizzBuzz"))
    .when(pl.col("Count") % 3 == 0).then(pl.lit("Fizz"))
    .when(pl.col("Count") % 5 == 0).then(pl.lit("Buzz"))
    .otherwise(pl.col("Count").cast(pl.String)).alias("FizzBuzz")
)
df_pl = df_pl_lazy.collect()

Polars (lazy):
462 μs ± 4.05 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Speedup with Polars DataFrame: 5
With Polars LazyFrame: 7
These speedups highly depend on the types and sizes of the problems
How much faster Polars is varies; that it is always much faster does not
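The pandas code used as the baseline is not shown above; a sketch of one common pandas approach (my own, using np.select, with a hypothetical n) could look like this:

```python
import numpy as np
import pandas as pd

n = 100_000  # hypothetical size; the value used in the benchmark is not shown

df_pd = pd.DataFrame({"Count": np.arange(1, n + 1)})
# divisibility by 15 must be tested first, or Fizz/Buzz would win
df_pd["FizzBuzz"] = np.select(
    [df_pd["Count"] % 15 == 0, df_pd["Count"] % 3 == 0, df_pd["Count"] % 5 == 0],
    ["FizzBuzz", "Fizz", "Buzz"],
    default=df_pd["Count"].astype(str),
)
print(df_pd["FizzBuzz"].head(15).tolist())
```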
pandas
132 + 80,000 * 2 = 160,132 bytes ≈ 160 KB

Polars
119,409 bytes ≈ 119 KB

160,132 / 119,409 ≈ 1.3
The Polars footprint is 1.3 times smaller
Polars lazy API
Same as Polars eager:
if you collect the entire DataFrame as we did here, the lazy API does not reduce the memory footprint
Here I am using a jeopardy dataset that Alex uses in his pandas course
Shape: 216,930 rows, 7 columns
The goal is to subset this DataFrame for the history category and get the shape of the resulting DataFrame
(349, 7)
To create a LazyFrame from a file, use one of the scan_* methods instead of the read_* methods
1.17 s ± 5.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
778 ms ± 30.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
67.7 ms ± 3.87 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Polars here brings a speedup of 1.5
The lazy API ramps that up to a speedup of 17
Index 132
Show Number 1735440
Air Date 1735440
Round 1735440
Category 1735440
Value 1735440
Question 1735440
Answer 1735440
dtype: int64
132 + 1,735,440 * 7 =
12,148,212 bytes
≈ 12 MB
31,606,250 bytes
≈ 31 MB
31,606,250 / 12,148,212 ≈ 2.6
Here the Polars footprint is 2.6 times bigger, but note that pandas memory_usage() without deep=True counts only the 8-byte pointers of string columns, not the strings themselves, so the pandas figure understates its true footprint
Let’s play with data from GBIF, the website for free and open access to biodiversity data
The Southern African Bird Atlas Project 2 [1] contains a raw CSV file of 42 GB and an interpreted CSV file of 12.3 GB. Let’s look at the latter:
Shape: 25,687,526 rows, 50 columns
From that dataset, I want a list of species from the genus Passer (Old World sparrows)
Text-based formats (CSV, JSON) are not suitable for large files
Apache Parquet is a binary, machine-optimized, column-oriented file format with efficient encoding and compression and it is considered the industry-standard file format for large tabular data
| | File type | Size |
|---|---|---|
| File downloaded from GBIF | Zipped CSV | 2.2 GB |
| Uncompressed file | CSV | 12.3 GB |
| Ideal file format | Parquet | 0.5 GB |
In addition to being roughly 25 times (!!) smaller than the uncompressed CSV, the Parquet file is a lot faster to read & write
Here is a way to convert this file if it fits in memory:
If the file doesn’t fit in memory, you can use Dask (which also uses Apache Arrow) on a distributed system
With its native support for the columnar Apache Arrow format, Polars is ideal to work with Parquet files
pandas can work with Parquet files but this can lead to issues due to the inconsistent way it handles missing data (e.g. see this SO question, this issue, this post)
1min 10s ± 7.76 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Speedup: 70 / 2.47 ≈ 28
Reading Parquet files in pandas is also A LOT slower!
pandas
Index 132
gbifID 205500208
datasetKey 205500208
occurrenceID 205500208
kingdom 205500208
phylum 205500208
class 205500208
order 205500208
...
dtype: int64
132 + 205,500,208 * 50 =
10,275,010,532 bytes
≈ 10 GB
Let’s imagine that you have only 16 GB of RAM on your machine
Running a browser and a few small applications takes about 6 GB of that. This dataset is already too big to process on your laptop
If you need to play with the raw data or a larger dataset such as the eBird dataset of 1,775,781,186 rows and 50 columns (CSV file of 3 TB for the raw data), you will need A LOT of memory
In pandas or Polars:
If you only need to run queries on the data however, the Polars lazy API is the answer
You can create a LazyFrame (no impact on memory, LazyFrame returned instantly):
Run a query on it (no impact on memory, LazyFrame returned instantly):
Now, you collect the result (this uses memory and takes time to compute) and turn the DataFrame into a Python list:
Depending on the subset returned by your query, the memory usage will vary. Here, it is minuscule (79 bytes):
['Passer griseus', 'Passer diffusus', 'Passer melanurus', 'Passer domesticus', 'Passer motitensis']
With this method, as long as the data itself fits on a drive and the queries don’t return huge subsets of data, you can run queries on giant datasets such as the 3 TB eBird dataset with a small amount of RAM
Note that the queries can contain complex expressions. What matters is the size of the subset you need to collect at the end
Polars:

Reference