Lazy evaluation

Author

Marie-Hélène Burle

When it comes to high-performance computing, one of the strengths of Polars is that it supports lazy evaluation. A lazy operation returns instantly: instead of the result, you get an object that stands for the pending computation, and you can keep building on it without waiting. Moreover, when you run queries on a LazyFrame, Polars builds a query graph and optimizes it before running anything, very much the way a compiler optimizes code.

If you want to speed up your code, use lazy execution whenever possible.

Reading in data to a LazyFrame

Ideally, you want to use the lazy API from the start, when you read in the data.

In the previous examples, we used polars.read_csv to read our data. This returns a Polars DataFrame:

import polars as pl

url = "https://cdn.jsdelivr.net/npm/vega-datasets/data/disasters.csv"

df = pl.read_csv(url)
type(df)
polars.dataframe.frame.DataFrame

Instead, you can use polars.scan_csv to create a LazyFrame:

df_lazy = pl.scan_csv(url)
type(df_lazy)
polars.lazyframe.frame.LazyFrame

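Polars provides scan functions for many of the formats it can read, including scan_parquet, scan_ndjson, and scan_ipc. A few readers, such as read_excel, have no lazy counterpart.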

Converting to a LazyFrame

If you already have a DataFrame, you can create a LazyFrame from it with the polars.DataFrame.lazy method:

df_lazy = df.lazy()

Getting the results

To get results from a LazyFrame, you use polars.LazyFrame.collect.

This won’t work because a LazyFrame has no attribute shape:

df_lazy.filter(pl.col("Year") == 2001).shape
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[16], line 1
----> 1 df_lazy.filter(pl.col("Year") == 2001).shape

AttributeError: 'LazyFrame' object has no attribute 'shape'

You need to collect the result first:

df_lazy.filter(pl.col("Year") == 2001).collect().shape
(9, 3)

collect turns your LazyFrame into a DataFrame, but it only does so on the subset needed for your query:

type(df_lazy.filter(pl.col("Year") == 2001).collect())
polars.dataframe.frame.DataFrame

This allows you to work with data too big to fit in memory!

Data too big to fit in memory

Example of large dataset

Let’s play with data from the GBIF website (a free and open-access biodiversity database). The Southern African Bird Atlas Project 2 [1] dataset comes as a 12.3 GB CSV file.

The data contains 25,687,526 rows and 50 columns.

From that dataset, I want a list of species from the genus Passer (Old World sparrows).

The importance of file format

Text-based formats (CSV, JSON) are not suitable for large files. Apache Parquet is a binary, machine-optimized, column-oriented file format with efficient encoding and compression and it is considered the industry-standard file format for large tabular data.

I converted the file to Parquet and copied it to our training cluster at /project/def-sponsor00/data/sa_birds.parquet.

Here is how much smaller the Parquet file is:

                            File type    Size
File downloaded from GBIF   Zipped CSV   2.2 GB
Uncompressed file           CSV          12.3 GB
Ideal file format           Parquet      0.5 GB

In addition to being about 25 times (!!) smaller than the uncompressed CSV, the Parquet file is a lot faster to read and write. With its native support for the columnar Apache Arrow format, Polars is well suited to working with Parquet files.

Now, while that file is small on disk, it cannot stay compressed once you read it into Python: in memory, it expands back to roughly its full size of 12 GB. That is a lot more memory than we have on the training cluster (our current JupyterHub session only has 3600 MB!).

What are your options?

When the data is too big to fit in memory, you can:

  • read in data only for the columns you are interested in,
  • read and process the file in chunks and try to combine the results,
  • use the Polars lazy API.

Polars lazy API is the best solution

Create a LazyFrame

You can create a LazyFrame (no impact on memory, LazyFrame returned instantly):

df_pl_lazy = pl.scan_parquet('/project/def-sponsor00/data/sa_birds.parquet')

Inspect it by collecting wisely

You can inspect your LazyFrame by collecting small queries on it. Be careful not to collect more than your memory can hold, and don’t try to collect the whole LazyFrame: this would defeat the purpose of using the lazy API and end in an OOM (out-of-memory) crash.

For instance, if you want to print the first few rows (and also get info on the column names and their data types), you can do so without any problem. The data contains over 25 million rows, which is way too big for our current memory, but a few rows are of course more than fine:

df_pl_lazy.head().collect()
shape: (5, 50)
gbifID datasetKey occurrenceID kingdom phylum class order family genus species infraspecificEpithet taxonRank scientificName verbatimScientificName verbatimScientificNameAuthorship countryCode locality stateProvince occurrenceStatus individualCount publishingOrgKey decimalLatitude decimalLongitude coordinateUncertaintyInMeters coordinatePrecision elevation elevationAccuracy depth depthAccuracy eventDate day month year taxonKey speciesKey basisOfRecord institutionCode collectionCode catalogNumber recordNumber identifiedBy dateIdentified license rightsHolder recordedBy typeStatus establishmentMeans lastInterpreted mediaType issue
i64 str str str str str str str str str str str str str str str str str str str str f64 f64 str str str str str str str i64 i64 i64 i64 i64 str str str str str str str str str str str str str str str
3867289255 "906e6978-e292-4a8b-9c39-adf6bb… "urn:fiao:sabap2:fullprot:rid18… "Animalia" "Chordata" "Aves" "Passeriformes" "Muscicapidae" "Bradornis" "Bradornis pallidus" null "SPECIES" "Bradornis pallidus (J.W.von Mü… null null "ZA" null "Mpumalanga" "PRESENT" null "dd862d06-e6e9-4ab9-bc86-c875cc… -24.79125 31.457917 null null null null null null "2022-07-21" 21 7 2022 2492639 2492639 "HUMAN_OBSERVATION" "FIAO" "SABAP2" "urn:fiao:sabap2:fullprot:rid18… null "Mr L Hes" null "CC_BY_4_0" null "Mr L Hes" null null "2026-03-07T10:46:26.735Z" null "COORDINATE_ROUNDED;GEODETIC_DA…
2341252758 "906e6978-e292-4a8b-9c39-adf6bb… "urn:fiao:sabap2:fullprot:rid25… "Animalia" "Chordata" "Aves" "Charadriiformes" "Burhinidae" "Burhinus" "Burhinus capensis" null "SPECIES" "Burhinus capensis (M.H.K.Licht… null null "ZA" null "Limpopo" "PRESENT" null "dd862d06-e6e9-4ab9-bc86-c875cc… -23.874583 29.457917 null null null null null null "2011-01-07" 7 1 2011 2482097 2482097 "HUMAN_OBSERVATION" "FIAO" "SABAP2" "urn:fiao:sabap2:fullprot:rid25… null "Prof J Pretorius" null "CC_BY_4_0" null "Prof J Pretorius" null null "2026-03-07T10:46:53.556Z" null "COORDINATE_ROUNDED;GEODETIC_DA…
3867442137 "906e6978-e292-4a8b-9c39-adf6bb… "urn:fiao:sabap2:fullprot:rid18… "Animalia" "Chordata" "Aves" "Passeriformes" "Platysteiridae" "Batis" "Batis molitor" null "SPECIES" "Batis molitor (Kuster, 1836)" null null "ZA" null "Limpopo" "PRESENT" null "dd862d06-e6e9-4ab9-bc86-c875cc… -23.957917 31.124583 null null null null null null "2022-07-20" 20 7 2022 5231186 5231186 "HUMAN_OBSERVATION" "FIAO" "SABAP2" "urn:fiao:sabap2:fullprot:rid18… null "Mr P Verster" null "CC_BY_4_0" null "Mr P Verster" null null "2026-03-07T10:46:26.735Z" null "COORDINATE_ROUNDED;GEODETIC_DA…
2347570158 "906e6978-e292-4a8b-9c39-adf6bb… "urn:fiao:sabap2:fullprot:rid88… "Animalia" "Chordata" "Aves" "Coraciiformes" "Alcedinidae" "Halcyon" "Halcyon leucocephala" null "SPECIES" "Halcyon leucocephala (P.L.S.Mü… null null "ZA" null "Limpopo" "PRESENT" null "dd862d06-e6e9-4ab9-bc86-c875cc… -24.29125 30.624583 null null null null null null "2016-10-17" 17 10 2016 5228304 5228304 "HUMAN_OBSERVATION" "FIAO" "SABAP2" "urn:fiao:sabap2:fullprot:rid88… null "Mr R Hawkins" null "CC_BY_4_0" null "Mr R Hawkins" null null "2026-03-07T10:47:07.113Z" null "COORDINATE_ROUNDED;GEODETIC_DA…
3867442155 "906e6978-e292-4a8b-9c39-adf6bb… "urn:fiao:sabap2:fullprot:rid18… "Animalia" "Chordata" "Aves" "Passeriformes" "Malaconotidae" "Chlorophoneus" "Chlorophoneus sulfureopectus" null "SPECIES" "Chlorophoneus sulfureopectus (… null null "ZA" null "Limpopo" "PRESENT" null "dd862d06-e6e9-4ab9-bc86-c875cc… -22.374583 31.207917 null null null null null null "2022-07-18" 18 7 2022 5845131 5845131 "HUMAN_OBSERVATION" "FIAO" "SABAP2" "urn:fiao:sabap2:fullprot:rid18… null "Mr R Hawkins" null "CC_BY_4_0" null "Mr R Hawkins" null null "2026-03-07T10:46:26.736Z" null "COORDINATE_ROUNDED;GEODETIC_DA…

Note that you cannot call sample on a LazyFrame:

import random
df_pl_lazy.sample(5).collect()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[21], line 2
      1 import random
----> 2 df_pl_lazy.sample(5).collect()

AttributeError: 'LazyFrame' object has no attribute 'sample'

That’s because Polars would have to access the whole data to draw a random sample.

If you only want to print the column names without collecting any rows at all, you can collect the schema and get the column names from it:

print(df_pl_lazy.collect_schema().names())
['gbifID', 'datasetKey', 'occurrenceID', 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species', 'infraspecificEpithet', 'taxonRank', 'scientificName', 'verbatimScientificName', 'verbatimScientificNameAuthorship', 'countryCode', 'locality', 'stateProvince', 'occurrenceStatus', 'individualCount', 'publishingOrgKey', 'decimalLatitude', 'decimalLongitude', 'coordinateUncertaintyInMeters', 'coordinatePrecision', 'elevation', 'elevationAccuracy', 'depth', 'depthAccuracy', 'eventDate', 'day', 'month', 'year', 'taxonKey', 'speciesKey', 'basisOfRecord', 'institutionCode', 'collectionCode', 'catalogNumber', 'recordNumber', 'identifiedBy', 'dateIdentified', 'license', 'rightsHolder', 'recordedBy', 'typeStatus', 'establishmentMeans', 'lastInterpreted', 'mediaType', 'issue']

Avoid the columns attribute on a LazyFrame, as it triggers the following performance warning:

Determining the column names of a LazyFrame requires resolving its schema,
which is a potentially expensive operation.
Use `LazyFrame.collect_schema().names()` to get the column names without this warning.

Write a query on the LazyFrame

Run a query on it (no impact on memory, LazyFrame returned instantly):

passer_df_lazy = df_pl_lazy.filter(
    pl.col('genus') == 'Passer'
).select(pl.col('species')).unique()

Collect the result

Now, you collect the result (this uses memory and takes time to compute) and turn the DataFrame into a Python list:

passer_ls = passer_df_lazy.collect().get_column('species').to_list()

passer_ls
['Passer motitensis',
 'Passer griseus',
 'Passer diffusus',
 'Passer domesticus',
 'Passer melanurus']

With this method, as long as the data itself fits on a drive and the queries don’t return huge subsets of data, you can run queries on giant datasets such as the 3 TB eBird dataset with a very small amount of RAM.

RAM actually needed

I did a little experiment on a cluster: I ran the previous code on a single CPU core and gradually reduced the amount of memory requested from Slurm until I got an OOM error.

I got the result with as little as 150 MB of memory.

References

1. GBIF.org (2026) Occurrence download