import polars as pl
url = "https://cdn.jsdelivr.net/npm/vega-datasets/data/disasters.csv"
df = pl.read_csv(url)
type(df)
polars.dataframe.frame.DataFrame
Marie-Hélène Burle
When it comes to high-performance computing, one of the strengths of Polars is that it supports lazy evaluation. Lazy evaluation instantly returns a future that can be used without waiting for the result of the computation. Moreover, when you run queries on a LazyFrame, Polars builds a query graph and optimizes it before running anything, much as a compiler optimizes code.
If you want to speed up your code, use lazy execution whenever possible.
Ideally, you want to use the lazy API from the start, when you read in the data.
In the previous examples, we used polars.read_csv to read our data. This returns a Polars DataFrame:
polars.dataframe.frame.DataFrame
Instead, you can use polars.scan_csv to create a LazyFrame:
There are scan functions for all the IO methods Polars offers.
If you already have a DataFrame, you can create a LazyFrame from it with the polars.DataFrame.lazy method:
To get results from a LazyFrame, you use polars.LazyFrame.collect.
This won’t work because a LazyFrame has no attribute shape:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[16], line 1
----> 1 df_lazy.filter(pl.col("Year") == 2001).shape

AttributeError: 'LazyFrame' object has no attribute 'shape'
You need to collect the result first:
collect turns your LazyFrame into a DataFrame, but it only does so on the subset needed for your query:
This allows you to work with data too big to fit in memory!
Let’s play with data from the GBIF website (a free, open-access biodiversity database). The Southern African Bird Atlas Project 2 [1] dataset contains a CSV file of 12.3 GB.
The data contains 25,687,526 rows and 50 columns.
From that dataset, I want a list of species from the genus Passer (Old World sparrows).
Text-based formats (CSV, JSON) are not suitable for large files. Apache Parquet is a binary, machine-optimized, column-oriented file format with efficient encoding and compression and it is considered the industry-standard file format for large tabular data.
I converted the file to Parquet and copied it to our training cluster at /project/def-sponsor00/data/sa_birds.parquet.
Here is how much smaller the Parquet file is:
| | File type | Size |
|---|---|---|
| File downloaded from GBIF | Zipped CSV | 2.2 GB |
| Uncompressed file | CSV | 12.3 GB |
| Ideal file format | Parquet | 0.5 GB |
In addition to being roughly 25 times (!!) smaller than the uncompressed CSV (12.3 GB vs 0.5 GB), it is a lot faster to read and write. With its native support for the columnar Apache Arrow format, Polars is well suited to working with Parquet files.
Now, while that file is very small, it cannot remain compressed once you read it into Python: it returns to roughly its full size of 12 GB. That is a lot more memory than we have on the training cluster (we only have 3600 MB in our current JupyterHub session!).
When the data is too big to fit in memory, you can:
You can create a LazyFrame (no impact on memory, LazyFrame returned instantly):
You can inspect your LazyFrame by collecting small queries on it. Of course, don’t query more of it at a time than your memory can hold, and don’t try to collect the whole LazyFrame: this would defeat the purpose of using the lazy API and you would run into an out-of-memory (OOM) crash.
For instance, if you want to print the first few rows (and also get info on the column names and their data types), you can do so without any problem. The data contains over 25 million rows, which is far too much for our current memory, but a few rows are of course more than fine:
| gbifID | datasetKey | occurrenceID | kingdom | phylum | class | order | family | genus | species | infraspecificEpithet | taxonRank | scientificName | verbatimScientificName | verbatimScientificNameAuthorship | countryCode | locality | stateProvince | occurrenceStatus | individualCount | publishingOrgKey | decimalLatitude | decimalLongitude | coordinateUncertaintyInMeters | coordinatePrecision | elevation | elevationAccuracy | depth | depthAccuracy | eventDate | day | month | year | taxonKey | speciesKey | basisOfRecord | institutionCode | collectionCode | catalogNumber | recordNumber | identifiedBy | dateIdentified | license | rightsHolder | recordedBy | typeStatus | establishmentMeans | lastInterpreted | mediaType | issue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | f64 | f64 | str | str | str | str | str | str | str | i64 | i64 | i64 | i64 | i64 | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str |
| 3867289255 | "906e6978-e292-4a8b-9c39-adf6bb… | "urn:fiao:sabap2:fullprot:rid18… | "Animalia" | "Chordata" | "Aves" | "Passeriformes" | "Muscicapidae" | "Bradornis" | "Bradornis pallidus" | null | "SPECIES" | "Bradornis pallidus (J.W.von Mü… | null | null | "ZA" | null | "Mpumalanga" | "PRESENT" | null | "dd862d06-e6e9-4ab9-bc86-c875cc… | -24.79125 | 31.457917 | null | null | null | null | null | null | "2022-07-21" | 21 | 7 | 2022 | 2492639 | 2492639 | "HUMAN_OBSERVATION" | "FIAO" | "SABAP2" | "urn:fiao:sabap2:fullprot:rid18… | null | "Mr L Hes" | null | "CC_BY_4_0" | null | "Mr L Hes" | null | null | "2026-03-07T10:46:26.735Z" | null | "COORDINATE_ROUNDED;GEODETIC_DA… |
| 2341252758 | "906e6978-e292-4a8b-9c39-adf6bb… | "urn:fiao:sabap2:fullprot:rid25… | "Animalia" | "Chordata" | "Aves" | "Charadriiformes" | "Burhinidae" | "Burhinus" | "Burhinus capensis" | null | "SPECIES" | "Burhinus capensis (M.H.K.Licht… | null | null | "ZA" | null | "Limpopo" | "PRESENT" | null | "dd862d06-e6e9-4ab9-bc86-c875cc… | -23.874583 | 29.457917 | null | null | null | null | null | null | "2011-01-07" | 7 | 1 | 2011 | 2482097 | 2482097 | "HUMAN_OBSERVATION" | "FIAO" | "SABAP2" | "urn:fiao:sabap2:fullprot:rid25… | null | "Prof J Pretorius" | null | "CC_BY_4_0" | null | "Prof J Pretorius" | null | null | "2026-03-07T10:46:53.556Z" | null | "COORDINATE_ROUNDED;GEODETIC_DA… |
| 3867442137 | "906e6978-e292-4a8b-9c39-adf6bb… | "urn:fiao:sabap2:fullprot:rid18… | "Animalia" | "Chordata" | "Aves" | "Passeriformes" | "Platysteiridae" | "Batis" | "Batis molitor" | null | "SPECIES" | "Batis molitor (Kuster, 1836)" | null | null | "ZA" | null | "Limpopo" | "PRESENT" | null | "dd862d06-e6e9-4ab9-bc86-c875cc… | -23.957917 | 31.124583 | null | null | null | null | null | null | "2022-07-20" | 20 | 7 | 2022 | 5231186 | 5231186 | "HUMAN_OBSERVATION" | "FIAO" | "SABAP2" | "urn:fiao:sabap2:fullprot:rid18… | null | "Mr P Verster" | null | "CC_BY_4_0" | null | "Mr P Verster" | null | null | "2026-03-07T10:46:26.735Z" | null | "COORDINATE_ROUNDED;GEODETIC_DA… |
| 2347570158 | "906e6978-e292-4a8b-9c39-adf6bb… | "urn:fiao:sabap2:fullprot:rid88… | "Animalia" | "Chordata" | "Aves" | "Coraciiformes" | "Alcedinidae" | "Halcyon" | "Halcyon leucocephala" | null | "SPECIES" | "Halcyon leucocephala (P.L.S.Mü… | null | null | "ZA" | null | "Limpopo" | "PRESENT" | null | "dd862d06-e6e9-4ab9-bc86-c875cc… | -24.29125 | 30.624583 | null | null | null | null | null | null | "2016-10-17" | 17 | 10 | 2016 | 5228304 | 5228304 | "HUMAN_OBSERVATION" | "FIAO" | "SABAP2" | "urn:fiao:sabap2:fullprot:rid88… | null | "Mr R Hawkins" | null | "CC_BY_4_0" | null | "Mr R Hawkins" | null | null | "2026-03-07T10:47:07.113Z" | null | "COORDINATE_ROUNDED;GEODETIC_DA… |
| 3867442155 | "906e6978-e292-4a8b-9c39-adf6bb… | "urn:fiao:sabap2:fullprot:rid18… | "Animalia" | "Chordata" | "Aves" | "Passeriformes" | "Malaconotidae" | "Chlorophoneus" | "Chlorophoneus sulfureopectus" | null | "SPECIES" | "Chlorophoneus sulfureopectus (… | null | null | "ZA" | null | "Limpopo" | "PRESENT" | null | "dd862d06-e6e9-4ab9-bc86-c875cc… | -22.374583 | 31.207917 | null | null | null | null | null | null | "2022-07-18" | 18 | 7 | 2022 | 5845131 | 5845131 | "HUMAN_OBSERVATION" | "FIAO" | "SABAP2" | "urn:fiao:sabap2:fullprot:rid18… | null | "Mr R Hawkins" | null | "CC_BY_4_0" | null | "Mr R Hawkins" | null | null | "2026-03-07T10:46:26.736Z" | null | "COORDINATE_ROUNDED;GEODETIC_DA… |
Note that you cannot run df_pl_lazy.sample on a LazyFrame:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[21], line 2
      1 import random
----> 2 df_pl_lazy.sample(5).collect()

AttributeError: 'LazyFrame' object has no attribute 'sample'
That’s because Polars would have to access the whole data to draw a random sample.
If you only want to print the column names without having to collect any rows at all, you can collect the schema and get the column names from it:
['gbifID', 'datasetKey', 'occurrenceID', 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species', 'infraspecificEpithet', 'taxonRank', 'scientificName', 'verbatimScientificName', 'verbatimScientificNameAuthorship', 'countryCode', 'locality', 'stateProvince', 'occurrenceStatus', 'individualCount', 'publishingOrgKey', 'decimalLatitude', 'decimalLongitude', 'coordinateUncertaintyInMeters', 'coordinatePrecision', 'elevation', 'elevationAccuracy', 'depth', 'depthAccuracy', 'eventDate', 'day', 'month', 'year', 'taxonKey', 'speciesKey', 'basisOfRecord', 'institutionCode', 'collectionCode', 'catalogNumber', 'recordNumber', 'identifiedBy', 'dateIdentified', 'license', 'rightsHolder', 'recordedBy', 'typeStatus', 'establishmentMeans', 'lastInterpreted', 'mediaType', 'issue']
Do not use the columns attribute on the LazyFrame as it will raise the following performance warning:
Determining the column names of a LazyFrame requires resolving its schema,
which is a potentially expensive operation.
Use `LazyFrame.collect_schema().names()` to get the column names without this warning.
Run a query on it (no impact on memory, LazyFrame returned instantly):
Now, you collect the result (this uses memory and takes time to compute) and turn the DataFrame into a Python list:
['Passer motitensis',
'Passer griseus',
'Passer diffusus',
'Passer domesticus',
'Passer melanurus']
With this method, as long as the data itself fits on a drive and the queries don’t return huge subsets of data, you can run queries on giant datasets such as the 3 TB eBird dataset with a very small amount of RAM.
I did a little experiment on a cluster: I ran the previous code on a single CPU core and gradually reduced the amount of memory requested from Slurm until I got an OOM error.
I got the result with as little as 150 MB of memory.