| gbifID | datasetKey | occurrenceID | kingdom | phylum | class | order | family | genus | species | infraspecificEpithet | taxonRank | scientificName | verbatimScientificName | verbatimScientificNameAuthorship | countryCode | locality | stateProvince | occurrenceStatus | individualCount | publishingOrgKey | decimalLatitude | decimalLongitude | coordinateUncertaintyInMeters | coordinatePrecision | elevation | elevationAccuracy | depth | depthAccuracy | eventDate | day | month | year | taxonKey | speciesKey | basisOfRecord | institutionCode | collectionCode | catalogNumber | recordNumber | identifiedBy | dateIdentified | license | rightsHolder | recordedBy | typeStatus | establishmentMeans | lastInterpreted | mediaType | issue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | f64 | f64 | str | str | str | str | str | str | str | i64 | i64 | i64 | i64 | i64 | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str |
| 3867289255 | "906e6978-e292-4a8b-9c39-adf6bb… | "urn:fiao:sabap2:fullprot:rid18… | "Animalia" | "Chordata" | "Aves" | "Passeriformes" | "Muscicapidae" | "Bradornis" | "Bradornis pallidus" | null | "SPECIES" | "Bradornis pallidus (J.W.von Mü… | null | null | "ZA" | null | "Mpumalanga" | "PRESENT" | null | "dd862d06-e6e9-4ab9-bc86-c875cc… | -24.79125 | 31.457917 | null | null | null | null | null | null | "2022-07-21" | 21 | 7 | 2022 | 2492639 | 2492639 | "HUMAN_OBSERVATION" | "FIAO" | "SABAP2" | "urn:fiao:sabap2:fullprot:rid18… | null | "Mr L Hes" | null | "CC_BY_4_0" | null | "Mr L Hes" | null | null | "2026-03-07T10:46:26.735Z" | null | "COORDINATE_ROUNDED;GEODETIC_DA… |
| 2341252758 | "906e6978-e292-4a8b-9c39-adf6bb… | "urn:fiao:sabap2:fullprot:rid25… | "Animalia" | "Chordata" | "Aves" | "Charadriiformes" | "Burhinidae" | "Burhinus" | "Burhinus capensis" | null | "SPECIES" | "Burhinus capensis (M.H.K.Licht… | null | null | "ZA" | null | "Limpopo" | "PRESENT" | null | "dd862d06-e6e9-4ab9-bc86-c875cc… | -23.874583 | 29.457917 | null | null | null | null | null | null | "2011-01-07" | 7 | 1 | 2011 | 2482097 | 2482097 | "HUMAN_OBSERVATION" | "FIAO" | "SABAP2" | "urn:fiao:sabap2:fullprot:rid25… | null | "Prof J Pretorius" | null | "CC_BY_4_0" | null | "Prof J Pretorius" | null | null | "2026-03-07T10:46:53.556Z" | null | "COORDINATE_ROUNDED;GEODETIC_DA… |
| 3867442137 | "906e6978-e292-4a8b-9c39-adf6bb… | "urn:fiao:sabap2:fullprot:rid18… | "Animalia" | "Chordata" | "Aves" | "Passeriformes" | "Platysteiridae" | "Batis" | "Batis molitor" | null | "SPECIES" | "Batis molitor (Kuster, 1836)" | null | null | "ZA" | null | "Limpopo" | "PRESENT" | null | "dd862d06-e6e9-4ab9-bc86-c875cc… | -23.957917 | 31.124583 | null | null | null | null | null | null | "2022-07-20" | 20 | 7 | 2022 | 5231186 | 5231186 | "HUMAN_OBSERVATION" | "FIAO" | "SABAP2" | "urn:fiao:sabap2:fullprot:rid18… | null | "Mr P Verster" | null | "CC_BY_4_0" | null | "Mr P Verster" | null | null | "2026-03-07T10:46:26.735Z" | null | "COORDINATE_ROUNDED;GEODETIC_DA… |
| 2347570158 | "906e6978-e292-4a8b-9c39-adf6bb… | "urn:fiao:sabap2:fullprot:rid88… | "Animalia" | "Chordata" | "Aves" | "Coraciiformes" | "Alcedinidae" | "Halcyon" | "Halcyon leucocephala" | null | "SPECIES" | "Halcyon leucocephala (P.L.S.Mü… | null | null | "ZA" | null | "Limpopo" | "PRESENT" | null | "dd862d06-e6e9-4ab9-bc86-c875cc… | -24.29125 | 30.624583 | null | null | null | null | null | null | "2016-10-17" | 17 | 10 | 2016 | 5228304 | 5228304 | "HUMAN_OBSERVATION" | "FIAO" | "SABAP2" | "urn:fiao:sabap2:fullprot:rid88… | null | "Mr R Hawkins" | null | "CC_BY_4_0" | null | "Mr R Hawkins" | null | null | "2026-03-07T10:47:07.113Z" | null | "COORDINATE_ROUNDED;GEODETIC_DA… |
| 3867442155 | "906e6978-e292-4a8b-9c39-adf6bb… | "urn:fiao:sabap2:fullprot:rid18… | "Animalia" | "Chordata" | "Aves" | "Passeriformes" | "Malaconotidae" | "Chlorophoneus" | "Chlorophoneus sulfureopectus" | null | "SPECIES" | "Chlorophoneus sulfureopectus (… | null | null | "ZA" | null | "Limpopo" | "PRESENT" | null | "dd862d06-e6e9-4ab9-bc86-c875cc… | -22.374583 | 31.207917 | null | null | null | null | null | null | "2022-07-18" | 18 | 7 | 2022 | 5845131 | 5845131 | "HUMAN_OBSERVATION" | "FIAO" | "SABAP2" | "urn:fiao:sabap2:fullprot:rid18… | null | "Mr R Hawkins" | null | "CC_BY_4_0" | null | "Mr R Hawkins" | null | null | "2026-03-07T10:46:26.736Z" | null | "COORDINATE_ROUNDED;GEODETIC_DA… |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 2342365420 | "906e6978-e292-4a8b-9c39-adf6bb… | "urn:fiao:sabap2:fullprot:rid35… | "Animalia" | "Chordata" | "Aves" | "Passeriformes" | "Alaudidae" | "Mirafra" | "Mirafra africana" | null | "SPECIES" | "Mirafra africana A.Smith, 1836" | null | null | "ZA" | null | "North West" | "PRESENT" | null | "dd862d06-e6e9-4ab9-bc86-c875cc… | -26.874583 | 26.707917 | null | null | null | null | null | null | "2012-02-24" | 24 | 2 | 2012 | 9389539 | 9389539 | "HUMAN_OBSERVATION" | "FIAO" | "SABAP2" | "urn:fiao:sabap2:fullprot:rid35… | null | "Mrs W Strauss" | null | "CC_BY_4_0" | null | "Mrs W Strauss" | null | null | "2026-03-07T10:46:22.798Z" | null | "COORDINATE_ROUNDED;GEODETIC_DA… |
| 2345700071 | "906e6978-e292-4a8b-9c39-adf6bb… | "urn:fiao:sabap2:fullprot:rid68… | "Animalia" | "Chordata" | "Aves" | "Passeriformes" | "Cisticolidae" | "Apalis" | "Apalis flavida" | null | "SPECIES" | "Apalis flavida (Strickland, 18… | null | null | "ZA" | null | "KwaZulu-Natal" | "PRESENT" | null | "dd862d06-e6e9-4ab9-bc86-c875cc… | -27.957917 | 32.374583 | null | null | null | null | null | null | "2015-06-14" | 14 | 6 | 2015 | 2492725 | 2492725 | "HUMAN_OBSERVATION" | "FIAO" | "SABAP2" | "urn:fiao:sabap2:fullprot:rid68… | null | "Mr E Marais" | null | "CC_BY_4_0" | null | "Mr E Marais" | null | null | "2026-03-07T10:46:21.251Z" | null | "COORDINATE_ROUNDED;GEODETIC_DA… |
| 2342366813 | "906e6978-e292-4a8b-9c39-adf6bb… | "urn:fiao:sabap2:fullprot:rid35… | "Animalia" | "Chordata" | "Aves" | "Passeriformes" | "Turdidae" | "Turdus" | "Turdus olivaceus" | null | "SPECIES" | "Turdus olivaceus Linnaeus, 176… | null | null | "ZA" | null | "Western Cape" | "PRESENT" | null | "dd862d06-e6e9-4ab9-bc86-c875cc… | -34.124583 | 19.54125 | null | null | null | null | null | null | "2012-02-27" | 27 | 2 | 2012 | 9363452 | 9363452 | "HUMAN_OBSERVATION" | "FIAO" | "SABAP2" | "urn:fiao:sabap2:fullprot:rid35… | null | "Dr S Shearer" | null | "CC_BY_4_0" | null | "Dr S Shearer" | null | null | "2026-03-07T10:46:22.798Z" | null | "COORDINATE_ROUNDED;GEODETIC_DA… |
| 2345703248 | "906e6978-e292-4a8b-9c39-adf6bb… | "urn:fiao:sabap2:fullprot:rid68… | "Animalia" | "Chordata" | "Aves" | "Coraciiformes" | "Alcedinidae" | "Ceryle" | "Ceryle rudis" | null | "SPECIES" | "Ceryle rudis (Linnaeus, 1758)" | null | null | "ZA" | null | "KwaZulu-Natal" | "PRESENT" | null | "dd862d06-e6e9-4ab9-bc86-c875cc… | -28.79125 | 31.957917 | null | null | null | null | null | null | "2015-05-09" | 9 | 5 | 2015 | 2475679 | 2475679 | "HUMAN_OBSERVATION" | "FIAO" | "SABAP2" | "urn:fiao:sabap2:fullprot:rid68… | null | "Mr JA Gouws" | null | "CC_BY_4_0" | null | "Mr JA Gouws" | null | null | "2026-03-07T10:46:21.252Z" | null | "COORDINATE_ROUNDED;GEODETIC_DA… |
| 2342367593 | "906e6978-e292-4a8b-9c39-adf6bb… | "urn:fiao:sabap2:fullprot:rid35… | "Animalia" | "Chordata" | "Aves" | "Passeriformes" | "Cisticolidae" | "Cisticola" | "Cisticola juncidis" | null | "SPECIES" | "Cisticola juncidis (Rafinesque… | null | null | "ZA" | null | "Mpumalanga" | "PRESENT" | null | "dd862d06-e6e9-4ab9-bc86-c875cc… | -25.79125 | 30.124583 | null | null | null | null | null | null | "2012-02-24" | 24 | 2 | 2012 | 2492822 | 2492822 | "HUMAN_OBSERVATION" | "FIAO" | "SABAP2" | "urn:fiao:sabap2:fullprot:rid35… | null | "Mr G Lockwood" | null | "CC_BY_4_0" | null | "Mr G Lockwood" | null | null | "2026-03-07T10:46:22.799Z" | null | "COORDINATE_ROUNDED;GEODETIC_DA… |
Polars GPU engine
In this section, we look at the basics of using Polars on GPU with RAPIDS cuDF.
Polars
Polars is an online analytical processing (OLAP) query engine for DataFrames in Python.
It is a newer, faster, and better option than pandas: it allows automatic multithreading and lazy evaluation, it builds on Apache Arrow to store data in memory, it has a clearer syntax and a consistent handling of missing data.
You can find more information in our introductory course and webinar on Polars as well as our webinar comparing it to pandas.
Below is an example with data from the GBIF website (free and open access biodiversity database). The Southern African Bird Atlas Project 2 [1] contains a CSV file of 12.3 GB.
Shape of the DataFrame: 25,687,526 rows, 50 columns.
I converted the CSV file into a (much!) better file format: Apache Parquet, a binary, machine-optimized, column-oriented file format with efficient encoding and compression, ideal for large tabular data. And I copied it to the training cluster, in a directory we can all read from.
Because of the size of the data and the small amount of memory available in our training cluster, if you try to read the file into a Polars DataFrame, the kernel will die with an out-of-memory (OOM) error (you can try it to see what an OOM problem looks like in a Jupyter notebook: there is not a lot of feedback!).
The timing and result below are on my machine:
import polars as pl
df = pl.read_parquet('/project/def-sponsor00/data/sa_birds.parquet')
df
From that dataset, I want a list of species from the genus Passer (Old World sparrows):
(
df.filter(pl.col('genus') == 'Passer')
.select(pl.col('species'))
.unique()
)
| species |
|---|
| str |
| "Passer griseus" |
| "Passer domesticus" |
| "Passer motitensis" |
| "Passer melanurus" |
| "Passer diffusus" |
Now, we can time this query:
%%timeit
(
df.filter(pl.col('genus') == 'Passer')
.select(pl.col('species'))
.unique()
)
234 ms ± 15.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Polars lazy API
One of the great strengths of Polars is its lazy API: when you create a LazyFrame, Polars doesn’t run the code eagerly, one operation at a time; instead, it creates a query plan (a graph) that only gets resolved when you collect the result (in the form of a classic Polars DataFrame).
This allows for optimizations and fusions of operations. It also prevents the creation of intermediate objects that take up space in memory. Finally, it allows you to run queries on datasets too big to fit in memory.
Here is what it looks like (this one runs without any issue on the training cluster!):
import polars as pl
df = pl.scan_parquet('/project/def-sponsor00/data/sa_birds.parquet')
(
df.filter(pl.col('genus') == 'Passer')
.select(pl.col('species'))
.unique()
.collect()
)
| species |
|---|
| str |
| "Passer griseus" |
| "Passer diffusus" |
| "Passer motitensis" |
| "Passer domesticus" |
| "Passer melanurus" |
We can time it too:
%%timeit
(
df.filter(pl.col('genus') == 'Passer')
.select(pl.col('species'))
.unique()
.collect()
)
79 ms ± 2.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The speedup is modest (a factor of about 3), but the main advantage is the much lower memory requirement.
When to use the lazy API?
- Use it whenever you are dealing with large datasets.
- It is the only option for working out-of-core (data too big to fit in memory).
Here, it makes perfect sense to use the lazy API: the data is very big.
Should you always try to use the lazy API?
Yes! Unless you are dealing with tiny DataFrames, it is always advantageous to use the lazy API. It will either speed computations up or save you memory, or both, depending on the situation. And as we just saw, it will allow you to run queries that you would otherwise not be able to run with the available memory.
Polars on GPU
The GPU engine builds on the lazy API: Polars dispatches the query plan to RAPIDS cuDF for execution on the GPU (if possible). The collected result is returned in the form of a classic Polars DataFrame on the CPU:
df = pl.scan_parquet('/project/def-sponsor00/data/sa_birds.parquet')
(
df.filter(pl.col('genus') == 'Passer')
.select(pl.col('species'))
.unique()
.collect(engine='gpu')
)
| species |
|---|
| str |
| "Passer domesticus" |
| "Passer motitensis" |
| "Passer diffusus" |
| "Passer melanurus" |
| "Passer griseus" |
And the timing:
%%timeit
(
df.filter(pl.col('genus') == 'Passer')
.select(pl.col('species'))
.unique()
.collect(engine='gpu')
)
95.7 ms ± 378 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
As you can see, the query is actually slightly slower on the GPU here.
Did the query actually run on the GPU? (Remember that Polars cuDF will, by default, quietly fall back to the CPU for queries that are not yet supported on the GPU.)
One way to check is to run the query in verbose mode:
with pl.Config() as cfg:
    cfg.set_verbose(True)
    (
        df.filter(pl.col('genus') == 'Passer')
        .select(pl.col('species'))
        .unique()
        .collect(engine='gpu')
    )
No warning means that the query was run on the GPU.
We will discuss the other method in the next section.
When to use GPUs?
- When the computations are intensive and can benefit from vast parallelization.
- Simple queries on very large data won’t benefit from GPUs because the cost of transferring the data to and from the GPU is large and the benefit is small (particularly since Polars already runs computations on multiple threads, so there is already some parallelization).
Should you always try to use GPUs?
No! Benchmarks done by the Polars team show that queries heavy in grouped aggregations and joins benefit most from the GPU engine. By contrast, queries dominated by I/O show similar speeds on CPU and GPU (as we just saw).
Configurations
Default configuration
The default configuration works in most cases. To use it, as we saw, simply pass engine="gpu" in the collect method.
Configuration options
You can create a GPU engine thanks to polars.GPUEngine and pass options to it.
Example:
- Use the in-memory executor (default is streaming),
- select device 1 (if you have at least 2 GPUs; default is 0),
- raise an error if a query cannot run on GPU (default is to silently fall back to the CPU):
engine = pl.GPUEngine(
executor="in-memory",
device=1,
raise_on_fail=True
)
You then pass this engine as the value of the engine argument of the collect method: <your-query-as-a-lazy-frame>.collect(engine=engine).
The second method to ensure that our computations ran on the GPU (and did not silently fall back to the CPU) is to change the configuration options.
Your turn:
Can you write the code that would test this?
Executors
Streaming
Streaming splits the data into partitions that are streamed through the query graph. Because it scales best and works very well on Parquet files, this is what you want to use for large data.
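The idea of processing data one partition at a time can be illustrated in plain Python. This is only a conceptual sketch and not how the GPU engine is implemented: the rows are invented, and the parameter name max_rows_per_partition is used here purely as an illustrative name. Each partition is filtered on its own, and only the small partial result is kept between partitions:

```python
from typing import Iterator

Row = tuple[str, str]  # (genus, species)


def partitions(rows: list[Row], max_rows_per_partition: int) -> Iterator[list[Row]]:
    """Yield consecutive slices of at most max_rows_per_partition rows."""
    for start in range(0, len(rows), max_rows_per_partition):
        yield rows[start:start + max_rows_per_partition]


def unique_species(rows: list[Row], genus: str, max_rows_per_partition: int) -> set[str]:
    """Filter and deduplicate one partition at a time, merging partial results."""
    seen: set[str] = set()
    for part in partitions(rows, max_rows_per_partition):
        seen.update(species for g, species in part if g == genus)
    return seen


# Hypothetical rows for illustration
rows = [
    ("Passer", "Passer griseus"),
    ("Turdus", "Turdus olivaceus"),
    ("Passer", "Passer domesticus"),
    ("Passer", "Passer griseus"),
    ("Batis", "Batis molitor"),
]

result = unique_species(rows, "Passer", max_rows_per_partition=2)
print(sorted(result))
```

Only one partition and the running set of unique species are held at any time, which is why this style of execution scales to data larger than the available memory.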
Single GPU
This is the default of the Polars GPU engine.
Using .collect(engine="gpu") as we did earlier is equivalent to creating the following engines:
engine=pl.GPUEngine()
# or the following with the default options
engine=pl.GPUEngine(executor="streaming", executor_options={"cluster": "single"})
and then using them with .collect(engine=engine).
You can pass additional options to the executor:
engine = pl.GPUEngine(
executor_options={"max_rows_per_partition": 1_000_000}
)
When using Parquet files, you can pass additional Parquet options:
engine = pl.GPUEngine(
parquet_options={
'chunked': True,
'chunk_read_limit': int(1e9),
'pass_read_limit': int(4e9)
}
)
Multiple GPUs
If you want to use multiple GPUs, you can create a distributed streaming executor:
engine = pl.GPUEngine(executor_options={"cluster": "distributed"})
In-memory
If you have small data that fits easily in memory, you can use the in-memory executor, which has less overhead. Be careful, however: it will not scale well:
engine = pl.GPUEngine(executor="in-memory")