Subsetting data

Author

Marie-Hélène Burle

The syntax to subset data is very different in Polars compared to the indexing of pandas and other languages. Action verbs are used in a style very similar to that of R’s dplyr from the tidyverse.

Let’s read in another online dataset from Vega-Altair into a DataFrame:

import polars as pl

df = pl.read_csv("https://cdn.jsdelivr.net/npm/vega-datasets/data/disasters.csv")

df
shape: (803, 3)
Entity Year Deaths
str i64 i64
"All natural disasters" 1900 1267360
"All natural disasters" 1901 200018
"All natural disasters" 1902 46037
"All natural disasters" 1903 6506
"All natural disasters" 1905 22758
"Wildfire" 2013 35
"Wildfire" 2014 16
"Wildfire" 2015 67
"Wildfire" 2016 39
"Wildfire" 2017 75

Selecting rows

You can create a new DataFrame with a subset of rows matching some condition with polars.DataFrame.filter.

Let’s select rows for the year 2001. For this, we select the column Year by its name with polars.col and return the rows when the values for that column equal 2001:

df_sub_row = df.filter(pl.col("Year") == 2001)

df_sub_row
shape: (9, 3)
Entity Year Deaths
str i64 i64
"All natural disasters" 2001 39493
"Drought" 2001 99
"Earthquake" 2001 21348
"Epidemic" 2001 8515
"Extreme temperature" 2001 1787
"Extreme weather" 2001 1911
"Flood" 2001 5014
"Landslide" 2001 786
"Wildfire" 2001 33

You can combine multiple conditions:

df_sub_row = df.filter(
    pl.col("Year") == 2001,
    pl.col("Entity") == "Flood"
    )

df_sub_row
shape: (1, 3)
Entity Year Deaths
str i64 i64
"Flood" 2001 5014

Selecting columns

To select columns (variables), you use polars.DataFrame.select:

df_sub_col = df.select(
    pl.col("Entity"),
    pl.col("Year")
    )

df_sub_col
shape: (803, 2)
Entity Year
str i64
"All natural disasters" 1900
"All natural disasters" 1901
"All natural disasters" 1902
"All natural disasters" 1903
"All natural disasters" 1905
"Wildfire" 2013
"Wildfire" 2014
"Wildfire" 2015
"Wildfire" 2016
"Wildfire" 2017

Note that if you select a single column, you still have a DataFrame:

df_onecol = df.select(pl.col("Entity"))

df_onecol
shape: (803, 1)
Entity
str
"All natural disasters"
"All natural disasters"
"All natural disasters"
"All natural disasters"
"All natural disasters"
"Wildfire"
"Wildfire"
"Wildfire"
"Wildfire"
"Wildfire"
type(df_onecol)
polars.dataframe.frame.DataFrame

Extracting Series

As we saw earlier, Polars DataFrames are made of Series: one-dimensional homogeneous data structures representing the columns of the DataFrames.

To extract a Series out of a DataFrames, you can use polars.DataFrame.get_column if you want to use the column name:

s_entity = df.get_column("Entity")

s_entity
shape: (803,)
Entity
str
"All natural disasters"
"All natural disasters"
"All natural disasters"
"All natural disasters"
"All natural disasters"
"Wildfire"
"Wildfire"
"Wildfire"
"Wildfire"
"Wildfire"

Or you can use polars.DataFrame.to_series if you want to extract the column by its index:

s_entity = df.to_series(0)

s_entity
shape: (803,)
Entity
str
"All natural disasters"
"All natural disasters"
"All natural disasters"
"All natural disasters"
"All natural disasters"
"Wildfire"
"Wildfire"
"Wildfire"
"Wildfire"
"Wildfire"
type(s_entity)
polars.series.series.Series

Using polars.DataFrame.unique and polars.Series.to_list, you can get a list of all the types of natural disasters in this dataset (we can then sort the list with the standard list.sort method):

disasters = s_entity.unique().to_list()
disasters.sort()

disasters
['All natural disasters',
 'Drought',
 'Earthquake',
 'Epidemic',
 'Extreme temperature',
 'Extreme weather',
 'Flood',
 'Landslide',
 'Mass movement (dry)',
 'Volcanic activity',
 'Wildfire']
type(disasters)
list

Modifying selected columns

You can modify selected columns to create new columns. The new columns can be named with polars.Expr.alias:

df_col_mod = df.select(
    pl.col("Entity"),
    pl.col("Year"),
    (pl.col("Deaths") / 1000).alias("Kilodeaths")
)

df_col_mod
shape: (803, 3)
Entity Year Kilodeaths
str i64 f64
"All natural disasters" 1900 1267.36
"All natural disasters" 1901 200.018
"All natural disasters" 1902 46.037
"All natural disasters" 1903 6.506
"All natural disasters" 1905 22.758
"Wildfire" 2013 0.035
"Wildfire" 2014 0.016
"Wildfire" 2015 0.067
"Wildfire" 2016 0.039
"Wildfire" 2017 0.075

Or you can name the new columns by assigning their values to the new names:

df_col_mod = df.select(
    pl.col("Entity"),
    pl.col("Year"),
    Kilodeaths=pl.col("Deaths") / 1000
)

df_col_mod
shape: (803, 3)
Entity Year Kilodeaths
str i64 f64
"All natural disasters" 1900 1267.36
"All natural disasters" 1901 200.018
"All natural disasters" 1902 46.037
"All natural disasters" 1903 6.506
"All natural disasters" 1905 22.758
"Wildfire" 2013 0.035
"Wildfire" 2014 0.016
"Wildfire" 2015 0.067
"Wildfire" 2016 0.039
"Wildfire" 2017 0.075

Adding modified columns

If you want to add the modified columns to the initial DataFrame (instead of selecting them), you use polars.DataFrame.with_columns. The naming works in the same way:

df_mod = df.with_columns((pl.col("Deaths") / 1000).alias("Kilodeaths"))

df_mod
shape: (803, 4)
Entity Year Deaths Kilodeaths
str i64 i64 f64
"All natural disasters" 1900 1267360 1267.36
"All natural disasters" 1901 200018 200.018
"All natural disasters" 1902 46037 46.037
"All natural disasters" 1903 6506 6.506
"All natural disasters" 1905 22758 22.758
"Wildfire" 2013 35 0.035
"Wildfire" 2014 16 0.016
"Wildfire" 2015 67 0.067
"Wildfire" 2016 39 0.039
"Wildfire" 2017 75 0.075

Or, with the alternative column naming:

df_mod = df.with_columns(Kilodeaths=pl.col("Deaths") / 1000)

df_mod
shape: (803, 4)
Entity Year Deaths Kilodeaths
str i64 i64 f64
"All natural disasters" 1900 1267360 1267.36
"All natural disasters" 1901 200018 200.018
"All natural disasters" 1902 46037 46.037
"All natural disasters" 1903 6506 6.506
"All natural disasters" 1905 22758 22.758
"Wildfire" 2013 35 0.035
"Wildfire" 2014 16 0.016
"Wildfire" 2015 67 0.067
"Wildfire" 2016 39 0.039
"Wildfire" 2017 75 0.075

Notice that our new variable got added as the last column of the DataFrame.

Group by operations

If you want to perform operations on rows sharing the same value for some variable, you group those rows with polars.DataFrame.group_by.

For instance, if we want to get the total number of deaths for each category of disaster, we can do:

totals = df.group_by(
    pl.col("Entity")
).agg(
    pl.col("Deaths").sum()
)

totals
shape: (11, 2)
Entity Deaths
str i64
"Flood" 6954992
"Extreme temperature" 182604
"All natural disasters" 32607156
"Mass movement (dry)" 5030
"Epidemic" 9596463
"Extreme weather" 1396601
"Landslide" 63068
"Wildfire" 3925
"Earthquake" 2576801
"Drought" 11731294

Notice that the rows became out of order. Not to worry about order makes the code more efficient and does not affect future subsetting of our DataFrame. If you want to maintain the order however, you can use the maintain_order parameter (but this slows down the operation):

totals = df.group_by(
    pl.col("Entity"),
    maintain_order=True
).agg(
    pl.col("Deaths").sum()
)

totals
shape: (11, 2)
Entity Deaths
str i64
"All natural disasters" 32607156
"Drought" 11731294
"Earthquake" 2576801
"Epidemic" 9596463
"Extreme temperature" 182604
"Flood" 6954992
"Landslide" 63068
"Mass movement (dry)" 5030
"Volcanic activity" 96366
"Wildfire" 3925

Your turn:

Create a new DataFrame, ordered by year, that shows the total number of deaths for each year:

shape: (117, 2)
Year Deaths
i64 i64
1900 1267360
1901 200018
1902 46037
1903 6506
1905 22758
2013 22225
2014 20882
2015 23893
2016 10201
2017 2087

Combining contexts

select, with_columns, filter, and group_by are called contexts in the Polars terminology (the data transformations performed in these contexts are called expressions).

Contexts can be combined. For instance, we can create a new DataFrame with the number of deaths for each decade:

decade_totals = df.filter(
    pl.col("Entity") == "All natural disasters"
).with_columns(
    (pl.col("Year") // 10 * 10).alias("Decade")
).group_by(
    pl.col("Decade"),
    maintain_order=True
).agg(
    pl.col("Deaths").sum()
)

decade_totals
shape: (12, 2)
Decade Deaths
i64 i64
1900 4497847
1910 3326492
1920 8726293
1930 4701024
1940 3871695
1970 986867
1980 796782
1990 527613
2000 839986
2010 454950

Or one with the number of deaths for that decade for each type of disaster:

decade_totals_by_type = df.with_columns(
    (pl.col("Year") // 10 * 10).alias("Decade"),
).group_by([pl.col("Decade"), pl.col("Entity")],
    maintain_order=True
).agg(
    pl.col("Deaths").sum()
)

decade_totals_by_type
shape: (122, 3)
Decade Entity Deaths
i64 str i64
1900 "All natural disasters" 4497847
1910 "All natural disasters" 3326492
1920 "All natural disasters" 8726293
1930 "All natural disasters" 4701024
1940 "All natural disasters" 3871695
1970 "Wildfire" 5
1980 "Wildfire" 396
1990 "Wildfire" 859
2000 "Wildfire" 629
2010 "Wildfire" 429