Subsetting data

Author

Marie-Hélène Burle

The syntax to subset data is very different in Polars compared to the indexing of pandas and other languages. Action verbs are used in a style very similar to that of R’s dplyr from the tidyverse.

Let’s read in another online dataset from Vega-Altair into a DataFrame:

import polars as pl

df = pl.read_csv("https://cdn.jsdelivr.net/npm/vega-datasets/data/disasters.csv")

print(df)
shape: (803, 3)
┌───────────────────────┬──────┬─────────┐
│ Entity                ┆ Year ┆ Deaths  │
│ ---                   ┆ ---  ┆ ---     │
│ str                   ┆ i64  ┆ i64     │
╞═══════════════════════╪══════╪═════════╡
│ All natural disasters ┆ 1900 ┆ 1267360 │
│ All natural disasters ┆ 1901 ┆ 200018  │
│ All natural disasters ┆ 1902 ┆ 46037   │
│ All natural disasters ┆ 1903 ┆ 6506    │
│ All natural disasters ┆ 1905 ┆ 22758   │
│ …                     ┆ …    ┆ …       │
│ Wildfire              ┆ 2013 ┆ 35      │
│ Wildfire              ┆ 2014 ┆ 16      │
│ Wildfire              ┆ 2015 ┆ 67      │
│ Wildfire              ┆ 2016 ┆ 39      │
│ Wildfire              ┆ 2017 ┆ 75      │
└───────────────────────┴──────┴─────────┘

Selecting rows

You can create a new DataFrame with a subset of rows matching some condition with polars.DataFrame.filter.

Let’s select rows for the year 2001. For this, we select the column Year by its name with polars.col and return the rows when the values for that column equal 2001:

df_sub_row = df.filter(pl.col("Year") == 2001)

print(df_sub_row)
shape: (9, 3)
┌───────────────────────┬──────┬────────┐
│ Entity                ┆ Year ┆ Deaths │
│ ---                   ┆ ---  ┆ ---    │
│ str                   ┆ i64  ┆ i64    │
╞═══════════════════════╪══════╪════════╡
│ All natural disasters ┆ 2001 ┆ 39493  │
│ Drought               ┆ 2001 ┆ 99     │
│ Earthquake            ┆ 2001 ┆ 21348  │
│ Epidemic              ┆ 2001 ┆ 8515   │
│ Extreme temperature   ┆ 2001 ┆ 1787   │
│ Extreme weather       ┆ 2001 ┆ 1911   │
│ Flood                 ┆ 2001 ┆ 5014   │
│ Landslide             ┆ 2001 ┆ 786    │
│ Wildfire              ┆ 2001 ┆ 33     │
└───────────────────────┴──────┴────────┘

You can combine multiple conditions:

df_sub_row = df.filter(
    pl.col("Year") == 2001,
    pl.col("Entity") == "Flood"
    )

print(df_sub_row)
shape: (1, 3)
┌────────┬──────┬────────┐
│ Entity ┆ Year ┆ Deaths │
│ ---    ┆ ---  ┆ ---    │
│ str    ┆ i64  ┆ i64    │
╞════════╪══════╪════════╡
│ Flood  ┆ 2001 ┆ 5014   │
└────────┴──────┴────────┘

Selecting columns

To select columns (variables), you use polars.DataFrame.select:

df_sub_col = df.select(
    pl.col("Entity"),
    pl.col("Year")
    )

print(df_sub_col)
shape: (803, 2)
┌───────────────────────┬──────┐
│ Entity                ┆ Year │
│ ---                   ┆ ---  │
│ str                   ┆ i64  │
╞═══════════════════════╪══════╡
│ All natural disasters ┆ 1900 │
│ All natural disasters ┆ 1901 │
│ All natural disasters ┆ 1902 │
│ All natural disasters ┆ 1903 │
│ All natural disasters ┆ 1905 │
│ …                     ┆ …    │
│ Wildfire              ┆ 2013 │
│ Wildfire              ┆ 2014 │
│ Wildfire              ┆ 2015 │
│ Wildfire              ┆ 2016 │
│ Wildfire              ┆ 2017 │
└───────────────────────┴──────┘

Note that if you select a single column, you still have a DataFrame:

df_onecol = df.select(pl.col("Entity"))

print(df_onecol)
shape: (803, 1)
┌───────────────────────┐
│ Entity                │
│ ---                   │
│ str                   │
╞═══════════════════════╡
│ All natural disasters │
│ All natural disasters │
│ All natural disasters │
│ All natural disasters │
│ All natural disasters │
│ …                     │
│ Wildfire              │
│ Wildfire              │
│ Wildfire              │
│ Wildfire              │
│ Wildfire              │
└───────────────────────┘
type(df_onecol)
polars.dataframe.frame.DataFrame

Extracting Series

As we saw earlier, Polars DataFrames are made of Series: one-dimensional homogeneous data structures representing the columns of the DataFrames.

To extract a Series out of a DataFrames, you can use polars.DataFrame.get_column if you want to use the column name:

s_entity = df.get_column("Entity")

print(s_entity)
shape: (803,)
Series: 'Entity' [str]
[
    "All natural disasters"
    "All natural disasters"
    "All natural disasters"
    "All natural disasters"
    "All natural disasters"
    …
    "Wildfire"
    "Wildfire"
    "Wildfire"
    "Wildfire"
    "Wildfire"
]

Or you can use polars.DataFrame.to_series if you want to extract the column by its index:

s_entity = df.to_series(0)

print(s_entity)
shape: (803,)
Series: 'Entity' [str]
[
    "All natural disasters"
    "All natural disasters"
    "All natural disasters"
    "All natural disasters"
    "All natural disasters"
    …
    "Wildfire"
    "Wildfire"
    "Wildfire"
    "Wildfire"
    "Wildfire"
]
type(s_entity)
polars.series.series.Series

Using polars.DataFrame.unique and polars.Series.to_list, you can get a list of all the types of natural disasters in this dataset (we can then sort the list with the standard list.sort method):

disasters = s_entity.unique().to_list()
disasters.sort()

disasters
['All natural disasters',
 'Drought',
 'Earthquake',
 'Epidemic',
 'Extreme temperature',
 'Extreme weather',
 'Flood',
 'Landslide',
 'Mass movement (dry)',
 'Volcanic activity',
 'Wildfire']
type(disasters)
list

Modifying selected columns

You can modify selected columns to create new columns. The new columns can be named with polars.Expr.alias:

df_col_mod = df.select(
    pl.col("Entity"),
    pl.col("Year"),
    (pl.col("Deaths") / 1000).alias("Kilodeaths")
)

print(df_col_mod)
shape: (803, 3)
┌───────────────────────┬──────┬────────────┐
│ Entity                ┆ Year ┆ Kilodeaths │
│ ---                   ┆ ---  ┆ ---        │
│ str                   ┆ i64  ┆ f64        │
╞═══════════════════════╪══════╪════════════╡
│ All natural disasters ┆ 1900 ┆ 1267.36    │
│ All natural disasters ┆ 1901 ┆ 200.018    │
│ All natural disasters ┆ 1902 ┆ 46.037     │
│ All natural disasters ┆ 1903 ┆ 6.506      │
│ All natural disasters ┆ 1905 ┆ 22.758     │
│ …                     ┆ …    ┆ …          │
│ Wildfire              ┆ 2013 ┆ 0.035      │
│ Wildfire              ┆ 2014 ┆ 0.016      │
│ Wildfire              ┆ 2015 ┆ 0.067      │
│ Wildfire              ┆ 2016 ┆ 0.039      │
│ Wildfire              ┆ 2017 ┆ 0.075      │
└───────────────────────┴──────┴────────────┘

Or you can name the new columns by assigning their values to the new names:

df_col_mod = df.select(
    pl.col("Entity"),
    pl.col("Year"),
    Kilodeaths=pl.col("Deaths") / 1000
)

print(df_col_mod)
shape: (803, 3)
┌───────────────────────┬──────┬────────────┐
│ Entity                ┆ Year ┆ Kilodeaths │
│ ---                   ┆ ---  ┆ ---        │
│ str                   ┆ i64  ┆ f64        │
╞═══════════════════════╪══════╪════════════╡
│ All natural disasters ┆ 1900 ┆ 1267.36    │
│ All natural disasters ┆ 1901 ┆ 200.018    │
│ All natural disasters ┆ 1902 ┆ 46.037     │
│ All natural disasters ┆ 1903 ┆ 6.506      │
│ All natural disasters ┆ 1905 ┆ 22.758     │
│ …                     ┆ …    ┆ …          │
│ Wildfire              ┆ 2013 ┆ 0.035      │
│ Wildfire              ┆ 2014 ┆ 0.016      │
│ Wildfire              ┆ 2015 ┆ 0.067      │
│ Wildfire              ┆ 2016 ┆ 0.039      │
│ Wildfire              ┆ 2017 ┆ 0.075      │
└───────────────────────┴──────┴────────────┘

Adding modified columns

If you want to add the modified columns to the initial DataFrame (instead of selecting them), you use polars.DataFrame.with_columns. The naming works in the same way:

df_mod = df.with_columns((pl.col("Deaths") / 1000).alias("Kilodeaths"))

print(df_mod)
shape: (803, 4)
┌───────────────────────┬──────┬─────────┬────────────┐
│ Entity                ┆ Year ┆ Deaths  ┆ Kilodeaths │
│ ---                   ┆ ---  ┆ ---     ┆ ---        │
│ str                   ┆ i64  ┆ i64     ┆ f64        │
╞═══════════════════════╪══════╪═════════╪════════════╡
│ All natural disasters ┆ 1900 ┆ 1267360 ┆ 1267.36    │
│ All natural disasters ┆ 1901 ┆ 200018  ┆ 200.018    │
│ All natural disasters ┆ 1902 ┆ 46037   ┆ 46.037     │
│ All natural disasters ┆ 1903 ┆ 6506    ┆ 6.506      │
│ All natural disasters ┆ 1905 ┆ 22758   ┆ 22.758     │
│ …                     ┆ …    ┆ …       ┆ …          │
│ Wildfire              ┆ 2013 ┆ 35      ┆ 0.035      │
│ Wildfire              ┆ 2014 ┆ 16      ┆ 0.016      │
│ Wildfire              ┆ 2015 ┆ 67      ┆ 0.067      │
│ Wildfire              ┆ 2016 ┆ 39      ┆ 0.039      │
│ Wildfire              ┆ 2017 ┆ 75      ┆ 0.075      │
└───────────────────────┴──────┴─────────┴────────────┘

Or, with the alternative column naming:

df_mod = df.with_columns(Kilodeaths=pl.col("Deaths") / 1000)

print(df_mod)
shape: (803, 4)
┌───────────────────────┬──────┬─────────┬────────────┐
│ Entity                ┆ Year ┆ Deaths  ┆ Kilodeaths │
│ ---                   ┆ ---  ┆ ---     ┆ ---        │
│ str                   ┆ i64  ┆ i64     ┆ f64        │
╞═══════════════════════╪══════╪═════════╪════════════╡
│ All natural disasters ┆ 1900 ┆ 1267360 ┆ 1267.36    │
│ All natural disasters ┆ 1901 ┆ 200018  ┆ 200.018    │
│ All natural disasters ┆ 1902 ┆ 46037   ┆ 46.037     │
│ All natural disasters ┆ 1903 ┆ 6506    ┆ 6.506      │
│ All natural disasters ┆ 1905 ┆ 22758   ┆ 22.758     │
│ …                     ┆ …    ┆ …       ┆ …          │
│ Wildfire              ┆ 2013 ┆ 35      ┆ 0.035      │
│ Wildfire              ┆ 2014 ┆ 16      ┆ 0.016      │
│ Wildfire              ┆ 2015 ┆ 67      ┆ 0.067      │
│ Wildfire              ┆ 2016 ┆ 39      ┆ 0.039      │
│ Wildfire              ┆ 2017 ┆ 75      ┆ 0.075      │
└───────────────────────┴──────┴─────────┴────────────┘

Notice that our new variable got added as the last column of the DataFrame.

Group by operations

If you want to perform operations on rows sharing the same value for some variable, you group those rows with polars.DataFrame.group_by.

For instance, if we want to get the total number of deaths for each category of disaster, we can do:

totals = df.group_by(
    pl.col("Entity")
).agg(
    pl.col("Deaths").sum()
)

print(totals)
shape: (11, 2)
┌───────────────────────┬──────────┐
│ Entity                ┆ Deaths   │
│ ---                   ┆ ---      │
│ str                   ┆ i64      │
╞═══════════════════════╪══════════╡
│ Extreme weather       ┆ 1396601  │
│ Landslide             ┆ 63068    │
│ Earthquake            ┆ 2576801  │
│ Flood                 ┆ 6954992  │
│ Epidemic              ┆ 9596463  │
│ …                     ┆ …        │
│ Volcanic activity     ┆ 96366    │
│ All natural disasters ┆ 32607156 │
│ Extreme temperature   ┆ 182604   │
│ Drought               ┆ 11731294 │
│ Wildfire              ┆ 3925     │
└───────────────────────┴──────────┘

Notice that the rows became out of order. Not to worry about order makes the code more efficient and does not affect future subsetting of our DataFrame. If you want to maintain the order however, you can use the maintain_order parameter (but this slows down the operation):

totals = df.group_by(
    pl.col("Entity"),
    maintain_order=True
).agg(
    pl.col("Deaths").sum()
)

print(totals)
shape: (11, 2)
┌───────────────────────┬──────────┐
│ Entity                ┆ Deaths   │
│ ---                   ┆ ---      │
│ str                   ┆ i64      │
╞═══════════════════════╪══════════╡
│ All natural disasters ┆ 32607156 │
│ Drought               ┆ 11731294 │
│ Earthquake            ┆ 2576801  │
│ Epidemic              ┆ 9596463  │
│ Extreme temperature   ┆ 182604   │
│ …                     ┆ …        │
│ Flood                 ┆ 6954992  │
│ Landslide             ┆ 63068    │
│ Mass movement (dry)   ┆ 5030     │
│ Volcanic activity     ┆ 96366    │
│ Wildfire              ┆ 3925     │
└───────────────────────┴──────────┘

Your turn:

Create a new DataFrame, ordered by year, that shows the total number of deaths for each year:

shape: (117, 2)
┌──────┬─────────┐
│ Year ┆ Deaths  │
│ ---  ┆ ---     │
│ i64  ┆ i64     │
╞══════╪═════════╡
│ 1900 ┆ 1267360 │
│ 1901 ┆ 200018  │
│ 1902 ┆ 46037   │
│ 1903 ┆ 6506    │
│ 1905 ┆ 22758   │
│ …    ┆ …       │
│ 2013 ┆ 22225   │
│ 2014 ┆ 20882   │
│ 2015 ┆ 23893   │
│ 2016 ┆ 10201   │
│ 2017 ┆ 2087    │
└──────┴─────────┘

Combining contexts

select, with_columns, filter, and group_by are called contexts in the Polars terminology (the data transformations performed in these contexts are called expressions).

Contexts can be combined. For instance, we can create a new DataFrame with the number of deaths for each decade:

decade_totals = df.filter(
    pl.col("Entity") == "All natural disasters"
).with_columns(
    (pl.col("Year") // 10 * 10).alias("Decade")
).group_by(
    pl.col("Decade"),
    maintain_order=True
).agg(
    pl.col("Deaths").sum()
)

print(decade_totals)
shape: (12, 2)
┌────────┬─────────┐
│ Decade ┆ Deaths  │
│ ---    ┆ ---     │
│ i64    ┆ i64     │
╞════════╪═════════╡
│ 1900   ┆ 4497847 │
│ 1910   ┆ 3326492 │
│ 1920   ┆ 8726293 │
│ 1930   ┆ 4701024 │
│ 1940   ┆ 3871695 │
│ …      ┆ …       │
│ 1970   ┆ 986867  │
│ 1980   ┆ 796782  │
│ 1990   ┆ 527613  │
│ 2000   ┆ 839986  │
│ 2010   ┆ 454950  │
└────────┴─────────┘

Or one with the number of deaths for that decade for each type of disaster:

decade_totals_by_type = df.with_columns(
    (pl.col("Year") // 10 * 10).alias("Decade"),
).group_by([pl.col("Decade"), pl.col("Entity")],
    maintain_order=True
).agg(
    pl.col("Deaths").sum()
)

print(decade_totals_by_type)
shape: (122, 3)
┌────────┬───────────────────────┬─────────┐
│ Decade ┆ Entity                ┆ Deaths  │
│ ---    ┆ ---                   ┆ ---     │
│ i64    ┆ str                   ┆ i64     │
╞════════╪═══════════════════════╪═════════╡
│ 1900   ┆ All natural disasters ┆ 4497847 │
│ 1910   ┆ All natural disasters ┆ 3326492 │
│ 1920   ┆ All natural disasters ┆ 8726293 │
│ 1930   ┆ All natural disasters ┆ 4701024 │
│ 1940   ┆ All natural disasters ┆ 3871695 │
│ …      ┆ …                     ┆ …       │
│ 1970   ┆ Wildfire              ┆ 5       │
│ 1980   ┆ Wildfire              ┆ 396     │
│ 1990   ┆ Wildfire              ┆ 859     │
│ 2000   ┆ Wildfire              ┆ 629     │
│ 2010   ┆ Wildfire              ┆ 429     │
└────────┴───────────────────────┴─────────┘