The syntax to subset data is very different in Polars compared to the indexing of pandas and other languages. Action verbs are used in a style very similar to that of R’s dplyr from the tidyverse .
Let’s read in another online dataset from Vega-Altair into a DataFrame:
import polars as pl
df = pl.read_csv("https://cdn.jsdelivr.net/npm/vega-datasets/data/disasters.csv" )
print (df)
shape: (803, 3)
┌───────────────────────┬──────┬─────────┐
│ Entity ┆ Year ┆ Deaths │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═══════════════════════╪══════╪═════════╡
│ All natural disasters ┆ 1900 ┆ 1267360 │
│ All natural disasters ┆ 1901 ┆ 200018 │
│ All natural disasters ┆ 1902 ┆ 46037 │
│ All natural disasters ┆ 1903 ┆ 6506 │
│ All natural disasters ┆ 1905 ┆ 22758 │
│ … ┆ … ┆ … │
│ Wildfire ┆ 2013 ┆ 35 │
│ Wildfire ┆ 2014 ┆ 16 │
│ Wildfire ┆ 2015 ┆ 67 │
│ Wildfire ┆ 2016 ┆ 39 │
│ Wildfire ┆ 2017 ┆ 75 │
└───────────────────────┴──────┴─────────┘
Selecting rows
You can create a new DataFrame with a subset of rows matching some condition with polars.DataFrame.filter .
Let’s select rows for the year 2001. For this, we select the column Year by its name with polars.col and return the rows when the values for that column equal 2001:
df_sub_row = df.filter (pl.col("Year" ) == 2001 )
print (df_sub_row)
shape: (9, 3)
┌───────────────────────┬──────┬────────┐
│ Entity ┆ Year ┆ Deaths │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═══════════════════════╪══════╪════════╡
│ All natural disasters ┆ 2001 ┆ 39493 │
│ Drought ┆ 2001 ┆ 99 │
│ Earthquake ┆ 2001 ┆ 21348 │
│ Epidemic ┆ 2001 ┆ 8515 │
│ Extreme temperature ┆ 2001 ┆ 1787 │
│ Extreme weather ┆ 2001 ┆ 1911 │
│ Flood ┆ 2001 ┆ 5014 │
│ Landslide ┆ 2001 ┆ 786 │
│ Wildfire ┆ 2001 ┆ 33 │
└───────────────────────┴──────┴────────┘
You can combine multiple conditions:
df_sub_row = df.filter (
pl.col("Year" ) == 2001 ,
pl.col("Entity" ) == "Flood"
)
print (df_sub_row)
shape: (1, 3)
┌────────┬──────┬────────┐
│ Entity ┆ Year ┆ Deaths │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞════════╪══════╪════════╡
│ Flood ┆ 2001 ┆ 5014 │
└────────┴──────┴────────┘
Selecting columns
To select columns (variables), you use polars.DataFrame.select :
df_sub_col = df.select(
pl.col("Entity" ),
pl.col("Year" )
)
print (df_sub_col)
shape: (803, 2)
┌───────────────────────┬──────┐
│ Entity ┆ Year │
│ --- ┆ --- │
│ str ┆ i64 │
╞═══════════════════════╪══════╡
│ All natural disasters ┆ 1900 │
│ All natural disasters ┆ 1901 │
│ All natural disasters ┆ 1902 │
│ All natural disasters ┆ 1903 │
│ All natural disasters ┆ 1905 │
│ … ┆ … │
│ Wildfire ┆ 2013 │
│ Wildfire ┆ 2014 │
│ Wildfire ┆ 2015 │
│ Wildfire ┆ 2016 │
│ Wildfire ┆ 2017 │
└───────────────────────┴──────┘
Note that if you select a single column, you still have a DataFrame:
df_onecol = df.select(pl.col("Entity" ))
print (df_onecol)
shape: (803, 1)
┌───────────────────────┐
│ Entity │
│ --- │
│ str │
╞═══════════════════════╡
│ All natural disasters │
│ All natural disasters │
│ All natural disasters │
│ All natural disasters │
│ All natural disasters │
│ … │
│ Wildfire │
│ Wildfire │
│ Wildfire │
│ Wildfire │
│ Wildfire │
└───────────────────────┘
polars.dataframe.frame.DataFrame
Modifying selected columns
You can modify selected columns to create new columns. The new columns can be named with polars.Expr.alias :
df_col_mod = df.select(
pl.col("Entity" ),
pl.col("Year" ),
(pl.col("Deaths" ) / 1000 ).alias("Kilodeaths" )
)
print (df_col_mod)
shape: (803, 3)
┌───────────────────────┬──────┬────────────┐
│ Entity ┆ Year ┆ Kilodeaths │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f64 │
╞═══════════════════════╪══════╪════════════╡
│ All natural disasters ┆ 1900 ┆ 1267.36 │
│ All natural disasters ┆ 1901 ┆ 200.018 │
│ All natural disasters ┆ 1902 ┆ 46.037 │
│ All natural disasters ┆ 1903 ┆ 6.506 │
│ All natural disasters ┆ 1905 ┆ 22.758 │
│ … ┆ … ┆ … │
│ Wildfire ┆ 2013 ┆ 0.035 │
│ Wildfire ┆ 2014 ┆ 0.016 │
│ Wildfire ┆ 2015 ┆ 0.067 │
│ Wildfire ┆ 2016 ┆ 0.039 │
│ Wildfire ┆ 2017 ┆ 0.075 │
└───────────────────────┴──────┴────────────┘
Or you can name the new columns by assigning their values to the new names:
df_col_mod = df.select(
pl.col("Entity" ),
pl.col("Year" ),
Kilodeaths= pl.col("Deaths" ) / 1000
)
print (df_col_mod)
shape: (803, 3)
┌───────────────────────┬──────┬────────────┐
│ Entity ┆ Year ┆ Kilodeaths │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f64 │
╞═══════════════════════╪══════╪════════════╡
│ All natural disasters ┆ 1900 ┆ 1267.36 │
│ All natural disasters ┆ 1901 ┆ 200.018 │
│ All natural disasters ┆ 1902 ┆ 46.037 │
│ All natural disasters ┆ 1903 ┆ 6.506 │
│ All natural disasters ┆ 1905 ┆ 22.758 │
│ … ┆ … ┆ … │
│ Wildfire ┆ 2013 ┆ 0.035 │
│ Wildfire ┆ 2014 ┆ 0.016 │
│ Wildfire ┆ 2015 ┆ 0.067 │
│ Wildfire ┆ 2016 ┆ 0.039 │
│ Wildfire ┆ 2017 ┆ 0.075 │
└───────────────────────┴──────┴────────────┘
Adding modified columns
If you want to add the modified columns to the initial DataFrame (instead of selecting them), you use polars.DataFrame.with_columns . The naming works in the same way:
df_mod = df.with_columns((pl.col("Deaths" ) / 1000 ).alias("Kilodeaths" ))
print (df_mod)
shape: (803, 4)
┌───────────────────────┬──────┬─────────┬────────────┐
│ Entity ┆ Year ┆ Deaths ┆ Kilodeaths │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ f64 │
╞═══════════════════════╪══════╪═════════╪════════════╡
│ All natural disasters ┆ 1900 ┆ 1267360 ┆ 1267.36 │
│ All natural disasters ┆ 1901 ┆ 200018 ┆ 200.018 │
│ All natural disasters ┆ 1902 ┆ 46037 ┆ 46.037 │
│ All natural disasters ┆ 1903 ┆ 6506 ┆ 6.506 │
│ All natural disasters ┆ 1905 ┆ 22758 ┆ 22.758 │
│ … ┆ … ┆ … ┆ … │
│ Wildfire ┆ 2013 ┆ 35 ┆ 0.035 │
│ Wildfire ┆ 2014 ┆ 16 ┆ 0.016 │
│ Wildfire ┆ 2015 ┆ 67 ┆ 0.067 │
│ Wildfire ┆ 2016 ┆ 39 ┆ 0.039 │
│ Wildfire ┆ 2017 ┆ 75 ┆ 0.075 │
└───────────────────────┴──────┴─────────┴────────────┘
Or, with the alternative column naming:
df_mod = df.with_columns(Kilodeaths= pl.col("Deaths" ) / 1000 )
print (df_mod)
shape: (803, 4)
┌───────────────────────┬──────┬─────────┬────────────┐
│ Entity ┆ Year ┆ Deaths ┆ Kilodeaths │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ f64 │
╞═══════════════════════╪══════╪═════════╪════════════╡
│ All natural disasters ┆ 1900 ┆ 1267360 ┆ 1267.36 │
│ All natural disasters ┆ 1901 ┆ 200018 ┆ 200.018 │
│ All natural disasters ┆ 1902 ┆ 46037 ┆ 46.037 │
│ All natural disasters ┆ 1903 ┆ 6506 ┆ 6.506 │
│ All natural disasters ┆ 1905 ┆ 22758 ┆ 22.758 │
│ … ┆ … ┆ … ┆ … │
│ Wildfire ┆ 2013 ┆ 35 ┆ 0.035 │
│ Wildfire ┆ 2014 ┆ 16 ┆ 0.016 │
│ Wildfire ┆ 2015 ┆ 67 ┆ 0.067 │
│ Wildfire ┆ 2016 ┆ 39 ┆ 0.039 │
│ Wildfire ┆ 2017 ┆ 75 ┆ 0.075 │
└───────────────────────┴──────┴─────────┴────────────┘
Notice that our new variable got added as the last column of the DataFrame.
Group by operations
If you want to perform operations on rows sharing the same value for some variable, you group those rows with polars.DataFrame.group_by .
For instance, if we want to get the total number of deaths for each category of disaster, we can do:
totals = df.group_by(
pl.col("Entity" )
).agg(
pl.col("Deaths" ).sum ()
)
print (totals)
shape: (11, 2)
┌───────────────────────┬──────────┐
│ Entity ┆ Deaths │
│ --- ┆ --- │
│ str ┆ i64 │
╞═══════════════════════╪══════════╡
│ Extreme weather ┆ 1396601 │
│ Landslide ┆ 63068 │
│ Earthquake ┆ 2576801 │
│ Flood ┆ 6954992 │
│ Epidemic ┆ 9596463 │
│ … ┆ … │
│ Volcanic activity ┆ 96366 │
│ All natural disasters ┆ 32607156 │
│ Extreme temperature ┆ 182604 │
│ Drought ┆ 11731294 │
│ Wildfire ┆ 3925 │
└───────────────────────┴──────────┘
Notice that the rows became out of order. Not to worry about order makes the code more efficient and does not affect future subsetting of our DataFrame. If you want to maintain the order however, you can use the maintain_order parameter (but this slows down the operation):
totals = df.group_by(
pl.col("Entity" ),
maintain_order= True
).agg(
pl.col("Deaths" ).sum ()
)
print (totals)
shape: (11, 2)
┌───────────────────────┬──────────┐
│ Entity ┆ Deaths │
│ --- ┆ --- │
│ str ┆ i64 │
╞═══════════════════════╪══════════╡
│ All natural disasters ┆ 32607156 │
│ Drought ┆ 11731294 │
│ Earthquake ┆ 2576801 │
│ Epidemic ┆ 9596463 │
│ Extreme temperature ┆ 182604 │
│ … ┆ … │
│ Flood ┆ 6954992 │
│ Landslide ┆ 63068 │
│ Mass movement (dry) ┆ 5030 │
│ Volcanic activity ┆ 96366 │
│ Wildfire ┆ 3925 │
└───────────────────────┴──────────┘
Create a new DataFrame, ordered by year, that shows the total number of deaths for each year:
shape: (117, 2)
┌──────┬─────────┐
│ Year ┆ Deaths │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════╪═════════╡
│ 1900 ┆ 1267360 │
│ 1901 ┆ 200018 │
│ 1902 ┆ 46037 │
│ 1903 ┆ 6506 │
│ 1905 ┆ 22758 │
│ … ┆ … │
│ 2013 ┆ 22225 │
│ 2014 ┆ 20882 │
│ 2015 ┆ 23893 │
│ 2016 ┆ 10201 │
│ 2017 ┆ 2087 │
└──────┴─────────┘
Combining contexts
select, with_columns, filter, and group_by are called contexts in the Polars terminology (the data transformations performed in these contexts are called expressions ).
Contexts can be combined. For instance, we can create a new DataFrame with the number of deaths for each decade:
decade_totals = df.filter (
pl.col("Entity" ) == "All natural disasters"
).with_columns(
(pl.col("Year" ) // 10 * 10 ).alias("Decade" )
).group_by(
pl.col("Decade" ),
maintain_order= True
).agg(
pl.col("Deaths" ).sum ()
)
print (decade_totals)
shape: (12, 2)
┌────────┬─────────┐
│ Decade ┆ Deaths │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞════════╪═════════╡
│ 1900 ┆ 4497847 │
│ 1910 ┆ 3326492 │
│ 1920 ┆ 8726293 │
│ 1930 ┆ 4701024 │
│ 1940 ┆ 3871695 │
│ … ┆ … │
│ 1970 ┆ 986867 │
│ 1980 ┆ 796782 │
│ 1990 ┆ 527613 │
│ 2000 ┆ 839986 │
│ 2010 ┆ 454950 │
└────────┴─────────┘
Or one with the number of deaths for that decade for each type of disaster:
decade_totals_by_type = df.with_columns(
(pl.col("Year" ) // 10 * 10 ).alias("Decade" ),
).group_by([pl.col("Decade" ), pl.col("Entity" )],
maintain_order= True
).agg(
pl.col("Deaths" ).sum ()
)
print (decade_totals_by_type)
shape: (122, 3)
┌────────┬───────────────────────┬─────────┐
│ Decade ┆ Entity ┆ Deaths │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 │
╞════════╪═══════════════════════╪═════════╡
│ 1900 ┆ All natural disasters ┆ 4497847 │
│ 1910 ┆ All natural disasters ┆ 3326492 │
│ 1920 ┆ All natural disasters ┆ 8726293 │
│ 1930 ┆ All natural disasters ┆ 4701024 │
│ 1940 ┆ All natural disasters ┆ 3871695 │
│ … ┆ … ┆ … │
│ 1970 ┆ Wildfire ┆ 5 │
│ 1980 ┆ Wildfire ┆ 396 │
│ 1990 ┆ Wildfire ┆ 859 │
│ 2000 ┆ Wildfire ┆ 629 │
│ 2010 ┆ Wildfire ┆ 429 │
└────────┴───────────────────────┴─────────┘