Subsetting data

Author

Marie-Hélène Burle

The syntax to subset data is very different in Polars compared to the indexing of pandas and other languages. Action verbs are used in a style very similar to that of R’s dplyr from the tidyverse.

Let’s start with the same data frame we used in the previous section:

import polars as pl

df = pl.read_csv("https://raw.githubusercontent.com/razoumov/publish/master/jeopardy.csv")

print(df)

shape: (216_930, 7)
┌─────────────┬──────────┬───────────┬─────────────────┬─────────┬────────────────┬────────────────┐
│ Show Number ┆ Air Date ┆ Round     ┆ Category        ┆ Value   ┆ Question       ┆ Answer         │
│ ---         ┆ ---      ┆ ---       ┆ ---             ┆ ---     ┆ ---            ┆ ---            │
│ i64         ┆ str      ┆ str       ┆ str             ┆ str     ┆ str            ┆ str            │
╞═════════════╪══════════╪═══════════╪═════════════════╪═════════╪════════════════╪════════════════╡
│ 4680        ┆ 12/31/04 ┆ Jeopardy! ┆ HISTORY         ┆ $200    ┆ For the last 8 ┆ Copernicus     │
│             ┆          ┆           ┆                 ┆         ┆ years of his   ┆                │
│             ┆          ┆           ┆                 ┆         ┆ li…            ┆                │
│ 4680        ┆ 12/31/04 ┆ Jeopardy! ┆ ESPN's TOP 10   ┆ $200    ┆ No. 2: 1912    ┆ Jim Thorpe     │
│             ┆          ┆           ┆ ALL-TIME        ┆         ┆ Olympian;      ┆                │
│             ┆          ┆           ┆ ATHLETE…        ┆         ┆ football…      ┆                │
│ 4680        ┆ 12/31/04 ┆ Jeopardy! ┆ EVERYBODY TALKS ┆ $200    ┆ The city of    ┆ Arizona        │
│             ┆          ┆           ┆ ABOUT IT...     ┆         ┆ Yuma in this   ┆                │
│             ┆          ┆           ┆                 ┆         ┆ state…         ┆                │
│ 4680        ┆ 12/31/04 ┆ Jeopardy! ┆ THE COMPANY     ┆ $200    ┆ In 1963, live  ┆ McDonald's     │
│             ┆          ┆           ┆ LINE            ┆         ┆ on "The Art    ┆                │
│             ┆          ┆           ┆                 ┆         ┆ Link…          ┆                │
│ 4680        ┆ 12/31/04 ┆ Jeopardy! ┆ EPITAPHS &      ┆ $200    ┆ Signer of the  ┆ John Adams     │
│             ┆          ┆           ┆ TRIBUTES        ┆         ┆ Dec. of        ┆                │
│             ┆          ┆           ┆                 ┆         ┆ Indep., …      ┆                │
│ …           ┆ …        ┆ …         ┆ …               ┆ …       ┆ …              ┆ …              │
│ 4999        ┆ 5/11/06  ┆ Double    ┆ RIDDLE ME THIS  ┆ $2,000  ┆ This Puccini   ┆ Turandot       │
│             ┆          ┆ Jeopardy! ┆                 ┆         ┆ opera turns on ┆                │
│             ┆          ┆           ┆                 ┆         ┆ th…            ┆                │
│ 4999        ┆ 5/11/06  ┆ Double    ┆ "T" BIRDS       ┆ $2,000  ┆ In North       ┆ a titmouse     │
│             ┆          ┆ Jeopardy! ┆                 ┆         ┆ America this   ┆                │
│             ┆          ┆           ┆                 ┆         ┆ term is …      ┆                │
│ 4999        ┆ 5/11/06  ┆ Double    ┆ AUTHORS IN      ┆ $2,000  ┆ In Penny Lane, ┆ Clive Barker   │
│             ┆          ┆ Jeopardy! ┆ THEIR YOUTH     ┆         ┆ where this     ┆                │
│             ┆          ┆           ┆                 ┆         ┆ "Hel…          ┆                │
│ 4999        ┆ 5/11/06  ┆ Double    ┆ QUOTATIONS      ┆ $2,000  ┆ From Ft. Sill, ┆ Geronimo       │
│             ┆          ┆ Jeopardy! ┆                 ┆         ┆ Okla. he made  ┆                │
│             ┆          ┆           ┆                 ┆         ┆ t…             ┆                │
│ 4999        ┆ 5/11/06  ┆ Final     ┆ HISTORIC NAMES  ┆ None    ┆ A silent movie ┆ Grigori        │
│             ┆          ┆ Jeopardy! ┆                 ┆         ┆ title includes ┆ Alexandrovich  │
│             ┆          ┆           ┆                 ┆         ┆ …              ┆ Potemkin       │
└─────────────┴──────────┴───────────┴─────────────────┴─────────┴────────────────┴────────────────┘

Selecting rows

You can select rows based on any expression that evaluates to a Boolean with filter:

df_sub = df.filter(
    pl.col("Air Date") == "5/8/09"
    )

print(df_sub)

shape: (61, 7)
┌─────────────┬──────────┬───────────┬─────────────────┬─────────┬────────────────┬────────────────┐
│ Show Number ┆ Air Date ┆ Round     ┆ Category        ┆ Value   ┆ Question       ┆ Answer         │
│ ---         ┆ ---      ┆ ---       ┆ ---             ┆ ---     ┆ ---            ┆ ---            │
│ i64         ┆ str      ┆ str       ┆ str             ┆ str     ┆ str            ┆ str            │
╞═════════════╪══════════╪═══════════╪═════════════════╪═════════╪════════════════╪════════════════╡
│ 5690        ┆ 5/8/09   ┆ Jeopardy! ┆ OLD FOLKS IN    ┆ $200    ┆ goop.com is a  ┆ Gwyneth        │
│             ┆          ┆           ┆ THEIR 30s       ┆         ┆ lifestyles     ┆ Paltrow        │
│             ┆          ┆           ┆                 ┆         ┆ websi…         ┆                │
│ 5690        ┆ 5/8/09   ┆ Jeopardy! ┆ MOVIES & TV     ┆ $200    ┆ On March 19,   ┆ Jay Leno       │
│             ┆          ┆           ┆                 ┆         ┆ 2009 he said,  ┆                │
│             ┆          ┆           ┆                 ┆         ┆ "I'…           ┆                │
│ 5690        ┆ 5/8/09   ┆ Jeopardy! ┆ A STATE OF      ┆ $200    ┆ Baylor,        ┆ Texas          │
│             ┆          ┆           ┆ COLLEGE-NESS    ┆         ┆ Stephen F.     ┆                │
│             ┆          ┆           ┆                 ┆         ┆ Austin, Ric…   ┆                │
│ 5690        ┆ 5/8/09   ┆ Jeopardy! ┆ ANIMAL          ┆ $200    ┆ Synonym for    ┆ a pride        │
│             ┆          ┆           ┆ COLLECTIVE      ┆         ┆ dignity that's ┆                │
│             ┆          ┆           ┆                 ┆         ┆ the…           ┆                │
│ 5690        ┆ 5/8/09   ┆ Jeopardy! ┆ I'D RATHER BE   ┆ $200    ┆ If you're a    ┆ a bunny hill   │
│             ┆          ┆           ┆ SKIING          ┆         ┆ beginner, you  ┆                │
│             ┆          ┆           ┆                 ┆         ┆ migh…          ┆                │
│ …           ┆ …        ┆ …         ┆ …               ┆ …       ┆ …              ┆ …              │
│ 5690        ┆ 5/8/09   ┆ Double    ┆ ANATOMY         ┆ $2,000  ┆ The pons       ┆ the cerebellum │
│             ┆          ┆ Jeopardy! ┆                 ┆         ┆ connects the 2 ┆                │
│             ┆          ┆           ┆                 ┆         ┆ hemisp…        ┆                │
│ 5690        ┆ 5/8/09   ┆ Double    ┆ MATHEM-ATTACK!  ┆ $2,000  ┆ (<a href="http ┆ volume         │
│             ┆          ┆ Jeopardy! ┆                 ┆         ┆ ://www.j-archi ┆                │
│             ┆          ┆           ┆                 ┆         ┆ ve…            ┆                │
│ 5690        ┆ 5/8/09   ┆ Double    ┆ NAME THE DECADE ┆ $2,000  ┆ Man first      ┆ the 1910s      │
│             ┆          ┆ Jeopardy! ┆                 ┆         ┆ reaches the    ┆                │
│             ┆          ┆           ┆                 ┆         ┆ South Po…      ┆                │
│ 5690        ┆ 5/8/09   ┆ Double    ┆ WORD ORIGINS    ┆ $2,000  ┆ A type of ear  ┆ cochlear       │
│             ┆          ┆ Jeopardy! ┆                 ┆         ┆ implant to     ┆                │
│             ┆          ┆           ┆                 ┆         ┆ help …         ┆                │
│ 5690        ┆ 5/8/09   ┆ Final     ┆ EUROPEAN        ┆ None    ┆ He filed for   ┆ Henry VIII     │
│             ┆          ┆ Jeopardy! ┆ HISTORY         ┆         ┆ divorce citing ┆                │
│             ┆          ┆           ┆                 ┆         ┆ Le…            ┆                │
└─────────────┴──────────┴───────────┴─────────────────┴─────────┴────────────────┴────────────────┘

You can combine conditions:

df_sub = df.filter(
    pl.col("Air Date") == "5/8/09",
    pl.col("Round") != "Double Jeopardy!"
    )

print(df_sub)

shape: (31, 7)
┌─────────────┬──────────┬───────────┬─────────────────┬─────────┬────────────────┬────────────────┐
│ Show Number ┆ Air Date ┆ Round     ┆ Category        ┆ Value   ┆ Question       ┆ Answer         │
│ ---         ┆ ---      ┆ ---       ┆ ---             ┆ ---     ┆ ---            ┆ ---            │
│ i64         ┆ str      ┆ str       ┆ str             ┆ str     ┆ str            ┆ str            │
╞═════════════╪══════════╪═══════════╪═════════════════╪═════════╪════════════════╪════════════════╡
│ 5690        ┆ 5/8/09   ┆ Jeopardy! ┆ OLD FOLKS IN    ┆ $200    ┆ goop.com is a  ┆ Gwyneth        │
│             ┆          ┆           ┆ THEIR 30s       ┆         ┆ lifestyles     ┆ Paltrow        │
│             ┆          ┆           ┆                 ┆         ┆ websi…         ┆                │
│ 5690        ┆ 5/8/09   ┆ Jeopardy! ┆ MOVIES & TV     ┆ $200    ┆ On March 19,   ┆ Jay Leno       │
│             ┆          ┆           ┆                 ┆         ┆ 2009 he said,  ┆                │
│             ┆          ┆           ┆                 ┆         ┆ "I'…           ┆                │
│ 5690        ┆ 5/8/09   ┆ Jeopardy! ┆ A STATE OF      ┆ $200    ┆ Baylor,        ┆ Texas          │
│             ┆          ┆           ┆ COLLEGE-NESS    ┆         ┆ Stephen F.     ┆                │
│             ┆          ┆           ┆                 ┆         ┆ Austin, Ric…   ┆                │
│ 5690        ┆ 5/8/09   ┆ Jeopardy! ┆ ANIMAL          ┆ $200    ┆ Synonym for    ┆ a pride        │
│             ┆          ┆           ┆ COLLECTIVE      ┆         ┆ dignity that's ┆                │
│             ┆          ┆           ┆                 ┆         ┆ the…           ┆                │
│ 5690        ┆ 5/8/09   ┆ Jeopardy! ┆ I'D RATHER BE   ┆ $200    ┆ If you're a    ┆ a bunny hill   │
│             ┆          ┆           ┆ SKIING          ┆         ┆ beginner, you  ┆                │
│             ┆          ┆           ┆                 ┆         ┆ migh…          ┆                │
│ …           ┆ …        ┆ …         ┆ …               ┆ …       ┆ …              ┆ …              │
│ 5690        ┆ 5/8/09   ┆ Jeopardy! ┆ A STATE OF      ┆ $1,000  ┆ Grambling,     ┆ Louisiana      │
│             ┆          ┆           ┆ COLLEGE-NESS    ┆         ┆ McNeese State, ┆                │
│             ┆          ┆           ┆                 ┆         ┆ Sout…          ┆                │
│ 5690        ┆ 5/8/09   ┆ Jeopardy! ┆ ANIMAL          ┆ $1,000  ┆ A flock of     ┆ crows          │
│             ┆          ┆           ┆ COLLECTIVE      ┆         ┆ these black    ┆                │
│             ┆          ┆           ┆                 ┆         ┆ birds i…       ┆                │
│ 5690        ┆ 5/8/09   ┆ Jeopardy! ┆ I'D RATHER BE   ┆ $1,000  ┆ Bumps or       ┆ moguls         │
│             ┆          ┆           ┆ SKIING          ┆         ┆ mounds of snow ┆                │
│             ┆          ┆           ┆                 ┆         ┆ that a…        ┆                │
│ 5690        ┆ 5/8/09   ┆ Jeopardy! ┆ PARLEZ VOUS?    ┆ $1,000  ┆ "Huitieme" is  ┆ eighth         │
│             ┆          ┆           ┆                 ┆         ┆ French for     ┆                │
│             ┆          ┆           ┆                 ┆         ┆ this …         ┆                │
│ 5690        ┆ 5/8/09   ┆ Final     ┆ EUROPEAN        ┆ None    ┆ He filed for   ┆ Henry VIII     │
│             ┆          ┆ Jeopardy! ┆ HISTORY         ┆         ┆ divorce citing ┆                │
│             ┆          ┆           ┆                 ┆         ┆ Le…            ┆                │
└─────────────┴──────────┴───────────┴─────────────────┴─────────┴────────────────┴────────────────┘

Selecting columns

To select columns (variables), you use select:

df_sub = df.select(
    pl.col("Show Number"),
    pl.col("Category")
    )

print(df_sub)

shape: (216_930, 2)
┌─────────────┬─────────────────────────────────┐
│ Show Number ┆ Category                        │
│ ---         ┆ ---                             │
│ i64         ┆ str                             │
╞═════════════╪═════════════════════════════════╡
│ 4680        ┆ HISTORY                         │
│ 4680        ┆ ESPN's TOP 10 ALL-TIME ATHLETE… │
│ 4680        ┆ EVERYBODY TALKS ABOUT IT...     │
│ 4680        ┆ THE COMPANY LINE                │
│ 4680        ┆ EPITAPHS & TRIBUTES             │
│ …           ┆ …                               │
│ 4999        ┆ RIDDLE ME THIS                  │
│ 4999        ┆ "T" BIRDS                       │
│ 4999        ┆ AUTHORS IN THEIR YOUTH          │
│ 4999        ┆ QUOTATIONS                      │
│ 4999        ┆ HISTORIC NAMES                  │
└─────────────┴─────────────────────────────────┘

Creating new columns with output of expressions

The jeopardy dataset is made mostly of String variables. Let’s use another one here: the now archived global confirmed Covid-19 cases from John Hopkins University:

url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"

df = pl.read_csv(url)

print(df)

shape: (289, 1_147)
┌────────────────┬────────────────┬────────────┬───────────┬───┬────────┬────────┬────────┬────────┐
│ Province/State ┆ Country/Region ┆ Lat        ┆ Long      ┆ … ┆ 3/6/23 ┆ 3/7/23 ┆ 3/8/23 ┆ 3/9/23 │
│ ---            ┆ ---            ┆ ---        ┆ ---       ┆   ┆ ---    ┆ ---    ┆ ---    ┆ ---    │
│ str            ┆ str            ┆ f64        ┆ f64       ┆   ┆ i64    ┆ i64    ┆ i64    ┆ i64    │
╞════════════════╪════════════════╪════════════╪═══════════╪═══╪════════╪════════╪════════╪════════╡
│ null           ┆ Afghanistan    ┆ 33.93911   ┆ 67.709953 ┆ … ┆ 209406 ┆ 209436 ┆ 209451 ┆ 209451 │
│ null           ┆ Albania        ┆ 41.1533    ┆ 20.1683   ┆ … ┆ 334427 ┆ 334427 ┆ 334443 ┆ 334457 │
│ null           ┆ Algeria        ┆ 28.0339    ┆ 1.6596    ┆ … ┆ 271477 ┆ 271490 ┆ 271494 ┆ 271496 │
│ null           ┆ Andorra        ┆ 42.5063    ┆ 1.5218    ┆ … ┆ 47875  ┆ 47875  ┆ 47890  ┆ 47890  │
│ null           ┆ Angola         ┆ -11.2027   ┆ 17.8739   ┆ … ┆ 105277 ┆ 105277 ┆ 105288 ┆ 105288 │
│ …              ┆ …              ┆ …          ┆ …         ┆ … ┆ …      ┆ …      ┆ …      ┆ …      │
│ null           ┆ West Bank and  ┆ 31.9522    ┆ 35.2332   ┆ … ┆ 703228 ┆ 703228 ┆ 703228 ┆ 703228 │
│                ┆ Gaza           ┆            ┆           ┆   ┆        ┆        ┆        ┆        │
│ null           ┆ Winter         ┆ 39.9042    ┆ 116.4074  ┆ … ┆ 535    ┆ 535    ┆ 535    ┆ 535    │
│                ┆ Olympics 2022  ┆            ┆           ┆   ┆        ┆        ┆        ┆        │
│ null           ┆ Yemen          ┆ 15.552727  ┆ 48.516388 ┆ … ┆ 11945  ┆ 11945  ┆ 11945  ┆ 11945  │
│ null           ┆ Zambia         ┆ -13.133897 ┆ 27.849332 ┆ … ┆ 343135 ┆ 343135 ┆ 343135 ┆ 343135 │
│ null           ┆ Zimbabwe       ┆ -19.015438 ┆ 29.154857 ┆ … ┆ 264127 ┆ 264127 ┆ 264276 ┆ 264276 │
└────────────────┴────────────────┴────────────┴───────────┴───┴────────┴────────┴────────┴────────┘

To create a new variable called daily_avg with the daily average of new cases, we use select again, but this time we add an expression:

df_new = df.select(
    daily_avg=pl.col("3/9/23") / 1143
    )

print(df_new)

shape: (289, 1)
┌────────────┐
│ daily_avg  │
│ ---        │
│ f64        │
╞════════════╡
│ 183.246719 │
│ 292.613298 │
│ 237.529309 │
│ 41.898513  │
│ 92.115486  │
│ …          │
│ 615.247594 │
│ 0.468066   │
│ 10.450569  │
│ 300.205599 │
│ 231.212598 │
└────────────┘

Since the data is cumulative across dates, we took the last columns (totals cases for each row) and divided by the number of days of this dataset (total number of columns menus the four first columns).

If you want to keep all columns in the output, you use with_columns:

df_new = df.with_columns(
    daily_avg=pl.col("3/9/23") / 1143
    )

print(df_new)

shape: (289, 1_148)
┌──────────────┬──────────────┬────────────┬───────────┬───┬────────┬────────┬────────┬────────────┐
│ Province/Sta ┆ Country/Regi ┆ Lat        ┆ Long      ┆ … ┆ 3/7/23 ┆ 3/8/23 ┆ 3/9/23 ┆ daily_avg  │
│ te           ┆ on           ┆ ---        ┆ ---       ┆   ┆ ---    ┆ ---    ┆ ---    ┆ ---        │
│ ---          ┆ ---          ┆ f64        ┆ f64       ┆   ┆ i64    ┆ i64    ┆ i64    ┆ f64        │
│ str          ┆ str          ┆            ┆           ┆   ┆        ┆        ┆        ┆            │
╞══════════════╪══════════════╪════════════╪═══════════╪═══╪════════╪════════╪════════╪════════════╡
│ null         ┆ Afghanistan  ┆ 33.93911   ┆ 67.709953 ┆ … ┆ 209436 ┆ 209451 ┆ 209451 ┆ 183.246719 │
│ null         ┆ Albania      ┆ 41.1533    ┆ 20.1683   ┆ … ┆ 334427 ┆ 334443 ┆ 334457 ┆ 292.613298 │
│ null         ┆ Algeria      ┆ 28.0339    ┆ 1.6596    ┆ … ┆ 271490 ┆ 271494 ┆ 271496 ┆ 237.529309 │
│ null         ┆ Andorra      ┆ 42.5063    ┆ 1.5218    ┆ … ┆ 47875  ┆ 47890  ┆ 47890  ┆ 41.898513  │
│ null         ┆ Angola       ┆ -11.2027   ┆ 17.8739   ┆ … ┆ 105277 ┆ 105288 ┆ 105288 ┆ 92.115486  │
│ …            ┆ …            ┆ …          ┆ …         ┆ … ┆ …      ┆ …      ┆ …      ┆ …          │
│ null         ┆ West Bank    ┆ 31.9522    ┆ 35.2332   ┆ … ┆ 703228 ┆ 703228 ┆ 703228 ┆ 615.247594 │
│              ┆ and Gaza     ┆            ┆           ┆   ┆        ┆        ┆        ┆            │
│ null         ┆ Winter       ┆ 39.9042    ┆ 116.4074  ┆ … ┆ 535    ┆ 535    ┆ 535    ┆ 0.468066   │
│              ┆ Olympics     ┆            ┆           ┆   ┆        ┆        ┆        ┆            │
│              ┆ 2022         ┆            ┆           ┆   ┆        ┆        ┆        ┆            │
│ null         ┆ Yemen        ┆ 15.552727  ┆ 48.516388 ┆ … ┆ 11945  ┆ 11945  ┆ 11945  ┆ 10.450569  │
│ null         ┆ Zambia       ┆ -13.133897 ┆ 27.849332 ┆ … ┆ 343135 ┆ 343135 ┆ 343135 ┆ 300.205599 │
│ null         ┆ Zimbabwe     ┆ -19.015438 ┆ 29.154857 ┆ … ┆ 264127 ┆ 264276 ┆ 264276 ┆ 231.212598 │
└──────────────┴──────────────┴────────────┴───────────┴───┴────────┴────────┴────────┴────────────┘

Notice that our new variable got added as the last column of the data frame.

If we want to write in place, we can reassign the output to the initial data frame:

df = df.with_columns(
    daily_avg=pl.col("3/9/23") / 1143
    )

Group by operations

In this Covid-19 dataset some countries (e.g. Australia) are split between several provinces or states. If we want the total numbers for such countries we have to group the rows by the variable Country/Region, then get the sum for each group.

Getting the sums of the latitude and longitude wouldn’t make any sense, so first we get rid of those two columns:

df_clean = df.select(
    pl.col("*").exclude("Lat", "Long")
    )

print(df_clean)

shape: (289, 1_146)
┌────────────────┬─────────────────┬─────────┬─────────┬───┬────────┬────────┬────────┬────────────┐
│ Province/State ┆ Country/Region  ┆ 1/22/20 ┆ 1/23/20 ┆ … ┆ 3/7/23 ┆ 3/8/23 ┆ 3/9/23 ┆ daily_avg  │
│ ---            ┆ ---             ┆ ---     ┆ ---     ┆   ┆ ---    ┆ ---    ┆ ---    ┆ ---        │
│ str            ┆ str             ┆ i64     ┆ i64     ┆   ┆ i64    ┆ i64    ┆ i64    ┆ f64        │
╞════════════════╪═════════════════╪═════════╪═════════╪═══╪════════╪════════╪════════╪════════════╡
│ null           ┆ Afghanistan     ┆ 0       ┆ 0       ┆ … ┆ 209436 ┆ 209451 ┆ 209451 ┆ 183.246719 │
│ null           ┆ Albania         ┆ 0       ┆ 0       ┆ … ┆ 334427 ┆ 334443 ┆ 334457 ┆ 292.613298 │
│ null           ┆ Algeria         ┆ 0       ┆ 0       ┆ … ┆ 271490 ┆ 271494 ┆ 271496 ┆ 237.529309 │
│ null           ┆ Andorra         ┆ 0       ┆ 0       ┆ … ┆ 47875  ┆ 47890  ┆ 47890  ┆ 41.898513  │
│ null           ┆ Angola          ┆ 0       ┆ 0       ┆ … ┆ 105277 ┆ 105288 ┆ 105288 ┆ 92.115486  │
│ …              ┆ …               ┆ …       ┆ …       ┆ … ┆ …      ┆ …      ┆ …      ┆ …          │
│ null           ┆ West Bank and   ┆ 0       ┆ 0       ┆ … ┆ 703228 ┆ 703228 ┆ 703228 ┆ 615.247594 │
│                ┆ Gaza            ┆         ┆         ┆   ┆        ┆        ┆        ┆            │
│ null           ┆ Winter Olympics ┆ 0       ┆ 0       ┆ … ┆ 535    ┆ 535    ┆ 535    ┆ 0.468066   │
│                ┆ 2022            ┆         ┆         ┆   ┆        ┆        ┆        ┆            │
│ null           ┆ Yemen           ┆ 0       ┆ 0       ┆ … ┆ 11945  ┆ 11945  ┆ 11945  ┆ 10.450569  │
│ null           ┆ Zambia          ┆ 0       ┆ 0       ┆ … ┆ 343135 ┆ 343135 ┆ 343135 ┆ 300.205599 │
│ null           ┆ Zimbabwe        ┆ 0       ┆ 0       ┆ … ┆ 264127 ┆ 264276 ┆ 264276 ┆ 231.212598 │
└────────────────┴─────────────────┴─────────┴─────────┴───┴────────┴────────┴────────┴────────────┘

There are many ways to select columns from a data frame.

Now we can group by and get our sums:

df_countries = df_clean.group_by(
    (pl.col("Country/Region")).alias("Country totals")
    ).sum()

print(df_countries)

shape: (201, 1_147)
┌────────────┬────────────┬────────────┬─────────┬───┬──────────┬──────────┬──────────┬────────────┐
│ Country    ┆ Province/S ┆ Country/Re ┆ 1/22/20 ┆ … ┆ 3/7/23   ┆ 3/8/23   ┆ 3/9/23   ┆ daily_avg  │
│ totals     ┆ tate       ┆ gion       ┆ ---     ┆   ┆ ---      ┆ ---      ┆ ---      ┆ ---        │
│ ---        ┆ ---        ┆ ---        ┆ i64     ┆   ┆ i64      ┆ i64      ┆ i64      ┆ f64        │
│ str        ┆ str        ┆ str        ┆         ┆   ┆          ┆          ┆          ┆            │
╞════════════╪════════════╪════════════╪═════════╪═══╪══════════╪══════════╪══════════╪════════════╡
│ Mauritius  ┆ null       ┆ null       ┆ 0       ┆ … ┆ 295440   ┆ 296042   ┆ 296042   ┆ 259.004374 │
│ Tanzania   ┆ null       ┆ null       ┆ 0       ┆ … ┆ 42846    ┆ 42906    ┆ 42906    ┆ 37.538058  │
│ Burkina    ┆ null       ┆ null       ┆ 0       ┆ … ┆ 22056    ┆ 22056    ┆ 22056    ┆ 19.296588  │
│ Faso       ┆            ┆            ┆         ┆   ┆          ┆          ┆          ┆            │
│ Maldives   ┆ null       ┆ null       ┆ 0       ┆ … ┆ 185738   ┆ 185738   ┆ 185738   ┆ 162.500437 │
│ United     ┆ null       ┆ null       ┆ 0       ┆ … ┆ 24629846 ┆ 24629922 ┆ 24658705 ┆ 21573.6701 │
│ Kingdom    ┆            ┆            ┆         ┆   ┆          ┆          ┆          ┆ 66         │
│ …          ┆ …          ┆ …          ┆ …       ┆ … ┆ …        ┆ …        ┆ …        ┆ …          │
│ Israel     ┆ null       ┆ null       ┆ 0       ┆ … ┆ 4801502  ┆ 4803164  ┆ 4803824  ┆ 4202.82064 │
│            ┆            ┆            ┆         ┆   ┆          ┆          ┆          ┆ 7          │
│ Summer     ┆ null       ┆ null       ┆ 0       ┆ … ┆ 865      ┆ 865      ┆ 865      ┆ 0.75678    │
│ Olympics   ┆            ┆            ┆         ┆   ┆          ┆          ┆          ┆            │
│ 2020       ┆            ┆            ┆         ┆   ┆          ┆          ┆          ┆            │
│ San Marino ┆ null       ┆ null       ┆ 0       ┆ … ┆ 23616    ┆ 23616    ┆ 23616    ┆ 20.661417  │
│ Cambodia   ┆ null       ┆ null       ┆ 0       ┆ … ┆ 138719   ┆ 138719   ┆ 138719   ┆ 121.363955 │
│ Bahrain    ┆ null       ┆ null       ┆ 0       ┆ … ┆ 709858   ┆ 710306   ┆ 710693   ┆ 621.778653 │
└────────────┴────────────┴────────────┴─────────┴───┴──────────┴──────────┴──────────┴────────────┘

The alias method allows us to give a name to the groups.

Notice that the rows became out of order. Not to worry about order makes the code more efficient and does not affect future subsetting of our data frame. If you want to maintain the order however, you can use the maintain_order parameter:

df_countries = df_clean.group_by(
    (pl.col("Country/Region")).alias("Country"),
    maintain_order=True
    ).sum()

print(df_countries)

shape: (201, 1_147)
┌──────────────┬──────────────┬──────────────┬─────────┬───┬────────┬────────┬────────┬────────────┐
│ Country      ┆ Province/Sta ┆ Country/Regi ┆ 1/22/20 ┆ … ┆ 3/7/23 ┆ 3/8/23 ┆ 3/9/23 ┆ daily_avg  │
│ ---          ┆ te           ┆ on           ┆ ---     ┆   ┆ ---    ┆ ---    ┆ ---    ┆ ---        │
│ str          ┆ ---          ┆ ---          ┆ i64     ┆   ┆ i64    ┆ i64    ┆ i64    ┆ f64        │
│              ┆ str          ┆ str          ┆         ┆   ┆        ┆        ┆        ┆            │
╞══════════════╪══════════════╪══════════════╪═════════╪═══╪════════╪════════╪════════╪════════════╡
│ Afghanistan  ┆ null         ┆ null         ┆ 0       ┆ … ┆ 209436 ┆ 209451 ┆ 209451 ┆ 183.246719 │
│ Albania      ┆ null         ┆ null         ┆ 0       ┆ … ┆ 334427 ┆ 334443 ┆ 334457 ┆ 292.613298 │
│ Algeria      ┆ null         ┆ null         ┆ 0       ┆ … ┆ 271490 ┆ 271494 ┆ 271496 ┆ 237.529309 │
│ Andorra      ┆ null         ┆ null         ┆ 0       ┆ … ┆ 47875  ┆ 47890  ┆ 47890  ┆ 41.898513  │
│ Angola       ┆ null         ┆ null         ┆ 0       ┆ … ┆ 105277 ┆ 105288 ┆ 105288 ┆ 92.115486  │
│ …            ┆ …            ┆ …            ┆ …       ┆ … ┆ …      ┆ …      ┆ …      ┆ …          │
│ West Bank    ┆ null         ┆ null         ┆ 0       ┆ … ┆ 703228 ┆ 703228 ┆ 703228 ┆ 615.247594 │
│ and Gaza     ┆              ┆              ┆         ┆   ┆        ┆        ┆        ┆            │
│ Winter       ┆ null         ┆ null         ┆ 0       ┆ … ┆ 535    ┆ 535    ┆ 535    ┆ 0.468066   │
│ Olympics     ┆              ┆              ┆         ┆   ┆        ┆        ┆        ┆            │
│ 2022         ┆              ┆              ┆         ┆   ┆        ┆        ┆        ┆            │
│ Yemen        ┆ null         ┆ null         ┆ 0       ┆ … ┆ 11945  ┆ 11945  ┆ 11945  ┆ 10.450569  │
│ Zambia       ┆ null         ┆ null         ┆ 0       ┆ … ┆ 343135 ┆ 343135 ┆ 343135 ┆ 300.205599 │
│ Zimbabwe     ┆ null         ┆ null         ┆ 0       ┆ … ┆ 264127 ┆ 264276 ┆ 264276 ┆ 231.212598 │
└──────────────┴──────────────┴──────────────┴─────────┴───┴────────┴────────┴────────┴────────────┘

Your turn:

The old Country/Region column is now irrelevant. Remove it from df_countries.
How could you get the total number of cases for each day for the whole world?