DataFrames with Polars

Author

Marie-Hélène Burle

Polars is a Python library built to manipulate data frames and time series.

For this section, we will use the Covid-19 data from the Johns Hopkins University CSSE repository.

You can visualize this data in a dashboard created by the Johns Hopkins University Center for Systems Science and Engineering.

Setup

First, we need to load the polars library and read in the data from the web:

# Load the Polars library and create a shorter name for it
import polars as pl

# The global confirmed cases are available in CSV format at the url:
url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"

# Polars allows to read in data from the web directly
cases = pl.read_csv(url)

First look at the data

What does our data look like?

cases
shape: (289, 1_147)
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 1/28/20 1/29/20 1/30/20 1/31/20 2/1/20 2/2/20 2/3/20 2/4/20 2/5/20 2/6/20 2/7/20 2/8/20 2/9/20 2/10/20 2/11/20 2/12/20 2/13/20 2/14/20 2/15/20 2/16/20 2/17/20 2/18/20 2/19/20 2/20/20 2/21/20 2/22/20 2/23/20 2/1/23 2/2/23 2/3/23 2/4/23 2/5/23 2/6/23 2/7/23 2/8/23 2/9/23 2/10/23 2/11/23 2/12/23 2/13/23 2/14/23 2/15/23 2/16/23 2/17/23 2/18/23 2/19/23 2/20/23 2/21/23 2/22/23 2/23/23 2/24/23 2/25/23 2/26/23 2/27/23 2/28/23 3/1/23 3/2/23 3/3/23 3/4/23 3/5/23 3/6/23 3/7/23 3/8/23 3/9/23
str str f64 f64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64 i64
null "Afghanistan" 33.93911 67.709953 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 208552 208669 208669 208621 208627 208704 208721 208771 208771 208943 208971 208982 209011 209036 209056 209072 209083 209084 209107 209153 209181 209181 209215 209230 209246 209274 209308 209322 209340 209358 209362 209369 209390 209406 209436 209451 209451
null "Albania" 41.1533 20.1683 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 334177 334187 334203 334204 334211 334211 334211 334222 334229 334229 334234 334255 334255 334264 334264 334273 334291 334305 334314 334315 334336 334336 334345 334356 334373 334378 334380 334391 334408 334408 334427 334427 334427 334427 334427 334443 334457
null "Algeria" 28.0339 1.6596 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 271385 271386 271394 271394 271394 271395 271399 271403 271406 271406 271409 271409 271409 271409 271421 271424 271424 271425 271425 271426 271428 271431 271432 271436 271439 271440 271440 271441 271448 271463 271469 271469 271477 271477 271490 271494 271496
null "Andorra" 42.5063 1.5218 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 47839 47839 47850 47850 47850 47850 47850 47850 47860 47860 47860 47860 47860 47860 47860 47866 47866 47866 47866 47866 47866 47866 47866 47866 47866 47866 47866 47866 47875 47875 47875 47875 47875 47875 47875 47890 47890
null "Angola" -11.2027 17.8739 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 105184 105184 105184 105184 105184 105184 105184 105184 105184 105184 105184 105184 105184 105184 105184 105184 105184 105184 105184 105184 105184 105205 105205 105205 105205 105205 105255 105255 105277 105277 105277 105277 105277 105277 105277 105288 105288
null "West Bank and Gaza" 31.9522 35.2332 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 703228 703228 703228 703228 703228 703228 703228 703228 703228 703228 703228 703228 703228 703228 703228 703228 703228 703228 703228 703228 703228 703228 703228 703228 703228 703228 703228 703228 703228 703228 703228 703228 703228 703228 703228 703228 703228
null "Winter Olympics 2022" 39.9042 116.4074 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 535 535 535 535 535 535 535 535 535 535 535 535 535 535 535 535 535 535 535 535 535 535 535 535 535 535 535 535 535 535 535 535 535 535 535 535 535
null "Yemen" 15.552727 48.516388 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 11945 11945 11945 11945 11945 11945 11945 11945 11945 11945 11945 11945 11945 11945 11945 11945 11945 11945 11945 11945 11945 11945 11945 11945 11945 11945 11945 11945 11945 11945 11945 11945 11945 11945 11945 11945 11945
null "Zambia" -13.133897 27.849332 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 340944 341261 341397 341573 341573 341690 341690 341936 342114 342114 342288 342288 342317 342317 342317 342317 342317 342317 342317 342724 342782 342831 342831 342831 342831 342831 342831 343012 343012 343079 343079 343079 343135 343135 343135 343135 343135
null "Zimbabwe" -19.015438 29.154857 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 262324 262324 262324 262324 262324 262324 262324 263083 263083 263083 263083 263083 263083 263083 263642 263642 263642 263642 263642 263642 263642 263921 263921 263921 263921 263921 263921 263921 264127 264127 264127 264127 264127 264127 264127 264276 264276
# Quick summary of the data
cases.describe()
shape: (9, 1_148)
statistic Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 1/28/20 1/29/20 1/30/20 1/31/20 2/1/20 2/2/20 2/3/20 2/4/20 2/5/20 2/6/20 2/7/20 2/8/20 2/9/20 2/10/20 2/11/20 2/12/20 2/13/20 2/14/20 2/15/20 2/16/20 2/17/20 2/18/20 2/19/20 2/20/20 2/21/20 2/22/20 2/1/23 2/2/23 2/3/23 2/4/23 2/5/23 2/6/23 2/7/23 2/8/23 2/9/23 2/10/23 2/11/23 2/12/23 2/13/23 2/14/23 2/15/23 2/16/23 2/17/23 2/18/23 2/19/23 2/20/23 2/21/23 2/22/23 2/23/23 2/24/23 2/25/23 2/26/23 2/27/23 2/28/23 3/1/23 3/2/23 3/3/23 3/4/23 3/5/23 3/6/23 3/7/23 3/8/23 3/9/23
str str str f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64
"count" "91" "289" 287.0 287.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0 289.0
"null_count" "198" "0" 2.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
"mean" null null 19.718719 22.182084 1.927336 2.273356 3.266436 4.972318 7.33564 10.134948 19.307958 21.346021 28.50173 34.349481 41.653979 58.086505 68.813149 82.695502 95.653979 106.595156 119.031142 128.480969 138.968858 147.99654 155.065744 156.512111 208.941176 231.529412 238.944637 246.49827 253.539792 260.051903 261.782007 263.723183 265.903114 272.0 2.3219e6 2.3228e6 2.3236e6 2.3240e6 2.3243e6 2.3248e6 2.3255e6 2.3263e6 2.3272e6 2.3277e6 2.3281e6 2.3284e6 2.3289e6 2.3295e6 2.3304e6 2.3311e6 2.3318e6 2.3321e6 2.3324e6 2.3327e6 2.3333e6 2.3342e6 2.3349e6 2.3354e6 2.3356e6 2.3358e6 2.3362e6 2.3368e6 2.3375e6 2.3382e6 2.3388e6 2.3390e6 2.3392e6 2.3394e6 2.3398e6 2.3405e6 2.3411e6
"std" null null 25.956609 77.870931 26.173664 26.270191 32.707271 45.523871 63.623197 85.724481 210.329649 211.628535 291.630499 345.759248 425.471558 661.171033 799.258224 985.038344 1160.785189 1304.872173 1472.006165 1598.378021 1747.162098 1870.449899 1966.785835 1967.121793 2838.127785 3202.384554 3310.754868 3424.390905 3530.604364 3630.11703 3650.660287 3674.805102 3687.812705 3771.362586 8.4392e6 8.4439e6 8.4473e6 8.4483e6 8.4491e6 8.4511e6 8.4544e6 8.4592e6 8.4639e6 8.4655e6 8.4669e6 8.4674e6 8.4695e6 8.4726e6 8.4776e6 8.4812e6 8.4843e6 8.4852e6 8.4857e6 8.4865e6 8.4900e6 8.4957e6 8.4993e6 8.5010e6 8.5014e6 8.5019e6 8.5037e6 8.5066e6 8.5113e6 8.5145e6 8.5180e6 8.5184e6 8.5186e6 8.5193e6 8.5216e6 8.5250e6 8.5278e6
"min" "Alberta" "Afghanistan" -71.9499 -178.1165 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
"25%" null null 4.210484 -23.0418 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 14567.0 14567.0 14567.0 14567.0 14567.0 14567.0 14567.0 14567.0 14567.0 14567.0 14567.0 14567.0 14567.0 14567.0 14567.0 14567.0 14567.0 14567.0 14567.0 14567.0 14567.0 14567.0 14567.0 14567.0 14567.0 14567.0 14567.0 14567.0 14567.0 14567.0 14567.0 14567.0 14567.0 14567.0 14567.0 14567.0 14567.0
"50%" null null 21.512583 20.9394 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 103248.0 103248.0 103248.0 103248.0 103248.0 103248.0 103248.0 103248.0 103248.0 103248.0 103248.0 103248.0 103248.0 103248.0 103248.0 103248.0 103248.0 103248.0 103248.0 103248.0 103248.0 103248.0 103248.0 103248.0 103248.0 103248.0 103248.0 103248.0 103248.0 103248.0 103248.0 103248.0 103248.0 103248.0 103248.0 103248.0 103248.0
"75%" null null 40.463667 90.3563 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.049457e6 1.049537e6 1.04964e6 1.049729e6 1.049828e6 1.049944e6 1.049944e6 1.05e6 1.050127e6 1.050241e6 1.05033e6 1.050402e6 1.050402e6 1.050549e6 1.050621e6 1.050736e6 1.050824e6 1.050961e6 1.051079e6 1.051162e6 1.051241e6 1.051336e6 1.051457e6 1.051589e6 1.051732e6 1.051862e6 1.05193e6 1.051998e6 1.052122e6 1.052247e6 1.052382e6 1.052519e6 1.052664e6 1.052664e6 1.052926e6 1.053068e6 1.053213e6
"max" "Zhejiang" "Zimbabwe" 71.7069 178.065 444.0 444.0 549.0 761.0 1058.0 1423.0 3554.0 3554.0 4903.0 5806.0 7153.0 11177.0 13522.0 16678.0 19665.0 22112.0 24953.0 27100.0 29631.0 31728.0 33366.0 33366.0 48206.0 54406.0 56249.0 58182.0 59989.0 61682.0 62031.0 62442.0 62662.0 64084.0 1.02479379e8 1.02561054e8 1.02599784e8 1.02603942e8 1.02606868e8 1.02631179e8 1.02676694e8 1.02760211e8 1.02842061e8 1.02855101e8 1.0285951e8 1.02862879e8 1.02886005e8 1.02934224e8 1.03023231e8 1.0308391e8 1.03131898e8 1.03134605e8 1.03136077e8 1.03138119e8 1.03198669e8 1.03308832e8 1.03365511e8 1.03378408e8 1.03381157e8 1.03382763e8 1.03399827e8 1.03443455e8 1.03533872e8 1.03589757e8 1.0364869e8 1.03650837e8 1.03646975e8 1.03655539e8 1.0369091e8 1.03755771e8 1.03802702e8

Of course, this value is meaningless for Lat and Long!

# Data types of the various columns
# cases.dtypes
# cases.info()
cases.shape
(289, 1147)

Cases per country by date

The dataset is a time series: this means that we have the cumulative numbers up to each date.

# Let's get rid of the latitude and longitude to simplify our data
simple = cases.drop(columns=['Lat', 'Long'])
simple
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[7], line 2
      1 # Let's get rid of the latitude and longitude to simplify our data
----> 2 simple = cases.drop(columns=['Lat', 'Long'])
      3 simple

TypeError: DataFrame.drop() got an unexpected keyword argument 'columns'


Some countries (e.g. Australia) are split between several provinces or states so we will have to add the values of all their provinces/states to get their totals.

Here is how to make the sum for all Australian states:

Let’s first select all the data for Australia: we want all the rows for which the Country/Region column is equal to Australia.

First, we want to select the Country/Region column. There are several ways to index in Polars.

When indexing columns, one can use square brackets directly after the DataFrame to index:

simple['Country/Region']
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[8], line 1
----> 1 simple['Country/Region']

NameError: name 'simple' is not defined

However, it is more efficient to use the .loc or .iloc methods.

  • Use .loc when using labels or booleans:
# simple.loc[:, 'Country/Region']
  • Use .iloc when using indices:
# simple.iloc[:, 1]

Country/Region is the 2nd column, but indexing starts at 0 in Python.

Then we need a conditional to filter the rows for which the value is equal to Australia:

# simple.loc[:, 'Country/Region'] == 'Australia'

Finally, we index, out of our entire data frame, the rows for which that condition returns True:

# simple.loc[simple.loc[:, 'Country/Region'] == 'Australia']

Here we use .loc to index based on a boolean array.

We can now make the sum for all of Australia for each day:

total_australia = simple.loc[simple.loc[:, 'Country/Region'] == 'Australia'].sum(numeric_only=True)
total_australia
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[13], line 1
----> 1 total_australia = simple.loc[simple.loc[:, 'Country/Region'] == 'Australia'].sum(numeric_only=True)
      2 total_australia

NameError: name 'simple' is not defined

We can do this for all countries by grouping them:

totals = simple.groupby('Country/Region').sum(numeric_only=True)
totals
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[14], line 1
----> 1 totals = simple.groupby('Country/Region').sum(numeric_only=True)
      2 totals

NameError: name 'simple' is not defined


Now, we can look at the totals for any date:

totals.loc[:, '6/12/21']
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[15], line 1
----> 1 totals.loc[:, '6/12/21']

NameError: name 'totals' is not defined

To make it easier to read, let’s order those numbers by decreasing order:

totals.loc[:, '6/12/21'].sort_values(ascending=False)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[16], line 1
----> 1 totals.loc[:, '6/12/21'].sort_values(ascending=False)

NameError: name 'totals' is not defined

We can also index the data for a particular country by indexing a row instead of a column:

totals.loc['Albania', :]
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[17], line 1
----> 1 totals.loc['Albania', :]

NameError: name 'totals' is not defined

When indexing rows, this syntax can be simplified to:

totals.loc['Albania']
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[18], line 1
----> 1 totals.loc['Albania']

NameError: name 'totals' is not defined

Global totals

Now, what if we want to have the world totals for each day? We calculate the columns totals (i.e. the sum across countries):

totals.sum()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[19], line 1
----> 1 totals.sum()

NameError: name 'totals' is not defined

Your turn:

How many confirmed cases were there in Venezuela by March 10, 2021?

First, we need to select the data for Venezuela:

venez = totals.loc['Venezuela']
venez
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[20], line 1
----> 1 venez = totals.loc['Venezuela']
      2 venez

NameError: name 'totals' is not defined

Then, we need to select for the proper date:

answer = venez.loc['3/10/21']
answer
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[21], line 1
----> 1 answer = venez.loc['3/10/21']
      2 answer

NameError: name 'venez' is not defined

We could have done it at once by indexing the row and column:

totals.loc['Venezuela', '3/10/21']
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[22], line 1
----> 1 totals.loc['Venezuela', '3/10/21']

NameError: name 'totals' is not defined

Polars documentation