DataFrames with pandas

Author

Marie-Hélène Burle

pandas is a Python library built to manipulate data frames and time series.

For this section, we will use the Covid-19 data from the Johns Hopkins University CSSE repository.

You can visualize this data in a dashboard created by the Johns Hopkins University Center for Systems Science and Engineering.

Setup

First, we need to load the pandas library and read in the data from the web:

# Load the pandas library and create a shorter name for it
import pandas as pd

# The global confirmed cases are available in CSV format at the url:
url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"

# pandas allows to read in data from the web directly
cases = pd.read_csv(url)

First look at the data

What does our data look like?

cases
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 2/28/23 3/1/23 3/2/23 3/3/23 3/4/23 3/5/23 3/6/23 3/7/23 3/8/23 3/9/23
0 NaN Afghanistan 33.939110 67.709953 0 0 0 0 0 0 ... 209322 209340 209358 209362 209369 209390 209406 209436 209451 209451
1 NaN Albania 41.153300 20.168300 0 0 0 0 0 0 ... 334391 334408 334408 334427 334427 334427 334427 334427 334443 334457
2 NaN Algeria 28.033900 1.659600 0 0 0 0 0 0 ... 271441 271448 271463 271469 271469 271477 271477 271490 271494 271496
3 NaN Andorra 42.506300 1.521800 0 0 0 0 0 0 ... 47866 47875 47875 47875 47875 47875 47875 47875 47890 47890
4 NaN Angola -11.202700 17.873900 0 0 0 0 0 0 ... 105255 105277 105277 105277 105277 105277 105277 105277 105288 105288
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
284 NaN West Bank and Gaza 31.952200 35.233200 0 0 0 0 0 0 ... 703228 703228 703228 703228 703228 703228 703228 703228 703228 703228
285 NaN Winter Olympics 2022 39.904200 116.407400 0 0 0 0 0 0 ... 535 535 535 535 535 535 535 535 535 535
286 NaN Yemen 15.552727 48.516388 0 0 0 0 0 0 ... 11945 11945 11945 11945 11945 11945 11945 11945 11945 11945
287 NaN Zambia -13.133897 27.849332 0 0 0 0 0 0 ... 343012 343012 343079 343079 343079 343135 343135 343135 343135 343135
288 NaN Zimbabwe -19.015438 29.154857 0 0 0 0 0 0 ... 263921 264127 264127 264127 264127 264127 264127 264127 264276 264276

289 rows × 1147 columns

# Quick summary of the data
cases.describe()
Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 1/28/20 1/29/20 ... 2/28/23 3/1/23 3/2/23 3/3/23 3/4/23 3/5/23 3/6/23 3/7/23 3/8/23 3/9/23
count 287.000000 287.000000 289.000000 289.000000 289.000000 289.000000 289.000000 289.000000 289.000000 289.000000 ... 2.890000e+02 2.890000e+02 2.890000e+02 2.890000e+02 2.890000e+02 2.890000e+02 2.890000e+02 2.890000e+02 2.890000e+02 2.890000e+02
mean 19.718719 22.182084 1.927336 2.273356 3.266436 4.972318 7.335640 10.134948 19.307958 21.346021 ... 2.336755e+06 2.337519e+06 2.338173e+06 2.338805e+06 2.338992e+06 2.339187e+06 2.339387e+06 2.339839e+06 2.340460e+06 2.341073e+06
std 25.956609 77.870931 26.173664 26.270191 32.707271 45.523871 63.623197 85.724481 210.329649 211.628535 ... 8.506608e+06 8.511285e+06 8.514488e+06 8.518031e+06 8.518408e+06 8.518645e+06 8.519346e+06 8.521641e+06 8.524968e+06 8.527765e+06
min -71.949900 -178.116500 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
25% 4.072192 -32.823050 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 1.456700e+04 1.456700e+04 1.456700e+04 1.456700e+04 1.456700e+04 1.456700e+04 1.456700e+04 1.456700e+04 1.456700e+04 1.456700e+04
50% 21.512583 20.939400 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 1.032480e+05 1.032480e+05 1.032480e+05 1.032480e+05 1.032480e+05 1.032480e+05 1.032480e+05 1.032480e+05 1.032480e+05 1.032480e+05
75% 40.401784 89.224350 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 1.051998e+06 1.052122e+06 1.052247e+06 1.052382e+06 1.052519e+06 1.052664e+06 1.052664e+06 1.052926e+06 1.053068e+06 1.053213e+06
max 71.706900 178.065000 444.000000 444.000000 549.000000 761.000000 1058.000000 1423.000000 3554.000000 3554.000000 ... 1.034435e+08 1.035339e+08 1.035898e+08 1.036487e+08 1.036508e+08 1.036470e+08 1.036555e+08 1.036909e+08 1.037558e+08 1.038027e+08

8 rows × 1145 columns

Of course, this value is meaningless for Lat and Long!

# Data types of the various columns
cases.dtypes
Province/State     object
Country/Region     object
Lat               float64
Long              float64
1/22/20             int64
                   ...   
3/5/23              int64
3/6/23              int64
3/7/23              int64
3/8/23              int64
3/9/23              int64
Length: 1147, dtype: object
cases.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 289 entries, 0 to 288
Columns: 1147 entries, Province/State to 3/9/23
dtypes: float64(2), int64(1143), object(2)
memory usage: 2.5+ MB
cases.shape
(289, 1147)

Cases per country by date

The dataset is a time series: this means that we have the cumulative numbers up to each date.

# Let's get rid of the latitude and longitude to simplify our data
simple = cases.drop(columns=['Lat', 'Long'])
simple
Province/State Country/Region 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 1/28/20 1/29/20 ... 2/28/23 3/1/23 3/2/23 3/3/23 3/4/23 3/5/23 3/6/23 3/7/23 3/8/23 3/9/23
0 NaN Afghanistan 0 0 0 0 0 0 0 0 ... 209322 209340 209358 209362 209369 209390 209406 209436 209451 209451
1 NaN Albania 0 0 0 0 0 0 0 0 ... 334391 334408 334408 334427 334427 334427 334427 334427 334443 334457
2 NaN Algeria 0 0 0 0 0 0 0 0 ... 271441 271448 271463 271469 271469 271477 271477 271490 271494 271496
3 NaN Andorra 0 0 0 0 0 0 0 0 ... 47866 47875 47875 47875 47875 47875 47875 47875 47890 47890
4 NaN Angola 0 0 0 0 0 0 0 0 ... 105255 105277 105277 105277 105277 105277 105277 105277 105288 105288
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
284 NaN West Bank and Gaza 0 0 0 0 0 0 0 0 ... 703228 703228 703228 703228 703228 703228 703228 703228 703228 703228
285 NaN Winter Olympics 2022 0 0 0 0 0 0 0 0 ... 535 535 535 535 535 535 535 535 535 535
286 NaN Yemen 0 0 0 0 0 0 0 0 ... 11945 11945 11945 11945 11945 11945 11945 11945 11945 11945
287 NaN Zambia 0 0 0 0 0 0 0 0 ... 343012 343012 343079 343079 343079 343135 343135 343135 343135 343135
288 NaN Zimbabwe 0 0 0 0 0 0 0 0 ... 263921 264127 264127 264127 264127 264127 264127 264127 264276 264276

289 rows × 1145 columns


Some countries (e.g. Australia) are split between several provinces or states so we will have to add the values of all their provinces/states to get their totals.

Here is how to make the sum for all Australian states:

Let’s first select all the data for Australia: we want all the rows for which the Country/Region column is equal to Australia.

First, we want to select the Country/Region column. There are several ways to index in pandas.

When indexing columns, one can use square brackets directly after the DataFrame to index:

simple['Country/Region']
0               Afghanistan
1                   Albania
2                   Algeria
3                   Andorra
4                    Angola
               ...         
284      West Bank and Gaza
285    Winter Olympics 2022
286                   Yemen
287                  Zambia
288                Zimbabwe
Name: Country/Region, Length: 289, dtype: object

However, it is more efficient to use the .loc or .iloc methods.

  • Use .loc when using labels or booleans:
simple.loc[:, 'Country/Region']
0               Afghanistan
1                   Albania
2                   Algeria
3                   Andorra
4                    Angola
               ...         
284      West Bank and Gaza
285    Winter Olympics 2022
286                   Yemen
287                  Zambia
288                Zimbabwe
Name: Country/Region, Length: 289, dtype: object
  • Use .iloc when using indices:
simple.iloc[:, 1]
0               Afghanistan
1                   Albania
2                   Algeria
3                   Andorra
4                    Angola
               ...         
284      West Bank and Gaza
285    Winter Olympics 2022
286                   Yemen
287                  Zambia
288                Zimbabwe
Name: Country/Region, Length: 289, dtype: object

Country/Region is the 2nd column, but indexing starts at 0 in Python.

Then we need a conditional to filter the rows for which the value is equal to Australia:

simple.loc[:, 'Country/Region'] == 'Australia'
0      False
1      False
2      False
3      False
4      False
       ...  
284    False
285    False
286    False
287    False
288    False
Name: Country/Region, Length: 289, dtype: bool

Finally, we index, out of our entire data frame, the rows for which that condition returns True:

simple.loc[simple.loc[:, 'Country/Region'] == 'Australia']
Province/State Country/Region 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 1/28/20 1/29/20 ... 2/28/23 3/1/23 3/2/23 3/3/23 3/4/23 3/5/23 3/6/23 3/7/23 3/8/23 3/9/23
9 Australian Capital Territory Australia 0 0 0 0 0 0 0 0 ... 232018 232018 232619 232619 232619 232619 232619 232619 232619 232974
10 New South Wales Australia 0 0 0 0 3 4 4 4 ... 3900969 3900969 3908129 3908129 3908129 3908129 3908129 3908129 3908129 3915992
11 Northern Territory Australia 0 0 0 0 0 0 0 0 ... 104931 104931 105021 105021 105021 105021 105021 105021 105021 105111
12 Queensland Australia 0 0 0 0 0 0 0 1 ... 1796633 1796633 1800236 1800236 1800236 1800236 1800236 1800236 1800236 1800236
13 South Australia Australia 0 0 0 0 0 0 0 0 ... 880207 880207 881911 881911 881911 881911 881911 881911 881911 883620
14 Tasmania Australia 0 0 0 0 0 0 0 0 ... 286264 286264 286264 286897 286897 286897 286897 286897 286897 287507
15 Victoria Australia 0 0 0 0 1 1 1 1 ... 2874262 2874262 2877260 2877260 2877260 2877260 2877260 2877260 2877260 2880559
16 Western Australia Australia 0 0 0 0 0 0 0 0 ... 1291077 1291077 1293461 1293461 1293461 1293461 1293461 1293461 1293461 1293461

8 rows × 1145 columns

Here we use .loc to index based on a boolean array.

We can now make the sum for all of Australia for each day:

total_australia = simple.loc[simple.loc[:, 'Country/Region'] == 'Australia'].sum(numeric_only=True)
total_australia
1/22/20           0
1/23/20           0
1/24/20           0
1/25/20           0
1/26/20           4
             ...   
3/5/23     11385534
3/6/23     11385534
3/7/23     11385534
3/8/23     11385534
3/9/23     11399460
Length: 1143, dtype: int64

We can do this for all countries by grouping them:

totals = simple.groupby('Country/Region').sum(numeric_only=True)
totals
1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 1/28/20 1/29/20 1/30/20 1/31/20 ... 2/28/23 3/1/23 3/2/23 3/3/23 3/4/23 3/5/23 3/6/23 3/7/23 3/8/23 3/9/23
Country/Region
Afghanistan 0 0 0 0 0 0 0 0 0 0 ... 209322 209340 209358 209362 209369 209390 209406 209436 209451 209451
Albania 0 0 0 0 0 0 0 0 0 0 ... 334391 334408 334408 334427 334427 334427 334427 334427 334443 334457
Algeria 0 0 0 0 0 0 0 0 0 0 ... 271441 271448 271463 271469 271469 271477 271477 271490 271494 271496
Andorra 0 0 0 0 0 0 0 0 0 0 ... 47866 47875 47875 47875 47875 47875 47875 47875 47890 47890
Angola 0 0 0 0 0 0 0 0 0 0 ... 105255 105277 105277 105277 105277 105277 105277 105277 105288 105288
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
West Bank and Gaza 0 0 0 0 0 0 0 0 0 0 ... 703228 703228 703228 703228 703228 703228 703228 703228 703228 703228
Winter Olympics 2022 0 0 0 0 0 0 0 0 0 0 ... 535 535 535 535 535 535 535 535 535 535
Yemen 0 0 0 0 0 0 0 0 0 0 ... 11945 11945 11945 11945 11945 11945 11945 11945 11945 11945
Zambia 0 0 0 0 0 0 0 0 0 0 ... 343012 343012 343079 343079 343079 343135 343135 343135 343135 343135
Zimbabwe 0 0 0 0 0 0 0 0 0 0 ... 263921 264127 264127 264127 264127 264127 264127 264127 264276 264276

201 rows × 1143 columns


Now, we can look at the totals for any date:

totals.loc[:, '6/12/21']
Country/Region
Afghanistan              88740
Albania                 132449
Algeria                 133070
Andorra                  13813
Angola                   36600
                         ...  
West Bank and Gaza      311018
Winter Olympics 2022         0
Yemen                     6857
Zambia                  110332
Zimbabwe                 39852
Name: 6/12/21, Length: 201, dtype: int64

To make it easier to read, let’s order those numbers by decreasing order:

totals.loc[:, '6/12/21'].sort_values(ascending=False)
Country/Region
US                      33573694
India                   29439989
Brazil                  17385952
France                   5799565
Turkey                   5325435
                          ...   
Palau                          0
Tonga                          0
Summer Olympics 2020           0
Tuvalu                         0
Winter Olympics 2022           0
Name: 6/12/21, Length: 201, dtype: int64

We can also index the data for a particular country by indexing a row instead of a column:

totals.loc['Albania', :]
1/22/20         0
1/23/20         0
1/24/20         0
1/25/20         0
1/26/20         0
            ...  
3/5/23     334427
3/6/23     334427
3/7/23     334427
3/8/23     334443
3/9/23     334457
Name: Albania, Length: 1143, dtype: int64

When indexing rows, this syntax can be simplified to:

totals.loc['Albania']
1/22/20         0
1/23/20         0
1/24/20         0
1/25/20         0
1/26/20         0
            ...  
3/5/23     334427
3/6/23     334427
3/7/23     334427
3/8/23     334443
3/9/23     334457
Name: Albania, Length: 1143, dtype: int64

Global totals

Now, what if we want to have the world totals for each day? We calculate the columns totals (i.e. the sum across countries):

totals.sum()
1/22/20          557
1/23/20          657
1/24/20          944
1/25/20         1437
1/26/20         2120
             ...    
3/5/23     676024901
3/6/23     676082941
3/7/23     676213378
3/8/23     676392824
3/9/23     676570149
Length: 1143, dtype: int64

Your turn:

How many confirmed cases were there in Venezuela by March 10, 2021?

First, we need to select the data for Venezuela:

venez = totals.loc['Venezuela']
venez
1/22/20         0
1/23/20         0
1/24/20         0
1/25/20         0
1/26/20         0
            ...  
3/5/23     552051
3/6/23     552125
3/7/23     552157
3/8/23     552157
3/9/23     552162
Name: Venezuela, Length: 1143, dtype: int64

Then, we need to select for the proper date:

answer = venez.loc['3/10/21']
answer
np.int64(143321)

We could have done it at once by indexing the row and column:

totals.loc['Venezuela', '3/10/21']
np.int64(143321)

pandas documentation