DataFrames with pandas

Author

Marie-Hélène Burle

This workshop is an introduction to pandas, a Python library to manipulate DataFrames and time series, via an exploration of the Covid-19 data from Johns Hopkins University CSSE repository.

Setup

First, we need to load the pandas library and read in the data from the web:

# Load the pandas library and create a shorter name for it
import pandas as pd

# The global confirmed cases are available in CSV format at the url:
url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"

# pandas allows to read in data from the web directly
cases = pd.read_csv(url)

First look at the data

What does our data look like?

cases

	Province/State	Country/Region	Lat	Long	1/22/20	1/23/20	1/24/20	1/25/20	1/26/20	1/27/20	...	2/28/23	3/1/23	3/2/23	3/3/23	3/4/23	3/5/23	3/6/23	3/7/23	3/8/23	3/9/23
0	NaN	Afghanistan	33.939110	67.709953	0	0	0	0	0	0	...	209322	209340	209358	209362	209369	209390	209406	209436	209451	209451
1	NaN	Albania	41.153300	20.168300	0	0	0	0	0	0	...	334391	334408	334408	334427	334427	334427	334427	334427	334443	334457
2	NaN	Algeria	28.033900	1.659600	0	0	0	0	0	0	...	271441	271448	271463	271469	271469	271477	271477	271490	271494	271496
3	NaN	Andorra	42.506300	1.521800	0	0	0	0	0	0	...	47866	47875	47875	47875	47875	47875	47875	47875	47890	47890
4	NaN	Angola	-11.202700	17.873900	0	0	0	0	0	0	...	105255	105277	105277	105277	105277	105277	105277	105277	105288	105288
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
284	NaN	West Bank and Gaza	31.952200	35.233200	0	0	0	0	0	0	...	703228	703228	703228	703228	703228	703228	703228	703228	703228	703228
285	NaN	Winter Olympics 2022	39.904200	116.407400	0	0	0	0	0	0	...	535	535	535	535	535	535	535	535	535	535
286	NaN	Yemen	15.552727	48.516388	0	0	0	0	0	0	...	11945	11945	11945	11945	11945	11945	11945	11945	11945	11945
287	NaN	Zambia	-13.133897	27.849332	0	0	0	0	0	0	...	343012	343012	343079	343079	343079	343135	343135	343135	343135	343135
288	NaN	Zimbabwe	-19.015438	29.154857	0	0	0	0	0	0	...	263921	264127	264127	264127	264127	264127	264127	264127	264276	264276

289 rows × 1147 columns

# Quick summary of the data
cases.describe()

	Lat	Long	1/22/20	1/23/20	1/24/20	1/25/20	1/26/20	1/27/20	1/28/20	1/29/20	...	2/28/23	3/1/23	3/2/23	3/3/23	3/4/23	3/5/23	3/6/23	3/7/23	3/8/23	3/9/23
count	287.000000	287.000000	289.000000	289.000000	289.000000	289.000000	289.000000	289.000000	289.000000	289.000000	...	2.890000e+02	2.890000e+02	2.890000e+02	2.890000e+02	2.890000e+02	2.890000e+02	2.890000e+02	2.890000e+02	2.890000e+02	2.890000e+02
mean	19.718719	22.182084	1.927336	2.273356	3.266436	4.972318	7.335640	10.134948	19.307958	21.346021	...	2.336755e+06	2.337519e+06	2.338173e+06	2.338805e+06	2.338992e+06	2.339187e+06	2.339387e+06	2.339839e+06	2.340460e+06	2.341073e+06
std	25.956609	77.870931	26.173664	26.270191	32.707271	45.523871	63.623197	85.724481	210.329649	211.628535	...	8.506608e+06	8.511285e+06	8.514488e+06	8.518031e+06	8.518408e+06	8.518645e+06	8.519346e+06	8.521641e+06	8.524968e+06	8.527765e+06
min	-71.949900	-178.116500	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00
25%	4.072192	-32.823050	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	...	1.456700e+04	1.456700e+04	1.456700e+04	1.456700e+04	1.456700e+04	1.456700e+04	1.456700e+04	1.456700e+04	1.456700e+04	1.456700e+04
50%	21.512583	20.939400	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	...	1.032480e+05	1.032480e+05	1.032480e+05	1.032480e+05	1.032480e+05	1.032480e+05	1.032480e+05	1.032480e+05	1.032480e+05	1.032480e+05
75%	40.401784	89.224350	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	...	1.051998e+06	1.052122e+06	1.052247e+06	1.052382e+06	1.052519e+06	1.052664e+06	1.052664e+06	1.052926e+06	1.053068e+06	1.053213e+06
max	71.706900	178.065000	444.000000	444.000000	549.000000	761.000000	1058.000000	1423.000000	3554.000000	3554.000000	...	1.034435e+08	1.035339e+08	1.035898e+08	1.036487e+08	1.036508e+08	1.036470e+08	1.036555e+08	1.036909e+08	1.037558e+08	1.038027e+08

8 rows × 1145 columns

Of course, this value is meaningless for Lat and Long!

# Data types of the various columns
cases.dtypes

Province/State     object
Country/Region     object
Lat               float64
Long              float64
1/22/20             int64
                   ...   
3/5/23              int64
3/6/23              int64
3/7/23              int64
3/8/23              int64
3/9/23              int64
Length: 1147, dtype: object

cases.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 289 entries, 0 to 288
Columns: 1147 entries, Province/State to 3/9/23
dtypes: float64(2), int64(1143), object(2)
memory usage: 2.5+ MB

cases.shape

(289, 1147)

Cases per country by date

The dataset is a time series: this means that we have the cumulative numbers up to each date.

# Let's get rid of the latitude and longitude to simplify our data
simple = cases.drop(columns=['Lat', 'Long'])
simple

	Province/State	Country/Region	1/22/20	1/23/20	1/24/20	1/25/20	1/26/20	1/27/20	1/28/20	1/29/20	...	2/28/23	3/1/23	3/2/23	3/3/23	3/4/23	3/5/23	3/6/23	3/7/23	3/8/23	3/9/23
0	NaN	Afghanistan	0	0	0	0	0	0	0	0	...	209322	209340	209358	209362	209369	209390	209406	209436	209451	209451
1	NaN	Albania	0	0	0	0	0	0	0	0	...	334391	334408	334408	334427	334427	334427	334427	334427	334443	334457
2	NaN	Algeria	0	0	0	0	0	0	0	0	...	271441	271448	271463	271469	271469	271477	271477	271490	271494	271496
3	NaN	Andorra	0	0	0	0	0	0	0	0	...	47866	47875	47875	47875	47875	47875	47875	47875	47890	47890
4	NaN	Angola	0	0	0	0	0	0	0	0	...	105255	105277	105277	105277	105277	105277	105277	105277	105288	105288
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
284	NaN	West Bank and Gaza	0	0	0	0	0	0	0	0	...	703228	703228	703228	703228	703228	703228	703228	703228	703228	703228
285	NaN	Winter Olympics 2022	0	0	0	0	0	0	0	0	...	535	535	535	535	535	535	535	535	535	535
286	NaN	Yemen	0	0	0	0	0	0	0	0	...	11945	11945	11945	11945	11945	11945	11945	11945	11945	11945
287	NaN	Zambia	0	0	0	0	0	0	0	0	...	343012	343012	343079	343079	343079	343135	343135	343135	343135	343135
288	NaN	Zimbabwe	0	0	0	0	0	0	0	0	...	263921	264127	264127	264127	264127	264127	264127	264127	264276	264276

289 rows × 1145 columns

Some countries (e.g. Australia) are split between several provinces or states so we will have to add the values of all their provinces/states to get their totals.

Here is how to make the sum for all Australian states:

Let’s first select all the data for Australia: we want all the rows for which the Country/Region column is equal to Australia.

First, we want to select the Country/Region column. There are several ways to index in pandas.

When indexing columns, one can use square brackets directly after the DataFrame to index:

simple['Country/Region']

0               Afghanistan
1                   Albania
2                   Algeria
3                   Andorra
4                    Angola
               ...         
284      West Bank and Gaza
285    Winter Olympics 2022
286                   Yemen
287                  Zambia
288                Zimbabwe
Name: Country/Region, Length: 289, dtype: object

However, it is more efficient to use the .loc or .iloc methods.

Use .loc when using labels or booleans:

simple.loc[:, 'Country/Region']

0               Afghanistan
1                   Albania
2                   Algeria
3                   Andorra
4                    Angola
               ...         
284      West Bank and Gaza
285    Winter Olympics 2022
286                   Yemen
287                  Zambia
288                Zimbabwe
Name: Country/Region, Length: 289, dtype: object

Use .iloc when using indices:

simple.iloc[:, 1]

0               Afghanistan
1                   Albania
2                   Algeria
3                   Andorra
4                    Angola
               ...         
284      West Bank and Gaza
285    Winter Olympics 2022
286                   Yemen
287                  Zambia
288                Zimbabwe
Name: Country/Region, Length: 289, dtype: object

Country/Region is the 2nd column, but indexing starts at 0 in Python.

Then we need a conditional to filter the rows for which the value is equal to Australia:

simple.loc[:, 'Country/Region'] == 'Australia'

0      False
1      False
2      False
3      False
4      False
       ...  
284    False
285    False
286    False
287    False
288    False
Name: Country/Region, Length: 289, dtype: bool

Finally, we index, out of our entire DataFrame, the rows for which that condition returns True:

simple.loc[simple.loc[:, 'Country/Region'] == 'Australia']

	Province/State	Country/Region	1/26/20	1/27/20	1/28/20	1/29/20	...	2/28/23	3/1/23	3/2/23	3/3/23	3/4/23	3/5/23	3/6/23	3/7/23	3/8/23	3/9/23
9	Australian Capital Territory	Australia	0	0	0	0	...	232018	232018	232619	232619	232619	232619	232619	232619	232619	232974
10	New South Wales	Australia	3	4	4	4	...	3900969	3900969	3908129	3908129	3908129	3908129	3908129	3908129	3908129	3915992
11	Northern Territory	Australia	0	0	0	0	...	104931	104931	105021	105021	105021	105021	105021	105021	105021	105111
12	Queensland	Australia	0	0	0	1	...	1796633	1796633	1800236	1800236	1800236	1800236	1800236	1800236	1800236	1800236
13	South Australia	Australia	0	0	0	0	...	880207	880207	881911	881911	881911	881911	881911	881911	881911	883620
14	Tasmania	Australia	0	0	0	0	...	286264	286264	286264	286897	286897	286897	286897	286897	286897	287507
15	Victoria	Australia	1	1	1	1	...	2874262	2874262	2877260	2877260	2877260	2877260	2877260	2877260	2877260	2880559
16	Western Australia	Australia	0	0	0	0	...	1291077	1291077	1293461	1293461	1293461	1293461	1293461	1293461	1293461	1293461

8 rows × 1145 columns

Here we use .loc to index based on a boolean array.

We can now make the sum for all of Australia for each day:

total_australia = simple.loc[simple.loc[:, 'Country/Region'] == 'Australia'].sum(numeric_only=True)
total_australia

1/22/20           0
1/23/20           0
1/24/20           0
1/25/20           0
1/26/20           4
             ...   
3/5/23     11385534
3/6/23     11385534
3/7/23     11385534
3/8/23     11385534
3/9/23     11399460
Length: 1143, dtype: int64

We can do this for all countries by grouping them:

totals = simple.groupby('Country/Region').sum(numeric_only=True)
totals

	1/22/20	1/23/20	1/24/20	1/25/20	1/26/20	1/27/20	1/28/20	1/29/20	1/30/20	1/31/20	...	2/28/23	3/1/23	3/2/23	3/3/23	3/4/23	3/5/23	3/6/23	3/7/23	3/8/23	3/9/23
Country/Region
Afghanistan	0	0	0	0	0	0	0	0	0	0	...	209322	209340	209358	209362	209369	209390	209406	209436	209451	209451
Albania	0	0	0	0	0	0	0	0	0	0	...	334391	334408	334408	334427	334427	334427	334427	334427	334443	334457
Algeria	0	0	0	0	0	0	0	0	0	0	...	271441	271448	271463	271469	271469	271477	271477	271490	271494	271496
Andorra	0	0	0	0	0	0	0	0	0	0	...	47866	47875	47875	47875	47875	47875	47875	47875	47890	47890
Angola	0	0	0	0	0	0	0	0	0	0	...	105255	105277	105277	105277	105277	105277	105277	105277	105288	105288
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
West Bank and Gaza	0	0	0	0	0	0	0	0	0	0	...	703228	703228	703228	703228	703228	703228	703228	703228	703228	703228
Winter Olympics 2022	0	0	0	0	0	0	0	0	0	0	...	535	535	535	535	535	535	535	535	535	535
Yemen	0	0	0	0	0	0	0	0	0	0	...	11945	11945	11945	11945	11945	11945	11945	11945	11945	11945
Zambia	0	0	0	0	0	0	0	0	0	0	...	343012	343012	343079	343079	343079	343135	343135	343135	343135	343135
Zimbabwe	0	0	0	0	0	0	0	0	0	0	...	263921	264127	264127	264127	264127	264127	264127	264127	264276	264276

201 rows × 1143 columns

Now, we can look at the totals for any date:

totals.loc[:, '6/12/21']

Country/Region
Afghanistan              88740
Albania                 132449
Algeria                 133070
Andorra                  13813
Angola                   36600
                         ...  
West Bank and Gaza      311018
Winter Olympics 2022         0
Yemen                     6857
Zambia                  110332
Zimbabwe                 39852
Name: 6/12/21, Length: 201, dtype: int64

To make it easier to read, let’s order those numbers by decreasing order:

totals.loc[:, '6/12/21'].sort_values(ascending=False)

Country/Region
US                      33573694
India                   29439989
Brazil                  17385952
France                   5799565
Turkey                   5325435
                          ...   
Palau                          0
Tonga                          0
Summer Olympics 2020           0
Tuvalu                         0
Winter Olympics 2022           0
Name: 6/12/21, Length: 201, dtype: int64

We can also index the data for a particular country by indexing a row instead of a column:

totals.loc['Albania', :]

1/22/20         0
1/23/20         0
1/24/20         0
1/25/20         0
1/26/20         0
            ...  
3/5/23     334427
3/6/23     334427
3/7/23     334427
3/8/23     334443
3/9/23     334457
Name: Albania, Length: 1143, dtype: int64

When indexing rows, this syntax can be simplified to:

totals.loc['Albania']

1/22/20         0
1/23/20         0
1/24/20         0
1/25/20         0
1/26/20         0
            ...  
3/5/23     334427
3/6/23     334427
3/7/23     334427
3/8/23     334443
3/9/23     334457
Name: Albania, Length: 1143, dtype: int64

Global totals

Now, what if we want to have the world totals for each day? We calculate the columns totals (i.e. the sum across countries):

totals.sum()

1/22/20          557
1/23/20          657
1/24/20          944
1/25/20         1437
1/26/20         2120
             ...    
3/5/23     676024901
3/6/23     676082941
3/7/23     676213378
3/8/23     676392824
3/9/23     676570149
Length: 1143, dtype: int64

Your turn:

How many confirmed cases were there in Venezuela by March 10, 2021?

Solution

First, we need to select the data for Venezuela:

venez = totals.loc['Venezuela']
venez

1/22/20         0
1/23/20         0
1/24/20         0
1/25/20         0
1/26/20         0
            ...  
3/5/23     552051
3/6/23     552125
3/7/23     552157
3/8/23     552157
3/9/23     552162
Name: Venezuela, Length: 1143, dtype: int64

Then, we need to select for the proper date:

answer = venez.loc['3/10/21']
answer

np.int64(143321)

We could have done it at once by indexing the row and column:

totals.loc['Venezuela', '3/10/21']

np.int64(143321)

Setup

First look at the data

Cases per country by date

Global totals

pandas documentation