# Load the pandas library and create a shorter name for it
import pandas as pd
# The global confirmed cases are available in CSV format at the url:
= "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
url
# pandas allows to read in data from the web directly
= pd.read_csv(url) cases
Data frames with pandas
pandas is a Python library built to manipulate data frames and time series.
For this section, we will use the Covid-19 data from the Johns Hopkins University CSSE repository.
You can visualize this data in a dashboard created by the Johns Hopkins University Center for Systems Science and Engineering.
Setup
First, we need to load the pandas
library and read in the data from the web:
First look at the data
What does our data look like?
cases
Province/State | Country/Region | Lat | Long | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | ... | 2/28/23 | 3/1/23 | 3/2/23 | 3/3/23 | 3/4/23 | 3/5/23 | 3/6/23 | 3/7/23 | 3/8/23 | 3/9/23 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | Afghanistan | 33.939110 | 67.709953 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 209322 | 209340 | 209358 | 209362 | 209369 | 209390 | 209406 | 209436 | 209451 | 209451 |
1 | NaN | Albania | 41.153300 | 20.168300 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 334391 | 334408 | 334408 | 334427 | 334427 | 334427 | 334427 | 334427 | 334443 | 334457 |
2 | NaN | Algeria | 28.033900 | 1.659600 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 271441 | 271448 | 271463 | 271469 | 271469 | 271477 | 271477 | 271490 | 271494 | 271496 |
3 | NaN | Andorra | 42.506300 | 1.521800 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 47866 | 47875 | 47875 | 47875 | 47875 | 47875 | 47875 | 47875 | 47890 | 47890 |
4 | NaN | Angola | -11.202700 | 17.873900 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 105255 | 105277 | 105277 | 105277 | 105277 | 105277 | 105277 | 105277 | 105288 | 105288 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
284 | NaN | West Bank and Gaza | 31.952200 | 35.233200 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 703228 | 703228 | 703228 | 703228 | 703228 | 703228 | 703228 | 703228 | 703228 | 703228 |
285 | NaN | Winter Olympics 2022 | 39.904200 | 116.407400 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 535 | 535 | 535 | 535 | 535 | 535 | 535 | 535 | 535 | 535 |
286 | NaN | Yemen | 15.552727 | 48.516388 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 11945 | 11945 | 11945 | 11945 | 11945 | 11945 | 11945 | 11945 | 11945 | 11945 |
287 | NaN | Zambia | -13.133897 | 27.849332 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 343012 | 343012 | 343079 | 343079 | 343079 | 343135 | 343135 | 343135 | 343135 | 343135 |
288 | NaN | Zimbabwe | -19.015438 | 29.154857 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 263921 | 264127 | 264127 | 264127 | 264127 | 264127 | 264127 | 264127 | 264276 | 264276 |
289 rows × 1147 columns
# Quick summary of the data
cases.describe()
Lat | Long | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | 1/28/20 | 1/29/20 | ... | 2/28/23 | 3/1/23 | 3/2/23 | 3/3/23 | 3/4/23 | 3/5/23 | 3/6/23 | 3/7/23 | 3/8/23 | 3/9/23 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 287.000000 | 287.000000 | 289.000000 | 289.000000 | 289.000000 | 289.000000 | 289.000000 | 289.000000 | 289.000000 | 289.000000 | ... | 2.890000e+02 | 2.890000e+02 | 2.890000e+02 | 2.890000e+02 | 2.890000e+02 | 2.890000e+02 | 2.890000e+02 | 2.890000e+02 | 2.890000e+02 | 2.890000e+02 |
mean | 19.718719 | 22.182084 | 1.927336 | 2.273356 | 3.266436 | 4.972318 | 7.335640 | 10.134948 | 19.307958 | 21.346021 | ... | 2.336755e+06 | 2.337519e+06 | 2.338173e+06 | 2.338805e+06 | 2.338992e+06 | 2.339187e+06 | 2.339387e+06 | 2.339839e+06 | 2.340460e+06 | 2.341073e+06 |
std | 25.956609 | 77.870931 | 26.173664 | 26.270191 | 32.707271 | 45.523871 | 63.623197 | 85.724481 | 210.329649 | 211.628535 | ... | 8.506608e+06 | 8.511285e+06 | 8.514488e+06 | 8.518031e+06 | 8.518408e+06 | 8.518645e+06 | 8.519346e+06 | 8.521641e+06 | 8.524968e+06 | 8.527765e+06 |
min | -71.949900 | -178.116500 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
25% | 4.072192 | -32.823050 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 1.456700e+04 | 1.456700e+04 | 1.456700e+04 | 1.456700e+04 | 1.456700e+04 | 1.456700e+04 | 1.456700e+04 | 1.456700e+04 | 1.456700e+04 | 1.456700e+04 |
50% | 21.512583 | 20.939400 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 1.032480e+05 | 1.032480e+05 | 1.032480e+05 | 1.032480e+05 | 1.032480e+05 | 1.032480e+05 | 1.032480e+05 | 1.032480e+05 | 1.032480e+05 | 1.032480e+05 |
75% | 40.401784 | 89.224350 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 1.051998e+06 | 1.052122e+06 | 1.052247e+06 | 1.052382e+06 | 1.052519e+06 | 1.052664e+06 | 1.052664e+06 | 1.052926e+06 | 1.053068e+06 | 1.053213e+06 |
max | 71.706900 | 178.065000 | 444.000000 | 444.000000 | 549.000000 | 761.000000 | 1058.000000 | 1423.000000 | 3554.000000 | 3554.000000 | ... | 1.034435e+08 | 1.035339e+08 | 1.035898e+08 | 1.036487e+08 | 1.036508e+08 | 1.036470e+08 | 1.036555e+08 | 1.036909e+08 | 1.037558e+08 | 1.038027e+08 |
8 rows × 1145 columns
Of course, this value is meaningless for Lat
and Long
!
# Data types of the various columns
cases.dtypes
Province/State object
Country/Region object
Lat float64
Long float64
1/22/20 int64
...
3/5/23 int64
3/6/23 int64
3/7/23 int64
3/8/23 int64
3/9/23 int64
Length: 1147, dtype: object
cases.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 289 entries, 0 to 288
Columns: 1147 entries, Province/State to 3/9/23
dtypes: float64(2), int64(1143), object(2)
memory usage: 2.5+ MB
cases.shape
(289, 1147)
Cases per country by date
The dataset is a time series: this means that we have the cumulative numbers up to each date.
# Let's get rid of the latitude and longitude to simplify our data
= cases.drop(columns=['Lat', 'Long'])
simple simple
Province/State | Country/Region | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | 1/28/20 | 1/29/20 | ... | 2/28/23 | 3/1/23 | 3/2/23 | 3/3/23 | 3/4/23 | 3/5/23 | 3/6/23 | 3/7/23 | 3/8/23 | 3/9/23 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | Afghanistan | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 209322 | 209340 | 209358 | 209362 | 209369 | 209390 | 209406 | 209436 | 209451 | 209451 |
1 | NaN | Albania | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 334391 | 334408 | 334408 | 334427 | 334427 | 334427 | 334427 | 334427 | 334443 | 334457 |
2 | NaN | Algeria | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 271441 | 271448 | 271463 | 271469 | 271469 | 271477 | 271477 | 271490 | 271494 | 271496 |
3 | NaN | Andorra | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 47866 | 47875 | 47875 | 47875 | 47875 | 47875 | 47875 | 47875 | 47890 | 47890 |
4 | NaN | Angola | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 105255 | 105277 | 105277 | 105277 | 105277 | 105277 | 105277 | 105277 | 105288 | 105288 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
284 | NaN | West Bank and Gaza | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 703228 | 703228 | 703228 | 703228 | 703228 | 703228 | 703228 | 703228 | 703228 | 703228 |
285 | NaN | Winter Olympics 2022 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 535 | 535 | 535 | 535 | 535 | 535 | 535 | 535 | 535 | 535 |
286 | NaN | Yemen | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 11945 | 11945 | 11945 | 11945 | 11945 | 11945 | 11945 | 11945 | 11945 | 11945 |
287 | NaN | Zambia | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 343012 | 343012 | 343079 | 343079 | 343079 | 343135 | 343135 | 343135 | 343135 | 343135 |
288 | NaN | Zimbabwe | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 263921 | 264127 | 264127 | 264127 | 264127 | 264127 | 264127 | 264127 | 264276 | 264276 |
289 rows × 1145 columns
Some countries (e.g. Australia) are split between several provinces or states so we will have to add the values of all their provinces/states to get their totals.
Here is how to make the sum for all Australian states:
Let’s first select all the data for Australia: we want all the rows for which the Country/Region
column is equal to Australia
.
First, we want to select the Country/Region
column. There are several ways to index in pandas.
When indexing columns, one can use square brackets directly after the DataFrame to index:
'Country/Region'] simple[
0 Afghanistan
1 Albania
2 Algeria
3 Andorra
4 Angola
...
284 West Bank and Gaza
285 Winter Olympics 2022
286 Yemen
287 Zambia
288 Zimbabwe
Name: Country/Region, Length: 289, dtype: object
However, it is more efficient to use the .loc
or .iloc
methods.
- Use
.loc
when using labels or booleans:
'Country/Region'] simple.loc[:,
0 Afghanistan
1 Albania
2 Algeria
3 Andorra
4 Angola
...
284 West Bank and Gaza
285 Winter Olympics 2022
286 Yemen
287 Zambia
288 Zimbabwe
Name: Country/Region, Length: 289, dtype: object
- Use
.iloc
when using indices:
1] simple.iloc[:,
0 Afghanistan
1 Albania
2 Algeria
3 Andorra
4 Angola
...
284 West Bank and Gaza
285 Winter Olympics 2022
286 Yemen
287 Zambia
288 Zimbabwe
Name: Country/Region, Length: 289, dtype: object
Country/Region
is the 2nd column, but indexing starts at 0 in Python.
Then we need a conditional to filter the rows for which the value is equal to Australia
:
'Country/Region'] == 'Australia' simple.loc[:,
0 False
1 False
2 False
3 False
4 False
...
284 False
285 False
286 False
287 False
288 False
Name: Country/Region, Length: 289, dtype: bool
Finally, we index, out of our entire data frame, the rows for which that condition returns True
:
'Country/Region'] == 'Australia'] simple.loc[simple.loc[:,
Province/State | Country/Region | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | 1/28/20 | 1/29/20 | ... | 2/28/23 | 3/1/23 | 3/2/23 | 3/3/23 | 3/4/23 | 3/5/23 | 3/6/23 | 3/7/23 | 3/8/23 | 3/9/23 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9 | Australian Capital Territory | Australia | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 232018 | 232018 | 232619 | 232619 | 232619 | 232619 | 232619 | 232619 | 232619 | 232974 |
10 | New South Wales | Australia | 0 | 0 | 0 | 0 | 3 | 4 | 4 | 4 | ... | 3900969 | 3900969 | 3908129 | 3908129 | 3908129 | 3908129 | 3908129 | 3908129 | 3908129 | 3915992 |
11 | Northern Territory | Australia | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 104931 | 104931 | 105021 | 105021 | 105021 | 105021 | 105021 | 105021 | 105021 | 105111 |
12 | Queensland | Australia | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1796633 | 1796633 | 1800236 | 1800236 | 1800236 | 1800236 | 1800236 | 1800236 | 1800236 | 1800236 |
13 | South Australia | Australia | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 880207 | 880207 | 881911 | 881911 | 881911 | 881911 | 881911 | 881911 | 881911 | 883620 |
14 | Tasmania | Australia | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 286264 | 286264 | 286264 | 286897 | 286897 | 286897 | 286897 | 286897 | 286897 | 287507 |
15 | Victoria | Australia | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | ... | 2874262 | 2874262 | 2877260 | 2877260 | 2877260 | 2877260 | 2877260 | 2877260 | 2877260 | 2880559 |
16 | Western Australia | Australia | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1291077 | 1291077 | 1293461 | 1293461 | 1293461 | 1293461 | 1293461 | 1293461 | 1293461 | 1293461 |
8 rows × 1145 columns
Here we use .loc
to index based on a boolean array.
We can now make the sum for all of Australia for each day:
= simple.loc[simple.loc[:, 'Country/Region'] == 'Australia'].sum(numeric_only=True)
total_australia total_australia
1/22/20 0
1/23/20 0
1/24/20 0
1/25/20 0
1/26/20 4
...
3/5/23 11385534
3/6/23 11385534
3/7/23 11385534
3/8/23 11385534
3/9/23 11399460
Length: 1143, dtype: int64
We can do this for all countries by grouping them:
= simple.groupby('Country/Region').sum(numeric_only=True)
totals totals
1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | 1/28/20 | 1/29/20 | 1/30/20 | 1/31/20 | ... | 2/28/23 | 3/1/23 | 3/2/23 | 3/3/23 | 3/4/23 | 3/5/23 | 3/6/23 | 3/7/23 | 3/8/23 | 3/9/23 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Country/Region | |||||||||||||||||||||
Afghanistan | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 209322 | 209340 | 209358 | 209362 | 209369 | 209390 | 209406 | 209436 | 209451 | 209451 |
Albania | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 334391 | 334408 | 334408 | 334427 | 334427 | 334427 | 334427 | 334427 | 334443 | 334457 |
Algeria | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 271441 | 271448 | 271463 | 271469 | 271469 | 271477 | 271477 | 271490 | 271494 | 271496 |
Andorra | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 47866 | 47875 | 47875 | 47875 | 47875 | 47875 | 47875 | 47875 | 47890 | 47890 |
Angola | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 105255 | 105277 | 105277 | 105277 | 105277 | 105277 | 105277 | 105277 | 105288 | 105288 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
West Bank and Gaza | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 703228 | 703228 | 703228 | 703228 | 703228 | 703228 | 703228 | 703228 | 703228 | 703228 |
Winter Olympics 2022 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 535 | 535 | 535 | 535 | 535 | 535 | 535 | 535 | 535 | 535 |
Yemen | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 11945 | 11945 | 11945 | 11945 | 11945 | 11945 | 11945 | 11945 | 11945 | 11945 |
Zambia | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 343012 | 343012 | 343079 | 343079 | 343079 | 343135 | 343135 | 343135 | 343135 | 343135 |
Zimbabwe | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 263921 | 264127 | 264127 | 264127 | 264127 | 264127 | 264127 | 264127 | 264276 | 264276 |
201 rows × 1143 columns
Now, we can look at the totals for any date:
'6/12/21'] totals.loc[:,
Country/Region
Afghanistan 88740
Albania 132449
Algeria 133070
Andorra 13813
Angola 36600
...
West Bank and Gaza 311018
Winter Olympics 2022 0
Yemen 6857
Zambia 110332
Zimbabwe 39852
Name: 6/12/21, Length: 201, dtype: int64
To make it easier to read, let’s order those numbers by decreasing order:
'6/12/21'].sort_values(ascending=False) totals.loc[:,
Country/Region
US 33573694
India 29439989
Brazil 17385952
France 5799565
Turkey 5325435
...
Palau 0
Tonga 0
Summer Olympics 2020 0
Tuvalu 0
Winter Olympics 2022 0
Name: 6/12/21, Length: 201, dtype: int64
We can also index the data for a particular country by indexing a row instead of a column:
'Albania', :] totals.loc[
1/22/20 0
1/23/20 0
1/24/20 0
1/25/20 0
1/26/20 0
...
3/5/23 334427
3/6/23 334427
3/7/23 334427
3/8/23 334443
3/9/23 334457
Name: Albania, Length: 1143, dtype: int64
When indexing rows, this syntax can be simplified to:
'Albania'] totals.loc[
1/22/20 0
1/23/20 0
1/24/20 0
1/25/20 0
1/26/20 0
...
3/5/23 334427
3/6/23 334427
3/7/23 334427
3/8/23 334443
3/9/23 334457
Name: Albania, Length: 1143, dtype: int64
Global totals
Now, what if we want to have the world totals for each day? We calculate the columns totals (i.e. the sum across countries):
sum() totals.
1/22/20 557
1/23/20 657
1/24/20 944
1/25/20 1437
1/26/20 2120
...
3/5/23 676024901
3/6/23 676082941
3/7/23 676213378
3/8/23 676392824
3/9/23 676570149
Length: 1143, dtype: int64
Your turn:
How many confirmed cases were there in Venezuela by March 10, 2021?