library(dslabs)
library(dplyr)
Data extraction
It is often useful to focus on sections of the data to plot or analyse. In this section, we will see how to extract various elements of the us_contagious_diseases
dataset from the dslabs package.
Load packages
One of the tidyverse packages is very useful for data manipulation: dplyr. Let’s load the dslabs
package again as well as dplyr
:
Indexing
You can extract a subset of the data using their position by indexing. Indexing in R starts with 1
(in many languages, the first index is 0
) and it is done with square brackets. Since a data frame has two dimensions, there are two possible indices in the square brackets:
- the row index,
- the column index.
You can index a single element:
1, 1] us_contagious_diseases[
[1] Hepatitis A
Levels: Hepatitis A Measles Mumps Pertussis Polio Rubella Smallpox
1, 2] us_contagious_diseases[
[1] Alabama
51 Levels: Alabama Alaska Arizona Arkansas California Colorado ... Wyoming
Or a full row:
1, ] us_contagious_diseases[
disease state year weeks_reporting count population
1 Hepatitis A Alabama 1966 50 321 3345787
3000, ] us_contagious_diseases[
disease state year weeks_reporting count population
3000 Measles District Of Columbia 1981 27 2 631010
Your turn:
How would you index the year column?
Filtering rows
You can also filter data points based on their values:
|>
us_contagious_diseases filter(state == "California") |>
count()
n
1 315
Your turn:
How many data points are there for the state of Arizona?
|>
us_contagious_diseases filter(state == "California" & year > 2000)
disease state year weeks_reporting count population
1 Hepatitis A California 2001 40 1599 34199784
2 Hepatitis A California 2002 49 1364 34529758
3 Hepatitis A California 2003 46 1045 34861711
4 Hepatitis A California 2004 48 788 35195792
5 Hepatitis A California 2005 49 905 35532154
6 Hepatitis A California 2006 52 688 35870957
7 Hepatitis A California 2007 51 312 36212364
8 Hepatitis A California 2008 52 337 36556548
9 Hepatitis A California 2009 52 239 36903684
10 Hepatitis A California 2010 49 201 37253956
11 Hepatitis A California 2011 49 176 37607525
12 Measles California 2001 40 34 34199784
13 Measles California 2002 33 0 34529758
14 Mumps California 2001 49 37 34199784
15 Mumps California 2002 49 66 34529758
16 Pertussis California 2001 40 440 34199784
17 Pertussis California 2002 43 698 34529758
18 Pertussis California 2003 41 635 34861711
19 Pertussis California 2004 36 498 35195792
20 Pertussis California 2005 45 1609 35532154
21 Pertussis California 2006 42 831 35870957
22 Pertussis California 2007 29 95 36212364
23 Pertussis California 2008 39 276 36556548
24 Pertussis California 2009 40 415 36903684
25 Pertussis California 2010 48 1265 37253956
26 Pertussis California 2011 49 1145 37607525
27 Rubella California 2001 1 0 34199784
28 Rubella California 2002 29 2 34529758
|>
us_contagious_diseases filter(state == "California" & year > 2000) |>
arrange(year)
disease state year weeks_reporting count population
1 Hepatitis A California 2001 40 1599 34199784
2 Measles California 2001 40 34 34199784
3 Mumps California 2001 49 37 34199784
4 Pertussis California 2001 40 440 34199784
5 Rubella California 2001 1 0 34199784
6 Hepatitis A California 2002 49 1364 34529758
7 Measles California 2002 33 0 34529758
8 Mumps California 2002 49 66 34529758
9 Pertussis California 2002 43 698 34529758
10 Rubella California 2002 29 2 34529758
11 Hepatitis A California 2003 46 1045 34861711
12 Pertussis California 2003 41 635 34861711
13 Hepatitis A California 2004 48 788 35195792
14 Pertussis California 2004 36 498 35195792
15 Hepatitis A California 2005 49 905 35532154
16 Pertussis California 2005 45 1609 35532154
17 Hepatitis A California 2006 52 688 35870957
18 Pertussis California 2006 42 831 35870957
19 Hepatitis A California 2007 51 312 36212364
20 Pertussis California 2007 29 95 36212364
21 Hepatitis A California 2008 52 337 36556548
22 Pertussis California 2008 39 276 36556548
23 Hepatitis A California 2009 52 239 36903684
24 Pertussis California 2009 40 415 36903684
25 Hepatitis A California 2010 49 201 37253956
26 Pertussis California 2010 48 1265 37253956
27 Hepatitis A California 2011 49 176 37607525
28 Pertussis California 2011 49 1145 37607525
|>
us_contagious_diseases filter(state == "California" & year > 2000) |>
arrange(count)
disease state year weeks_reporting count population
1 Measles California 2002 33 0 34529758
2 Rubella California 2001 1 0 34199784
3 Rubella California 2002 29 2 34529758
4 Measles California 2001 40 34 34199784
5 Mumps California 2001 49 37 34199784
6 Mumps California 2002 49 66 34529758
7 Pertussis California 2007 29 95 36212364
8 Hepatitis A California 2011 49 176 37607525
9 Hepatitis A California 2010 49 201 37253956
10 Hepatitis A California 2009 52 239 36903684
11 Pertussis California 2008 39 276 36556548
12 Hepatitis A California 2007 51 312 36212364
13 Hepatitis A California 2008 52 337 36556548
14 Pertussis California 2009 40 415 36903684
15 Pertussis California 2001 40 440 34199784
16 Pertussis California 2004 36 498 35195792
17 Pertussis California 2003 41 635 34861711
18 Hepatitis A California 2006 52 688 35870957
19 Pertussis California 2002 43 698 34529758
20 Hepatitis A California 2004 48 788 35195792
21 Pertussis California 2006 42 831 35870957
22 Hepatitis A California 2005 49 905 35532154
23 Hepatitis A California 2003 46 1045 34861711
24 Pertussis California 2011 49 1145 37607525
25 Pertussis California 2010 48 1265 37253956
26 Hepatitis A California 2002 49 1364 34529758
27 Hepatitis A California 2001 40 1599 34199784
28 Pertussis California 2005 45 1609 35532154
|>
us_contagious_diseases filter(state == "California" & year > 2000) |>
arrange(desc(count))
disease state year weeks_reporting count population
1 Pertussis California 2005 45 1609 35532154
2 Hepatitis A California 2001 40 1599 34199784
3 Hepatitis A California 2002 49 1364 34529758
4 Pertussis California 2010 48 1265 37253956
5 Pertussis California 2011 49 1145 37607525
6 Hepatitis A California 2003 46 1045 34861711
7 Hepatitis A California 2005 49 905 35532154
8 Pertussis California 2006 42 831 35870957
9 Hepatitis A California 2004 48 788 35195792
10 Pertussis California 2002 43 698 34529758
11 Hepatitis A California 2006 52 688 35870957
12 Pertussis California 2003 41 635 34861711
13 Pertussis California 2004 36 498 35195792
14 Pertussis California 2001 40 440 34199784
15 Pertussis California 2009 40 415 36903684
16 Hepatitis A California 2008 52 337 36556548
17 Hepatitis A California 2007 51 312 36212364
18 Pertussis California 2008 39 276 36556548
19 Hepatitis A California 2009 52 239 36903684
20 Hepatitis A California 2010 49 201 37253956
21 Hepatitis A California 2011 49 176 37607525
22 Pertussis California 2007 29 95 36212364
23 Mumps California 2002 49 66 34529758
24 Mumps California 2001 49 37 34199784
25 Measles California 2001 40 34 34199784
26 Rubella California 2002 29 2 34529758
27 Measles California 2002 33 0 34529758
28 Rubella California 2001 1 0 34199784
Selecting columns
We saw how to index columns from their position. It is also possible to select them based on their names:
head(us_contagious_diseases$year, 50)
[1] 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980
[16] 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995
[31] 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
[46] 2011 1966 1967 1968 1969
If you want to select several columns, you can use the select()
function from dplyr
:
|>
us_contagious_diseases filter(state == "California" & year > 2000 & disease == "Hepatitis A") |>
select(year, count, population)
year count population
1 2001 1599 34199784
2 2002 1364 34529758
3 2003 1045 34861711
4 2004 788 35195792
5 2005 905 35532154
6 2006 688 35870957
7 2007 312 36212364
8 2008 337 36556548
9 2009 239 36903684
10 2010 201 37253956
11 2011 176 37607525
Grouping data
It is often useful to group data by categories to compute some summary statistics.
For instance, we can group by year and calculate the total numbers of infections:
|>
us_contagious_diseases group_by(year) |>
summarise(total = sum(count))
# A tibble: 84 × 2
year total
<dbl> <dbl>
1 1928 524563
2 1929 380196
3 1930 439289
4 1931 482886
5 1932 404683
6 1933 391485
7 1934 739509
8 1935 739224
9 1936 292530
10 1937 314425
# ℹ 74 more rows
Alternatively, we can group by state and get the totals:
|>
us_contagious_diseases group_by(state) |>
summarise(total = sum(count))
# A tibble: 51 × 2
state total
<fct> <dbl>
1 Alabama 257979
2 Alaska 29136
3 Arizona 240233
4 Arkansas 177556
5 California 1906067
6 Colorado 322845
7 Connecticut 463148
8 Delaware 44427
9 District Of Columbia 77012
10 Florida 268383
# ℹ 41 more rows
We can also group by year and state and get the totals:
|>
us_contagious_diseases group_by(year, state) |>
summarise(total = sum(count))
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
# A tibble: 4,284 × 3
# Groups: year [84]
year state total
<dbl> <fct> <dbl>
1 1928 Alabama 9246
2 1928 Alaska 0
3 1928 Arizona 1268
4 1928 Arkansas 9157
5 1928 California 4960
6 1928 Colorado 2510
7 1928 Connecticut 10247
8 1928 Delaware 607
9 1928 District Of Columbia 2609
10 1928 Florida 1892
# ℹ 4,274 more rows