Data extraction

Author

Marie-Hélène Burle

It is often useful to focus on sections of the data to plot or analyse. In this section, we will see how to extract various elements of the us_contagious_diseases dataset from the dslabs package.

Load packages

One of the tidyverse packages is very useful for data manipulation: dplyr. Let’s load the dslabs package again as well as dplyr:

library(dslabs)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Indexing

You can extract a subset of the data using their position by indexing. Indexing in R starts with 1 (in many languages, the first index is 0) and it is done with square brackets. Since a data frame has two dimensions, there are two possible indices in the square brackets:

  • the row index,
  • the column index.

You can index a single element:

us_contagious_diseases[1, 1]
[1] Hepatitis A
Levels: Hepatitis A Measles Mumps Pertussis Polio Rubella Smallpox
us_contagious_diseases[1, 2]
[1] Alabama
51 Levels: Alabama Alaska Arizona Arkansas California Colorado ... Wyoming

Or a full row:

us_contagious_diseases[1, ]
      disease   state year weeks_reporting count population
1 Hepatitis A Alabama 1966              50   321    3345787
us_contagious_diseases[3000, ]
     disease                state year weeks_reporting count population
3000 Measles District Of Columbia 1981              27     2     631010

Your turn:

How would you index the year column?

Filtering rows

You can also filter data points based on their values:

us_contagious_diseases |>
  filter(state == "California") |>
  count()
    n
1 315

Your turn:

How many data points are there for the state of Arizona?

us_contagious_diseases |>
  filter(state == "California" & year > 2000)
       disease      state year weeks_reporting count population
1  Hepatitis A California 2001              40  1599   34199784
2  Hepatitis A California 2002              49  1364   34529758
3  Hepatitis A California 2003              46  1045   34861711
4  Hepatitis A California 2004              48   788   35195792
5  Hepatitis A California 2005              49   905   35532154
6  Hepatitis A California 2006              52   688   35870957
7  Hepatitis A California 2007              51   312   36212364
8  Hepatitis A California 2008              52   337   36556548
9  Hepatitis A California 2009              52   239   36903684
10 Hepatitis A California 2010              49   201   37253956
11 Hepatitis A California 2011              49   176   37607525
12     Measles California 2001              40    34   34199784
13     Measles California 2002              33     0   34529758
14       Mumps California 2001              49    37   34199784
15       Mumps California 2002              49    66   34529758
16   Pertussis California 2001              40   440   34199784
17   Pertussis California 2002              43   698   34529758
18   Pertussis California 2003              41   635   34861711
19   Pertussis California 2004              36   498   35195792
20   Pertussis California 2005              45  1609   35532154
21   Pertussis California 2006              42   831   35870957
22   Pertussis California 2007              29    95   36212364
23   Pertussis California 2008              39   276   36556548
24   Pertussis California 2009              40   415   36903684
25   Pertussis California 2010              48  1265   37253956
26   Pertussis California 2011              49  1145   37607525
27     Rubella California 2001               1     0   34199784
28     Rubella California 2002              29     2   34529758
us_contagious_diseases |>
  filter(state == "California" & year > 2000) |>
  arrange(year)
       disease      state year weeks_reporting count population
1  Hepatitis A California 2001              40  1599   34199784
2      Measles California 2001              40    34   34199784
3        Mumps California 2001              49    37   34199784
4    Pertussis California 2001              40   440   34199784
5      Rubella California 2001               1     0   34199784
6  Hepatitis A California 2002              49  1364   34529758
7      Measles California 2002              33     0   34529758
8        Mumps California 2002              49    66   34529758
9    Pertussis California 2002              43   698   34529758
10     Rubella California 2002              29     2   34529758
11 Hepatitis A California 2003              46  1045   34861711
12   Pertussis California 2003              41   635   34861711
13 Hepatitis A California 2004              48   788   35195792
14   Pertussis California 2004              36   498   35195792
15 Hepatitis A California 2005              49   905   35532154
16   Pertussis California 2005              45  1609   35532154
17 Hepatitis A California 2006              52   688   35870957
18   Pertussis California 2006              42   831   35870957
19 Hepatitis A California 2007              51   312   36212364
20   Pertussis California 2007              29    95   36212364
21 Hepatitis A California 2008              52   337   36556548
22   Pertussis California 2008              39   276   36556548
23 Hepatitis A California 2009              52   239   36903684
24   Pertussis California 2009              40   415   36903684
25 Hepatitis A California 2010              49   201   37253956
26   Pertussis California 2010              48  1265   37253956
27 Hepatitis A California 2011              49   176   37607525
28   Pertussis California 2011              49  1145   37607525
us_contagious_diseases |>
  filter(state == "California" & year > 2000) |>
  arrange(count)
       disease      state year weeks_reporting count population
1      Measles California 2002              33     0   34529758
2      Rubella California 2001               1     0   34199784
3      Rubella California 2002              29     2   34529758
4      Measles California 2001              40    34   34199784
5        Mumps California 2001              49    37   34199784
6        Mumps California 2002              49    66   34529758
7    Pertussis California 2007              29    95   36212364
8  Hepatitis A California 2011              49   176   37607525
9  Hepatitis A California 2010              49   201   37253956
10 Hepatitis A California 2009              52   239   36903684
11   Pertussis California 2008              39   276   36556548
12 Hepatitis A California 2007              51   312   36212364
13 Hepatitis A California 2008              52   337   36556548
14   Pertussis California 2009              40   415   36903684
15   Pertussis California 2001              40   440   34199784
16   Pertussis California 2004              36   498   35195792
17   Pertussis California 2003              41   635   34861711
18 Hepatitis A California 2006              52   688   35870957
19   Pertussis California 2002              43   698   34529758
20 Hepatitis A California 2004              48   788   35195792
21   Pertussis California 2006              42   831   35870957
22 Hepatitis A California 2005              49   905   35532154
23 Hepatitis A California 2003              46  1045   34861711
24   Pertussis California 2011              49  1145   37607525
25   Pertussis California 2010              48  1265   37253956
26 Hepatitis A California 2002              49  1364   34529758
27 Hepatitis A California 2001              40  1599   34199784
28   Pertussis California 2005              45  1609   35532154
us_contagious_diseases |>
  filter(state == "California" & year > 2000) |>
  arrange(desc(count))
       disease      state year weeks_reporting count population
1    Pertussis California 2005              45  1609   35532154
2  Hepatitis A California 2001              40  1599   34199784
3  Hepatitis A California 2002              49  1364   34529758
4    Pertussis California 2010              48  1265   37253956
5    Pertussis California 2011              49  1145   37607525
6  Hepatitis A California 2003              46  1045   34861711
7  Hepatitis A California 2005              49   905   35532154
8    Pertussis California 2006              42   831   35870957
9  Hepatitis A California 2004              48   788   35195792
10   Pertussis California 2002              43   698   34529758
11 Hepatitis A California 2006              52   688   35870957
12   Pertussis California 2003              41   635   34861711
13   Pertussis California 2004              36   498   35195792
14   Pertussis California 2001              40   440   34199784
15   Pertussis California 2009              40   415   36903684
16 Hepatitis A California 2008              52   337   36556548
17 Hepatitis A California 2007              51   312   36212364
18   Pertussis California 2008              39   276   36556548
19 Hepatitis A California 2009              52   239   36903684
20 Hepatitis A California 2010              49   201   37253956
21 Hepatitis A California 2011              49   176   37607525
22   Pertussis California 2007              29    95   36212364
23       Mumps California 2002              49    66   34529758
24       Mumps California 2001              49    37   34199784
25     Measles California 2001              40    34   34199784
26     Rubella California 2002              29     2   34529758
27     Measles California 2002              33     0   34529758
28     Rubella California 2001               1     0   34199784

Selecting columns

We saw how to index columns from their position. It is also possible to select them based on their names:

head(us_contagious_diseases$year, 50)
 [1] 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980
[16] 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995
[31] 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
[46] 2011 1966 1967 1968 1969

If you want to select several columns, you can use the select() function from dplyr:

us_contagious_diseases |>
  filter(state == "California" & year > 2000 & disease == "Hepatitis A") |>
  select(year, count, population)
   year count population
1  2001  1599   34199784
2  2002  1364   34529758
3  2003  1045   34861711
4  2004   788   35195792
5  2005   905   35532154
6  2006   688   35870957
7  2007   312   36212364
8  2008   337   36556548
9  2009   239   36903684
10 2010   201   37253956
11 2011   176   37607525

Grouping data

It is often useful to group data by categories to compute some summary statistics.

For instance, we can group by year and calculate the total numbers of infections:

us_contagious_diseases |>
  group_by(year) |>
  summarise(total = sum(count))
# A tibble: 84 × 2
    year  total
   <dbl>  <dbl>
 1  1928 524563
 2  1929 380196
 3  1930 439289
 4  1931 482886
 5  1932 404683
 6  1933 391485
 7  1934 739509
 8  1935 739224
 9  1936 292530
10  1937 314425
# ℹ 74 more rows

Alternatively, we can group by state and get the totals:

us_contagious_diseases |>
  group_by(state) |> 
  summarise(total = sum(count))
# A tibble: 51 × 2
   state                  total
   <fct>                  <dbl>
 1 Alabama               257979
 2 Alaska                 29136
 3 Arizona               240233
 4 Arkansas              177556
 5 California           1906067
 6 Colorado              322845
 7 Connecticut           463148
 8 Delaware               44427
 9 District Of Columbia   77012
10 Florida               268383
# ℹ 41 more rows

We can also group by year and state and get the totals:

us_contagious_diseases |>
  group_by(year, state) |> 
  summarise(total = sum(count))
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
# A tibble: 4,284 × 3
# Groups:   year [84]
    year state                total
   <dbl> <fct>                <dbl>
 1  1928 Alabama               9246
 2  1928 Alaska                   0
 3  1928 Arizona               1268
 4  1928 Arkansas              9157
 5  1928 California            4960
 6  1928 Colorado              2510
 7  1928 Connecticut          10247
 8  1928 Delaware               607
 9  1928 District Of Columbia  2609
10  1928 Florida               1892
# ℹ 4,274 more rows