Data exploration

Author

Marie-Hélène Burle

An important first step of data analysis is to have a look at the data. In this section, we will explore the us_contagious_diseases dataset from the dslabs package.

Load the dslabs package

This package contains a number of datasets. To access any of them, we first need to load the package:

library(dslabs)

library() is a function:

class(library)
[1] "function"

Functions are the “verbs” of programming languages. They do things.

library() is a function that loads packages into the current session so that their content becomes available.

dslabs is the argument that we pass to the function library(): it is the particular package that we are loading into the session here.

class() is also a function: it tells what class an object belongs to. In class(library), library is the argument of the function class().
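
As a small sketch of the same idea, class() can be applied to any object, not only to library. The values below are arbitrary illustrations chosen for this example and are not part of the dataset:

```r
# class() reports the class of whatever object it is given.
class(class)      # class() is itself a function: "function"
class("dslabs")   # a piece of text is a character vector: "character"
class(TRUE)       # a logical value: "logical"
```

In each call, the object between the parentheses is the argument passed to class().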

Printing data to screen

To print all the data, we would simply run us_contagious_diseases. There are a lot of rows, however, so we will only print a subset to the screen.

To print the first six rows, we use the function head(), using our data as the argument:

head(us_contagious_diseases)
      disease   state year weeks_reporting count population
1 Hepatitis A Alabama 1966              50   321    3345787
2 Hepatitis A Alabama 1967              49   291    3364130
3 Hepatitis A Alabama 1968              52   314    3386068
4 Hepatitis A Alabama 1969              49   380    3412450
5 Hepatitis A Alabama 1970              51   413    3444165
6 Hepatitis A Alabama 1971              51   378    3481798

If you look at the documentation of the head() function (by running ?head), you can see that it accepts another argument that allows us to set the number of rows to print.

Let’s print the first 15 rows:

head(us_contagious_diseases, n = 15)
       disease   state year weeks_reporting count population
1  Hepatitis A Alabama 1966              50   321    3345787
2  Hepatitis A Alabama 1967              49   291    3364130
3  Hepatitis A Alabama 1968              52   314    3386068
4  Hepatitis A Alabama 1969              49   380    3412450
5  Hepatitis A Alabama 1970              51   413    3444165
6  Hepatitis A Alabama 1971              51   378    3481798
7  Hepatitis A Alabama 1972              45   342    3524543
8  Hepatitis A Alabama 1973              45   467    3571209
9  Hepatitis A Alabama 1974              45   244    3620548
10 Hepatitis A Alabama 1975              46   286    3671246
11 Hepatitis A Alabama 1976              50   220    3721914
12 Hepatitis A Alabama 1977              43   206    3771085
13 Hepatitis A Alabama 1978              41   203    3817217
14 Hepatitis A Alabama 1979              47   257    3858703
15 Hepatitis A Alabama 1980              37   200    3893888

By default, n = 6, which is why head() prints six rows unless we specify otherwise. The L in the documentation of the head() function (n = 6L) means that 6 is an integer. You can ignore this for now.
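
For the curious, here is a minimal sketch of what the L suffix changes. It uses only base R and arbitrary values:

```r
# With the L suffix, R stores the number as an integer.
class(6L)   # "integer"

# Without it, R stores the number as a double, reported as "numeric".
class(6)    # "numeric"

# The two values still compare as equal.
6L == 6     # TRUE
```

This is why head(us_contagious_diseases, n = 15) works even though 15 is written without the L.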

Arguments can be passed to functions as positional arguments (in which case they have to follow the order of the function definition) or as named arguments (in which case you need to use the argument names).

That means that if we keep the arguments in the right order, we can omit the name of the argument (n here) and only write its value (15):

head(us_contagious_diseases, 15)
       disease   state year weeks_reporting count population
1  Hepatitis A Alabama 1966              50   321    3345787
2  Hepatitis A Alabama 1967              49   291    3364130
3  Hepatitis A Alabama 1968              52   314    3386068
4  Hepatitis A Alabama 1969              49   380    3412450
5  Hepatitis A Alabama 1970              51   413    3444165
6  Hepatitis A Alabama 1971              51   378    3481798
7  Hepatitis A Alabama 1972              45   342    3524543
8  Hepatitis A Alabama 1973              45   467    3571209
9  Hepatitis A Alabama 1974              45   244    3620548
10 Hepatitis A Alabama 1975              46   286    3671246
11 Hepatitis A Alabama 1976              50   220    3721914
12 Hepatitis A Alabama 1977              43   206    3771085
13 Hepatitis A Alabama 1978              41   203    3817217
14 Hepatitis A Alabama 1979              47   257    3858703
15 Hepatitis A Alabama 1980              37   200    3893888

If the arguments are given to the function out of order however, we do need to use their names.

This won’t work because, without names, R matches arguments by position: 15 is taken as the data and the data frame is taken as n, which must be a number:

head(15, us_contagious_diseases)
Error in head.default(15, us_contagious_diseases): invalid 'n' - must be numeric, possibly NA.

This however works:

head(n = 15, us_contagious_diseases)
       disease   state year weeks_reporting count population
1  Hepatitis A Alabama 1966              50   321    3345787
2  Hepatitis A Alabama 1967              49   291    3364130
3  Hepatitis A Alabama 1968              52   314    3386068
4  Hepatitis A Alabama 1969              49   380    3412450
5  Hepatitis A Alabama 1970              51   413    3444165
6  Hepatitis A Alabama 1971              51   378    3481798
7  Hepatitis A Alabama 1972              45   342    3524543
8  Hepatitis A Alabama 1973              45   467    3571209
9  Hepatitis A Alabama 1974              45   244    3620548
10 Hepatitis A Alabama 1975              46   286    3671246
11 Hepatitis A Alabama 1976              50   220    3721914
12 Hepatitis A Alabama 1977              43   206    3771085
13 Hepatitis A Alabama 1978              41   203    3817217
14 Hepatitis A Alabama 1979              47   257    3858703
15 Hepatitis A Alabama 1980              37   200    3893888

We can also print the last 6 rows of the data:

tail(us_contagious_diseases)
       disease   state year weeks_reporting count population
16060 Smallpox Wyoming 1947              49     1     276297
16061 Smallpox Wyoming 1948              24     1     280803
16062 Smallpox Wyoming 1949               0     0     285544
16063 Smallpox Wyoming 1950               1     2     290529
16064 Smallpox Wyoming 1951               1     1     295744
16065 Smallpox Wyoming 1952               1     1     301083

Your turn:

How would you print the last 10 rows of the data?

Structure of the data object

us_contagious_diseases is an R object containing the dataset, but what kind of object is it?

class(us_contagious_diseases)
[1] "data.frame"

Our data is in a class of R object called a data frame.

We can get its full structure with:

str(us_contagious_diseases)
'data.frame':   16065 obs. of  6 variables:
 $ disease        : Factor w/ 7 levels "Hepatitis A",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ state          : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ year           : num  1966 1967 1968 1969 1970 ...
 $ weeks_reporting: num  50 49 52 49 51 51 45 45 45 46 ...
 $ count          : num  321 291 314 380 413 378 342 467 244 286 ...
 $ population     : num  3345787 3364130 3386068 3412450 3444165 ...
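
The disease and state columns are factors, which is why str() shows them as a set of levels followed by integers such as 1 1 1. The following sketch illustrates this with a small made-up factor (hypothetical values, not taken from the dataset):

```r
# A toy factor built from four made-up observations.
disease <- factor(c("Measles", "Mumps", "Measles", "Polio"))

levels(disease)      # the distinct categories, sorted alphabetically
nlevels(disease)     # how many categories there are: 3
as.integer(disease)  # the integer codes stored for each observation: 1 2 1 3
```

Each integer code points to a position in the levels, so 1 stands for the first level.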

The names of the variables can be obtained with:

names(us_contagious_diseases)
[1] "disease"         "state"           "year"            "weeks_reporting"
[5] "count"           "population"     

You can display the data frame as a table with:

View(us_contagious_diseases)

Dimensions of our data frame

dim(us_contagious_diseases)
[1] 16065     6
ncol(us_contagious_diseases)
[1] 6
nrow(us_contagious_diseases)
[1] 16065
length(us_contagious_diseases)
[1] 6
length(us_contagious_diseases$disease)
[1] 16065
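
It may seem surprising that length() returns the number of columns rather than the number of rows. One way to see why is that a data frame is a list of columns. The sketch below uses a toy data frame with made-up values (the name toy and its contents are assumptions for illustration only):

```r
# A toy data frame (hypothetical values) with 3 rows and 2 columns.
toy <- data.frame(
  state = c("Alabama", "Alaska", "Arizona"),
  count = c(321, 0, 15)
)

dim(toy)           # number of rows, then number of columns: 3 2
nrow(toy)          # 3
ncol(toy)          # 2
is.list(toy)       # TRUE: a data frame is a list of columns
length(toy)        # 2: the number of elements in that list, i.e. columns
length(toy$count)  # 3: a single column has one element per row
```

The $ operator extracts one column, so its length matches the number of rows.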

Summary statistics

summary(us_contagious_diseases)
        disease            state            year      weeks_reporting
 Hepatitis A:2346   Alabama   :  315   Min.   :1928   Min.   : 0.00  
 Measles    :3825   Alaska    :  315   1st Qu.:1950   1st Qu.:31.00  
 Mumps      :1785   Arizona   :  315   Median :1975   Median :46.00  
 Pertussis  :2856   Arkansas  :  315   Mean   :1971   Mean   :37.38  
 Polio      :2091   California:  315   3rd Qu.:1990   3rd Qu.:50.00  
 Rubella    :1887   Colorado  :  315   Max.   :2011   Max.   :52.00  
 Smallpox   :1275   (Other)   :14175                                 
     count          population      
 Min.   :     0   Min.   :   86853  
 1st Qu.:     7   1st Qu.: 1018755  
 Median :    69   Median : 2749249  
 Mean   :  1492   Mean   : 4107584  
 3rd Qu.:   525   3rd Qu.: 4996229  
 Max.   :132342   Max.   :37607525  
                  NA's   :214
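
The last line of the population column reports NA's, which are missing values. As a hedged sketch of how missing values behave, here is a toy vector with made-up entries (the name population and its values are illustrative, not the real column):

```r
# A toy vector (hypothetical values) with two missing entries.
population <- c(3345787, NA, 3386068, NA, 3444165)

is.na(population)               # TRUE where a value is missing
sum(is.na(population))          # number of missing values: 2
mean(population)                # NA: missing values propagate
mean(population, na.rm = TRUE)  # mean of the non-missing values only
```

The same is.na() approach could be applied to us_contagious_diseases$population to count its missing values.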