Data exploration
An important first step of data analysis is to have a look at the data. In this section, we will explore the us_contagious_diseases dataset from the dslabs package.
Load the dslabs package
This package contains a number of datasets. To access any of them, we first need to load the package:
library(dslabs)
library() is a function:
class(library)
[1] "function"
Functions are the “verbs” of programming languages. They do things.
library() is a function that loads packages into the current session so that their content becomes available. dslabs is the argument that we pass to the function library(): it is this particular package that we are loading into the session here.
class() is also a function: it tells us what class an object belongs to. In class(library), library is the argument of the function class().
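As a small illustration of passing arguments to a function, here is a sketch that uses only functions shipped with base R. The values passed to class() are arbitrary examples chosen for this sketch, not part of the dataset.

```r
# class() takes one argument and reports the class of that object
class(class)      # class() is itself a function
class("dslabs")   # a piece of text is a character vector
class(TRUE)       # a logical value

# Store the results so they can be inspected later
class_of_library <- class(library)
class_of_text    <- class("dslabs")
```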
Printing data to screen
To print all the data, we would simply run us_contagious_diseases. There are a lot of rows, however, so we only want to print a subset to the screen.
To print the first six rows, we use the function head(), with our data as the argument:
head(us_contagious_diseases)
disease state year weeks_reporting count population
1 Hepatitis A Alabama 1966 50 321 3345787
2 Hepatitis A Alabama 1967 49 291 3364130
3 Hepatitis A Alabama 1968 52 314 3386068
4 Hepatitis A Alabama 1969 49 380 3412450
5 Hepatitis A Alabama 1970 51 413 3444165
6 Hepatitis A Alabama 1971 51 378 3481798
If you look at the documentation of the head() function (by running ?head), you can see that it accepts another argument that allows us to set the number of rows to print.
Let’s print the first 15 rows:
head(us_contagious_diseases, n = 15)
disease state year weeks_reporting count population
1 Hepatitis A Alabama 1966 50 321 3345787
2 Hepatitis A Alabama 1967 49 291 3364130
3 Hepatitis A Alabama 1968 52 314 3386068
4 Hepatitis A Alabama 1969 49 380 3412450
5 Hepatitis A Alabama 1970 51 413 3444165
6 Hepatitis A Alabama 1971 51 378 3481798
7 Hepatitis A Alabama 1972 45 342 3524543
8 Hepatitis A Alabama 1973 45 467 3571209
9 Hepatitis A Alabama 1974 45 244 3620548
10 Hepatitis A Alabama 1975 46 286 3671246
11 Hepatitis A Alabama 1976 50 220 3721914
12 Hepatitis A Alabama 1977 43 206 3771085
13 Hepatitis A Alabama 1978 41 203 3817217
14 Hepatitis A Alabama 1979 47 257 3858703
15 Hepatitis A Alabama 1980 37 200 3893888
By default, n = 6, which is why head() prints six rows unless we specify otherwise. The L in the documentation of the head() function (n = 6L) means that 6 is an integer. You can ignore this for now.
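For the curious, the effect of the L suffix can be seen with a minimal sketch in base R. The variable names six_double and six_integer are made up for this illustration.

```r
class(6)    # "numeric": plain numbers are stored as double-precision values
class(6L)   # "integer": the L suffix asks R for an integer

six_double  <- 6
six_integer <- 6L

# Same value, different storage type
same_value <- six_double == six_integer
same_type  <- identical(six_double, six_integer)
```

Here same_value is TRUE while same_type is FALSE, which is all the L in n = 6L signals.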
Arguments can be passed to functions as positional arguments (in which case they must follow the order of the function definition) or as named arguments (in which case you need to use the argument names).
That means that if we keep the arguments in the right order, we can omit the name of the argument (n here) and only write its value (15):
head(us_contagious_diseases, 15)
disease state year weeks_reporting count population
1 Hepatitis A Alabama 1966 50 321 3345787
2 Hepatitis A Alabama 1967 49 291 3364130
3 Hepatitis A Alabama 1968 52 314 3386068
4 Hepatitis A Alabama 1969 49 380 3412450
5 Hepatitis A Alabama 1970 51 413 3444165
6 Hepatitis A Alabama 1971 51 378 3481798
7 Hepatitis A Alabama 1972 45 342 3524543
8 Hepatitis A Alabama 1973 45 467 3571209
9 Hepatitis A Alabama 1974 45 244 3620548
10 Hepatitis A Alabama 1975 46 286 3671246
11 Hepatitis A Alabama 1976 50 220 3721914
12 Hepatitis A Alabama 1977 43 206 3771085
13 Hepatitis A Alabama 1978 41 203 3817217
14 Hepatitis A Alabama 1979 47 257 3858703
15 Hepatitis A Alabama 1980 37 200 3893888
If the arguments are given to the function out of order, however, we do need to use their names.
This won’t work because R matches unnamed arguments by position, so it treats the data frame as n, which must be a number:
head(15, us_contagious_diseases)
Error in head.default(15, us_contagious_diseases): invalid 'n' - must be numeric, possibly NA.
This, however, works, because n is named and the unnamed data frame is then taken as the first argument:
head(n = 15, us_contagious_diseases)
disease state year weeks_reporting count population
1 Hepatitis A Alabama 1966 50 321 3345787
2 Hepatitis A Alabama 1967 49 291 3364130
3 Hepatitis A Alabama 1968 52 314 3386068
4 Hepatitis A Alabama 1969 49 380 3412450
5 Hepatitis A Alabama 1970 51 413 3444165
6 Hepatitis A Alabama 1971 51 378 3481798
7 Hepatitis A Alabama 1972 45 342 3524543
8 Hepatitis A Alabama 1973 45 467 3571209
9 Hepatitis A Alabama 1974 45 244 3620548
10 Hepatitis A Alabama 1975 46 286 3671246
11 Hepatitis A Alabama 1976 50 220 3721914
12 Hepatitis A Alabama 1977 43 206 3771085
13 Hepatitis A Alabama 1978 41 203 3817217
14 Hepatitis A Alabama 1979 47 257 3858703
15 Hepatitis A Alabama 1980 37 200 3893888
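The rules for positional and named arguments can be checked on a small stand-in data frame. The sketch below does not need dslabs: toy is a hypothetical data frame built from a few of the values printed above, and the other variable names are made up for this illustration.

```r
# Hypothetical stand-in for the real data: two columns, six rows
toy <- data.frame(
  year  = c(1966, 1967, 1968, 1969, 1970, 1971),
  count = c(321, 291, 314, 380, 413, 378)
)

by_position <- head(toy, 3)          # positional: data first, then n
by_name     <- head(x = toy, n = 3)  # both arguments named
name_first  <- head(n = 3, toy)      # n named, so the order no longer matters

# All three calls return the same three rows
identical(by_position, by_name)
identical(by_position, name_first)

# Unnamed arguments in the wrong order raise an error, which we catch here
wrong_order <- tryCatch(head(3, toy), error = function(e) "error")
```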
We can also print the last six rows of the data with tail():
tail(us_contagious_diseases)
disease state year weeks_reporting count population
16060 Smallpox Wyoming 1947 49 1 276297
16061 Smallpox Wyoming 1948 24 1 280803
16062 Smallpox Wyoming 1949 0 0 285544
16063 Smallpox Wyoming 1950 1 2 290529
16064 Smallpox Wyoming 1951 1 1 295744
16065 Smallpox Wyoming 1952 1 1 301083
Your turn:
How would you print the last 10 rows of the data?
Structure of the data object
us_contagious_diseases is an R object containing the dataset, but what kind of object is it?
class(us_contagious_diseases)
[1] "data.frame"
Our data is in a class of R object called a data frame.
We can get its full structure with:
str(us_contagious_diseases)
'data.frame': 16065 obs. of 6 variables:
$ disease : Factor w/ 7 levels "Hepatitis A",..: 1 1 1 1 1 1 1 1 1 1 ...
$ state : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
$ year : num 1966 1967 1968 1969 1970 ...
$ weeks_reporting: num 50 49 52 49 51 51 45 45 45 46 ...
$ count : num 321 291 314 380 413 378 342 467 244 286 ...
$ population : num 3345787 3364130 3386068 3412450 3444165 ...
The names of the variables can be obtained with:
names(us_contagious_diseases)
[1] "disease" "state" "year" "weeks_reporting"
[5] "count" "population"
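The same inspection functions work on any data frame. As a minimal sketch that runs without dslabs, the hypothetical data frame toy below mimics three of the columns; sapply() is used here as one possible way to get the class of every column at once.

```r
# Hypothetical stand-in with one factor column and two numeric columns
toy <- data.frame(
  disease = factor(rep("Hepatitis A", 3)),
  year    = c(1966, 1967, 1968),
  count   = c(321, 291, 314)
)

class(toy)   # the object as a whole is a data frame
names(toy)   # the variable (column) names
str(toy)     # compact overview: dimensions, column types, first values

# Class of each column, returned as a named character vector
column_classes <- sapply(toy, class)
```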
You can display the data frame as a table in a spreadsheet-style viewer with:
View(us_contagious_diseases)
Dimensions of our data frame
dim(us_contagious_diseases)
[1] 16065 6
ncol(us_contagious_diseases)
[1] 6
nrow(us_contagious_diseases)
[1] 16065
length(us_contagious_diseases)
[1] 6
length(us_contagious_diseases$disease)
[1] 16065
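These results fit together once we recall that a data frame is a list of columns: length() of the data frame counts its columns, while length() of a single column counts its rows. A minimal sketch on a hypothetical data frame toy (built from a few values printed earlier) shows the relationships:

```r
# Hypothetical stand-in: six rows, three columns
toy <- data.frame(
  year            = c(1966, 1967, 1968, 1969, 1970, 1971),
  weeks_reporting = c(50, 49, 52, 49, 51, 51),
  count           = c(321, 291, 314, 380, 413, 378)
)

dim(toy)          # rows first, then columns
nrow(toy)         # same as dim(toy)[1]
ncol(toy)         # same as dim(toy)[2]
length(toy)       # number of columns, because a data frame is a list of columns
length(toy$year)  # number of elements in one column, i.e. the number of rows
```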
Summary statistics
summary(us_contagious_diseases)
disease state year weeks_reporting
Hepatitis A:2346 Alabama : 315 Min. :1928 Min. : 0.00
Measles :3825 Alaska : 315 1st Qu.:1950 1st Qu.:31.00
Mumps :1785 Arizona : 315 Median :1975 Median :46.00
Pertussis :2856 Arkansas : 315 Mean :1971 Mean :37.38
Polio :2091 California: 315 3rd Qu.:1990 3rd Qu.:50.00
Rubella :1887 Colorado : 315 Max. :2011 Max. :52.00
Smallpox :1275 (Other) :14175
count population
Min. : 0 Min. : 86853
1st Qu.: 7 1st Qu.: 1018755
Median : 69 Median : 2749249
Mean : 1492 Mean : 4107584
3rd Qu.: 525 3rd Qu.: 4996229
Max. :132342 Max. :37607525
NA's :214
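To see how summary() builds this output, it can help to apply it to single vectors. The sketch below uses hypothetical toy vectors (a few counts from the rows printed earlier plus one missing value added for illustration), not the real dataset.

```r
# Hypothetical numeric vector with one missing value
counts <- c(321, 291, 314, 380, 413, 378, NA)

# For a numeric vector, summary() reports quartiles, the mean, and the number of NAs
count_summary <- summary(counts)
names(count_summary)

# The number of missing values can also be computed directly
n_missing <- sum(is.na(counts))

# For a factor, summary() counts the observations in each level
diseases <- factor(c("Measles", "Mumps", "Measles"))
disease_counts <- summary(diseases)
```

This mirrors the output above: numeric columns such as population get quartiles and an NA count, while factor columns such as disease and state get counts per level.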