library(readr)
Data import and export
So far, we have used a well-formatted dataset. In the real world, things are often not this nice and tidy…
In this section, we will learn how to handle real data.
Reading in data
The readr package from the tidyverse provides a number of functions to read in text files with tabular data (e.g. comma-separated values (CSV) or tab-separated values (TSV) files).
Let’s load it:
The read_csv()
function allows to read in CSV files that are either stored locally or from a URL.
Let’s use it to load a CSV file with mock archaeological data which is at the URL https://mint.westdri.ca/r/hss_data/arc1.csv:
<- read_csv("https://mint.westdri.ca/r/hss_data/arc1.csv") arc1
Rows: 6 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Site, Date, Number of artifacts, Name of PI, Comments
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
If the file was in your machine, you would provide its path instead of the URL.
Here is our data:
arc1
# A tibble: 6 × 5
Site Date `Number of artifacts` `Name of PI`
<chr> <chr> <chr> <chr>
1 E1 13/2/2001 4 John Doe
2 E1 14/2/2001 3 John Doe
3 A2 26/3/2003 N/A Paul Smith
4 B18 4/5/2006 5 Paul Smith
5 B7 4/5/2006 5 n/a
6 B3 4/5/2006 5 P. Smith
Comments
<chr>
1 <NA>
2 <NA>
3 Artifacts still need to be counted
4 <NA>
5 <NA>
6 <NA>
Improper NA
In R, missing values are represented by NA
(not available). It is a constant that R understands and can deal with, so it is important that all missing values are represented properly.
When you enter data (say in an Excel file or CSV file), leave an empty cell for missing values: R will then transform them automatically into NA
.
Because this data was not entered properly, we have to fix our missing values. One way to go about this is to replace the characters representing missing values in the file ("N/A"
and "n/a"
) by NA
:
is.na(arc1) <- arc1 == "N/A"
is.na(arc1) <- arc1 == "n/a"
arc1
# A tibble: 6 × 5
Site Date `Number of artifacts` `Name of PI`
<chr> <chr> <chr> <chr>
1 E1 13/2/2001 4 John Doe
2 E1 14/2/2001 3 John Doe
3 A2 26/3/2003 <NA> Paul Smith
4 B18 4/5/2006 5 Paul Smith
5 B7 4/5/2006 5 <NA>
6 B3 4/5/2006 5 P. Smith
Comments
<chr>
1 <NA>
2 <NA>
3 Artifacts still need to be counted
4 <NA>
5 <NA>
6 <NA>
Now, we have another problem to fix: readr
is very good at guessing the types of the various variables. Unfortunately, the character "N/A"
in the Number of artifacts
column prevented it to guess the type properly: it should be a double (a numerical value) and not a character. We can fix this too:
$`Number of artifacts` <- as.double(arc1$`Number of artifacts`)
arc1 arc1
# A tibble: 6 × 5
Site Date `Number of artifacts` `Name of PI`
<chr> <chr> <dbl> <chr>
1 E1 13/2/2001 4 John Doe
2 E1 14/2/2001 3 John Doe
3 A2 26/3/2003 NA Paul Smith
4 B18 4/5/2006 5 Paul Smith
5 B7 4/5/2006 5 <NA>
6 B3 4/5/2006 5 P. Smith
Comments
<chr>
1 <NA>
2 <NA>
3 Artifacts still need to be counted
4 <NA>
5 <NA>
6 <NA>
Alternatively, it is simpler to have read_csv()
properly recognize the missing values. This can be done thanks to the na
argument:
<- read_csv(
arc1 "https://mint.westdri.ca/r/hss_data/arc1.csv",
na = c("N/A", "n/a")
)
Rows: 6 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Site, Date, Name of PI, Comments
dbl (1): Number of artifacts
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
arc1
# A tibble: 6 × 5
Site Date `Number of artifacts` `Name of PI`
<chr> <chr> <dbl> <chr>
1 E1 13/2/2001 4 John Doe
2 E1 14/2/2001 3 John Doe
3 A2 26/3/2003 NA Paul Smith
4 B18 4/5/2006 5 Paul Smith
5 B7 4/5/2006 5 <NA>
6 B3 4/5/2006 5 P. Smith
Comments
<chr>
1 ""
2 ""
3 "Artifacts still need to be counted"
4 ""
5 ""
6 ""
A benefit of this approach is that read_csv()
now automatically detects the proper data type of Number of artifacts
(since there is no more confusing character in what is otherwise a column of doubles).
Dealing with dates
There is another problem in our data frame: the Date
variable should be of the date
type, but read_csv()
failed to recognize the values as dates and processed them as characters. This is because it is not entered in our data following the ISO 8601 format which is YYYY-MM-DD. When you enter data, make sure to follow this format as it will make things work automatically. In our case, we have to convert the date.
The tidyverse package dealing with date is lubridate. Let’s load it:
library(lubridate)
Attaching package: 'lubridate'
The following objects are masked from 'package:base':
date, intersect, setdiff, union
lubridate
comes with many functions that can convert dates and times from many format to the ISO format. Since our date have the day, then the month, then the year, the function we need is dmy()
:
$Date <- dmy(arc1$Date) arc1
Alternatively, read_csv()
will understand dates in a non ISO format, provided you give it the right information. This can be done with the col_types
argument and the col_date()
function to which the parameters corresponding to your date format are passed.
Here are the parameters to use:
Format | Example | Parameter | |
---|---|---|---|
Year | 4 digits | 2024 | %Y |
2 digits | 24 | %y |
|
Month | Decimal | 2 | %m |
Abbreviated name | Feb | %b |
|
Full name | February | %B |
|
Day | Decimal | 8 | %d |
In our case, the date looks like "%d/%m/%Y"
:
<- read_csv(
arc1 "https://mint.westdri.ca/r/hss_data/arc1.csv",
na = c("N/A", "n/a"),
col_types = cols(Date = col_date("%d/%m/%Y"))
) arc1
# A tibble: 6 × 5
Site Date `Number of artifacts` `Name of PI`
<chr> <date> <dbl> <chr>
1 E1 2001-02-13 4 John Doe
2 E1 2001-02-14 3 John Doe
3 A2 2003-03-26 NA Paul Smith
4 B18 2006-05-04 5 Paul Smith
5 B7 2006-05-04 5 <NA>
6 B3 2006-05-04 5 P. Smith
Comments
<chr>
1 ""
2 ""
3 "Artifacts still need to be counted"
4 ""
5 ""
6 ""
Renaming variables
Variable names cannot contain spaces. Since our data did have spaces in some of the names and since those names were not quoted, R added backticks ``` to be able to make use of them. This makes for rather awkward variables. Let’s rename them.
We could use the camel or snake case, but we can also just simplify the names:
<- arc1 |>
arc1 rename(
Artifacts = `Number of artifacts`,
PI = `Name of PI`
)
Error in rename(arc1, Artifacts = `Number of artifacts`, PI = `Name of PI`): could not find function "rename"
Fixing inconsistencies
There is still another problem in our data: Paul Smith
and P. Smith
are—as far as R is concerned—2 different values. The number of PIs in our data should be two, but R currently interprets it as being three:
::n_distinct(arc1$PI, na.rm = TRUE) dplyr
Warning: Unknown or uninitialised column: `PI`.
[1] 0
We remove the missing values so that they don’t get counted as an additional PI (although, more PIs could have been involved in the data collection: dealing with missing values programmatically is easy once they are properly formatted, but what to do with them methodologically depends on the situation and is part of the research question).
This can be a problem for future analysis, so let’s fix it. There are many ways to go about this, but the simplest is to use regular expressions:
$PI <- gsub("P\\.", "Paul", arc1$PI) arc1
Warning: Unknown or uninitialised column: `PI`.
Error in `$<-`:
! Assigned data `gsub("P\\\\.", "Paul", arc1$PI)` must be compatible
with existing data.
✖ Existing data has 6 rows.
✖ Assigned data has 0 rows.
ℹ Only vectors of size 1 are recycled.
Caused by error in `vectbl_recycle_rhs_rows()`:
! Can't recycle input of size 0 to size 6.
Our data is finally well formatted and can be used for plotting, analyses, etc.:
arc1
# A tibble: 6 × 5
Site Date `Number of artifacts` `Name of PI`
<chr> <date> <dbl> <chr>
1 E1 2001-02-13 4 John Doe
2 E1 2001-02-14 3 John Doe
3 A2 2003-03-26 NA Paul Smith
4 B18 2006-05-04 5 Paul Smith
5 B7 2006-05-04 5 <NA>
6 B3 2006-05-04 5 P. Smith
Comments
<chr>
1 ""
2 ""
3 "Artifacts still need to be counted"
4 ""
5 ""
6 ""
Writing data to file
Now that we have a properly formatted data frame, we could, if we needed to, export it to a new file. readr
also has functions to write to text files.
Let’s save our data frame as a new CSV file (make sure to give it a different name from the original file):
write_csv(arc1, "arc1_clean.csv")