Data import and export

Author

Marie-Hélène Burle

So far, we have used a well-formatted dataset. In the real world, things are often not this nice and tidy…

In this section, we will learn how to handle real data.

Reading in data

The readr package from the tidyverse provides a number of functions to read in text files with tabular data (e.g. comma-separated values (CSV) or tab-separated values (TSV) files).

Let’s load it:

library(readr)

The read_csv() function allows to read in CSV files that are either stored locally or from a URL.

Let’s use it to load a CSV file with mock archaeological data which is at the URL https://mint.westdri.ca/r/hss_data/arc1.csv:

arc1 <- read_csv("https://mint.westdri.ca/r/hss_data/arc1.csv")

Rows: 6 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Site, Date, Number of artifacts, Name of PI, Comments

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

If the file was in your machine, you would provide its path instead of the URL.

Here is our data:

arc1

# A tibble: 6 × 5
  Site  Date      `Number of artifacts` `Name of PI`
  <chr> <chr>     <chr>                 <chr>       
1 E1    13/2/2001 4                     John Doe    
2 E1    14/2/2001 3                     John Doe    
3 A2    26/3/2003 N/A                   Paul Smith  
4 B18   4/5/2006  5                     Paul Smith  
5 B7    4/5/2006  5                     n/a         
6 B3    4/5/2006  5                     P. Smith    
  Comments                          
  <chr>                             
1 <NA>                              
2 <NA>                              
3 Artifacts still need to be counted
4 <NA>                              
5 <NA>                              
6 <NA>

Improper NA

In R, missing values are represented by NA (not available). It is a constant that R understands and can deal with, so it is important that all missing values are represented properly.

When you enter data (say in an Excel file or CSV file), leave an empty cell for missing values: R will then transform them automatically into NA.

Because this data was not entered properly, we have to fix our missing values. One way to go about this is to replace the characters representing missing values in the file ("N/A" and "n/a") by NA:

is.na(arc1) <- arc1 == "N/A"
is.na(arc1) <- arc1 == "n/a"
arc1

# A tibble: 6 × 5
  Site  Date      `Number of artifacts` `Name of PI`
  <chr> <chr>     <chr>                 <chr>       
1 E1    13/2/2001 4                     John Doe    
2 E1    14/2/2001 3                     John Doe    
3 A2    26/3/2003 <NA>                  Paul Smith  
4 B18   4/5/2006  5                     Paul Smith  
5 B7    4/5/2006  5                     <NA>        
6 B3    4/5/2006  5                     P. Smith    
  Comments                          
  <chr>                             
1 <NA>                              
2 <NA>                              
3 Artifacts still need to be counted
4 <NA>                              
5 <NA>                              
6 <NA>

Now, we have another problem to fix: readr is very good at guessing the types of the various variables. Unfortunately, the character "N/A" in the Number of artifacts column prevented it to guess the type properly: it should be a double (a numerical value) and not a character. We can fix this too:

arc1$`Number of artifacts` <- as.double(arc1$`Number of artifacts`)
arc1

# A tibble: 6 × 5
  Site  Date      `Number of artifacts` `Name of PI`
  <chr> <chr>                     <dbl> <chr>       
1 E1    13/2/2001                     4 John Doe    
2 E1    14/2/2001                     3 John Doe    
3 A2    26/3/2003                    NA Paul Smith  
4 B18   4/5/2006                      5 Paul Smith  
5 B7    4/5/2006                      5 <NA>        
6 B3    4/5/2006                      5 P. Smith    
  Comments                          
  <chr>                             
1 <NA>                              
2 <NA>                              
3 Artifacts still need to be counted
4 <NA>                              
5 <NA>                              
6 <NA>

Alternatively, it is simpler to have read_csv() properly recognize the missing values. This can be done thanks to the na argument:

arc1 <- read_csv(
  "https://mint.westdri.ca/r/hss_data/arc1.csv",
  na = c("N/A", "n/a")
)

Rows: 6 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Site, Date, Name of PI, Comments
dbl (1): Number of artifacts

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

arc1

# A tibble: 6 × 5
  Site  Date      `Number of artifacts` `Name of PI`
  <chr> <chr>                     <dbl> <chr>       
1 E1    13/2/2001                     4 John Doe    
2 E1    14/2/2001                     3 John Doe    
3 A2    26/3/2003                    NA Paul Smith  
4 B18   4/5/2006                      5 Paul Smith  
5 B7    4/5/2006                      5 <NA>        
6 B3    4/5/2006                      5 P. Smith    
  Comments                            
  <chr>                               
1 ""                                  
2 ""                                  
3 "Artifacts still need to be counted"
4 ""                                  
5 ""                                  
6 ""

A benefit of this approach is that read_csv() now automatically detects the proper data type of Number of artifacts (since there is no more confusing character in what is otherwise a column of doubles).

Dealing with dates

There is another problem in our data frame: the Date variable should be of the date type, but read_csv() failed to recognize the values as dates and processed them as characters. This is because it is not entered in our data following the ISO 8601 format which is YYYY-MM-DD. When you enter data, make sure to follow this format as it will make things work automatically. In our case, we have to convert the date.

The tidyverse package dealing with date is lubridate. Let’s load it:

library(lubridate)


Attaching package: 'lubridate'

The following objects are masked from 'package:base':

    date, intersect, setdiff, union

lubridate comes with many functions that can convert dates and times from many format to the ISO format. Since our date have the day, then the month, then the year, the function we need is dmy():

arc1$Date <- dmy(arc1$Date)

Alternatively, read_csv() will understand dates in a non ISO format, provided you give it the right information. This can be done with the col_types argument and the col_date() function to which the parameters corresponding to your date format are passed.

Here are the parameters to use:

	Format	Example	Parameter
Year	4 digits	2024	`%Y`
	2 digits	24	`%y`
Month	Decimal	2	`%m`
	Abbreviated name	Feb	`%b`
	Full name	February	`%B`
Day	Decimal	8	`%d`

In our case, the date looks like "%d/%m/%Y":

arc1 <- read_csv(
  "https://mint.westdri.ca/r/hss_data/arc1.csv",
  na = c("N/A", "n/a"),
  col_types = cols(Date = col_date("%d/%m/%Y"))
)
arc1

# A tibble: 6 × 5
  Site  Date       `Number of artifacts` `Name of PI`
  <chr> <date>                     <dbl> <chr>       
1 E1    2001-02-13                     4 John Doe    
2 E1    2001-02-14                     3 John Doe    
3 A2    2003-03-26                    NA Paul Smith  
4 B18   2006-05-04                     5 Paul Smith  
5 B7    2006-05-04                     5 <NA>        
6 B3    2006-05-04                     5 P. Smith    
  Comments                            
  <chr>                               
1 ""                                  
2 ""                                  
3 "Artifacts still need to be counted"
4 ""                                  
5 ""                                  
6 ""

Renaming variables

Variable names cannot contain spaces. Since our data did have spaces in some of the names and since those names were not quoted, R added backticks ``` to be able to make use of them. This makes for rather awkward variables. Let’s rename them.

We could use the camel or snake case, but we can also just simplify the names:

arc1 <- arc1 |>
  rename(
    Artifacts = `Number of artifacts`,
    PI = `Name of PI`
  )

Error in rename(arc1, Artifacts = `Number of artifacts`, PI = `Name of PI`): could not find function "rename"

Fixing inconsistencies

There is still another problem in our data: Paul Smith and P. Smith are—as far as R is concerned—2 different values. The number of PIs in our data should be two, but R currently interprets it as being three:

dplyr::n_distinct(arc1$PI, na.rm = TRUE)

Warning: Unknown or uninitialised column: `PI`.

[1] 0

We remove the missing values so that they don’t get counted as an additional PI (although, more PIs could have been involved in the data collection: dealing with missing values programmatically is easy once they are properly formatted, but what to do with them methodologically depends on the situation and is part of the research question).

This can be a problem for future analysis, so let’s fix it. There are many ways to go about this, but the simplest is to use regular expressions:

arc1$PI <- gsub("P\\.", "Paul", arc1$PI)

Warning: Unknown or uninitialised column: `PI`.

Error in `$<-`:
! Assigned data `gsub("P\\\\.", "Paul", arc1$PI)` must be compatible
  with existing data.
✖ Existing data has 6 rows.
✖ Assigned data has 0 rows.
ℹ Only vectors of size 1 are recycled.
Caused by error in `vectbl_recycle_rhs_rows()`:
! Can't recycle input of size 0 to size 6.

Our data is finally well formatted and can be used for plotting, analyses, etc.:

arc1

# A tibble: 6 × 5
  Site  Date       `Number of artifacts` `Name of PI`
  <chr> <date>                     <dbl> <chr>       
1 E1    2001-02-13                     4 John Doe    
2 E1    2001-02-14                     3 John Doe    
3 A2    2003-03-26                    NA Paul Smith  
4 B18   2006-05-04                     5 Paul Smith  
5 B7    2006-05-04                     5 <NA>        
6 B3    2006-05-04                     5 P. Smith    
  Comments                            
  <chr>                               
1 ""                                  
2 ""                                  
3 "Artifacts still need to be counted"
4 ""                                  
5 ""                                  
6 ""

Writing data to file

Now that we have a properly formatted data frame, we could, if we needed to, export it to a new file. readr also has functions to write to text files.

Let’s save our data frame as a new CSV file (make sure to give it a different name from the original file):

write_csv(arc1, "arc1_clean.csv")