A little demo of programming in

noshadow

Marie-Hélène Burle

February 15, 2024


A few words about R

History

Created by academic statisticians Ross Ihaka and Robert Gentleman

The name comes from the language S which was a great influence as well as the first initial of the developers

Launched in 1993

A GNU Project since 1997

Why R?

Free and open source

High-level and easy to learn

Large community

Very well documented

Unequalled number of statistics and modelling packages

Integrated package manager

Easy connection with fast compiled languages such as C and C++

Powerful IDEs (e.g. RStudio, ESS, Jupyter)

For whom?

Fields with heavy statistics, modelling, or Bayesian inference such as biology, linguistics, economics, or statistics

Data science

Downsides

Inconsistent syntax full of quirks

Slow

Large memory usage

Running R

An interpreted language

R being an interpreted language, it can be run non-interactively or interactively

Running R non-interactively

If you write code in a text file (called a script), you can then execute it with:

Rscript my_script.R

The command to execute scripts is Rscript rather than R
By convention, R scripts take the extension .R

Running R interactively

There are several ways to run R interactively:

  • directly in the console (the name for the R shell)
  • in Jupyter with the R kernel (IRkernel package)
  • in another IDE (e.g. in Emacs with ESS)
  • in the RStudio IDE

The R console

noshadow

RStudio

Posit (formerly RStudio Inc.) developed a great and very popular IDE called RStudio

Here is its cheatsheet (click on it to download it):

A few basics

Documentation

The R documentation is excellent. Get info on any function with ? (e.g. ?sum)

Basic operations

a <- 5
4 + a
[1] 9
c <- c(2, 4, 1)
c * 5
[1] 10 20  5
sum(c)
[1] 7

Statistics, probabilities, and modelling

R really shines when it comes to statistics and modelling

We will spend the rest of the hour diving into very complex and heavy Bayesian statistics

Just kidding 🙂

In this demo, I will stick to fun topics

Data visualization

Datasets

R comes with a number of datasets. You can get a list by running data()

Datasets

The ggplot2 package provides additional ones, such as the mpg dataset:

library(ggplot2)
head(mpg)
# A tibble: 6 × 11
  manufacturer model displ  year   cyl trans      drv     cty   hwy fl   
  <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr>
1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p    
2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p    
3 audi         a4      2    2008     4 manual(m6) f        20    31 p    
4 audi         a4      2    2008     4 auto(av)   f        21    30 p    
5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p    
6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p    
  class  
  <chr>  
1 compact
2 compact
3 compact
4 compact
5 compact
6 compact

The canvas


The first component is the data:

ggplot(data = mpg)

The canvas

The second component sets the way variables are mapped on the axes. This is done with the aes() (aesthetics) function:

ggplot(data = mpg, mapping = aes(x = displ, y = hwy))

Geometric representations of the data

Onto this canvas, we can add “geoms” (geometrical objects) representing the data.
To represent the data as a scatterplot, we use the geom_point() function:

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

Colour-coding based on variables

We can colour-code the points in the scatterplot based on the drv variable, showing the lower fuel efficiency of 4WD vehicles:

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = drv))

Colour-coding based on variables

Or we can colour-code them based on the class variable:

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class))

Multiple geoms

Multiple “geoms” can be added on top of each other. For instance, we can add a smoothed conditional means function that aids at seeing patterns in the data with geom_smooth():

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth()

The default smoothing function uses the LOESS (locally estimated scatterplot smoothing) method. We can change the method by passing it as an argument to geom_smooth():

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(method = lm)

We can apply the smoothing function to each class instead of the entire data. It creates a busy plot but shows that the downward trend remains true within each type of car:

ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point(aes(color = class)) +
  geom_smooth(method = lm)

We can remove the standard errors and customize the line for our linear model:

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(
    method = lm,
    se = FALSE,
    color = "#999999",
    linewidth = 0.5
  )

Colour scales



Let’s try the Dark2 palette:

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  scale_color_brewer(palette = "Dark2") +
  geom_smooth(
    method = lm,
    se = FALSE,
    color = "#999999",
    linewidth = 0.5
  )


We can add title, axes labels, captions…

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  scale_color_brewer(palette = "Dark2") +
  geom_smooth(
    method = lm,
    se = FALSE,
    color = "#999999",
    linewidth = 0.5
  ) +
  labs(
    title = "Fuel consumption per engine size on highways",
    x = "Engine size (L)",
    y = "Fuel economy (mpg) on highways",
    color = "Type of car",
    caption = "EPA data from https://fueleconomy.gov/"
  )

Let’s change the theme to remove all this background noise:

ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = class)) +
  scale_color_brewer(palette = "Dark2") +
  geom_smooth(
    method = lm,
    se = FALSE,
    color = "#999999",
    linewidth = 0.5
  ) +
  labs(
    title = "Fuel consumption per engine size on highways",
    x = "Engine size (L)",
    y = "Fuel economy (mpg) on highways",
    color = "Type of car",
    caption = "EPA data from https://fueleconomy.gov/"
  ) +
  theme_classic()

The theme() function allows to tweak the theme in any number of ways. For instance, what if we don’t like the default position of the title and we’d rather have it centered?

ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = class)) +
  scale_color_brewer(palette = "Dark2") +
  geom_smooth(
    method = lm,
    se = FALSE,
    color = "#999999",
    linewidth = 0.5
  ) +
  labs(
    title = "Fuel consumption per engine size on highways",
    x = "Engine size (L)",
    y = "Fuel economy (mpg) on highways",
    color = "Type of car",
    caption = "EPA data from https://fueleconomy.gov/"
  ) +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

Many things can be changed thanks to the theme() function. For instance, we can move the legend to give more space to the actual graph:

ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = class)) +
  scale_color_brewer(palette = "Dark2") +
  geom_smooth(
    method = lm,
    se = FALSE,
    color = "#999999",
    linewidth = 0.5
  ) +
  labs(
    title = "Fuel consumption per engine size on highways",
    x = "Engine size (L)",
    y = "Fuel economy (mpg) on highways",
    color = "Type of car",
    caption = "EPA data from https://fueleconomy.gov/"
  ) +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5), legend.position = "bottom")

ggplot extensions

Many packages build on ggplot2 and add functionality

Combining plots

One ggplot extension is the patchwork package which allows to combine multiple plots on the same frame

Let’s add a second plot next to our plot (we also make a few changes to the labels to improve the plots integration):

library(patchwork)

ggplot(mpg, aes(x = displ, y = hwy)) +        # First plot
  geom_point(aes(color = class)) +
  scale_color_brewer(palette = "Dark2") +
  geom_smooth(
    method = lm,
    se = FALSE,
    color = "#999999",
    linewidth = 0.5
  ) +
  labs(
    x = "Engine size (L)",
    y = "Fuel economy (mpg) on highways",
    color = "Type of car"
  ) +
  theme_classic() +
  theme(
    plot.title = element_text(hjust = 0.5),
    legend.position = c(0.7, 0.75),           # Better legend position
    legend.background = element_rect(         # Add a frame to the legend
      linewidth = 0.1,
      linetype = "solid",
      colour = "black"
    )
  ) +
  ggplot(mpg, aes(x = displ, y = hwy)) +      # Second plot
  geom_point(aes(color = drv)) +
  scale_color_brewer(palette = "Dark2") +
  labs(
    x = "Engine size (L)",
    y = element_blank(),                      # Remove redundant label
    color = "Type of drive train",
    caption = "EPA data from https://fueleconomy.gov/"
  ) +
  theme_classic() +
  theme(
    plot.title = element_text(hjust = 0.5),
    legend.position = c(0.7, 0.87),
    legend.background = element_rect(
      linewidth = 0.1,
      linetype = "solid",
      colour = "black"
    )
  )

Web scraping

HTML and CSS

HyperText Markup Language (HTML) is the standard markup language for websites: it encodes the information related to the formatting and structure of webpages. Additionally, some of the customization can be stored in Cascading Style Sheets (CSS) files.

HTML uses tags of the form:

<some_tag>Your content</some_tag>

Some tags have attributes:

<some_tag attribute_name="attribute value">Your content</some_tag>

Examples:

  • <h2>This is a heading of level 2</h2>
  • <b>This is bold</b>
  • <a href="https://some.url">This is the text for a link</a>

Example for this workshop

We will use a website from the University of Tennessee containing a database of PhD theses from that university

Our goal is to scrape data from this site to produce a dataframe with the date, major, and advisor for each dissertation

We will only do this for the first page which contains the links to the 100 most recent theses. If you really wanted to gather all the data, you would have to do this for all pages

Package

To do all this, we will use the package rvest, part of the tidyverse (a modern set of R packages). It is a package influenced by the popular Python package Beautiful Soup and it makes scraping websites with R really easy

Let’s load it:

library(rvest)

Read in HTML from main site

As mentioned above, our site is the database of PhD dissertations from the University of Tennessee

Let’s create a character vector with the URL:

url <- "https://trace.tennessee.edu/utk_graddiss/index.html"

First, we read in the html data from that page:

html <- read_html(url)

Let’s have a look at the raw data:

html
{html_document}
<html lang="en">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body>\n<!-- FILE /srv/sequoia/main/data/trace.tennessee.edu/assets/heade ...

Extract all URLs

dat <- html %>% html_elements(".article-listing a")
dat
{xml_nodeset (100)}
 [1] <a href="https://trace.tennessee.edu/utk_graddiss/8076">Understanding ho ...
 [2] <a href="https://trace.tennessee.edu/utk_graddiss/9158">Generating Diver ...
 [3] <a href="https://trace.tennessee.edu/utk_graddiss/8080">FABRICATION, MEA ...
 [4] <a href="https://trace.tennessee.edu/utk_graddiss/8086">Development and  ...
 [5] <a href="https://trace.tennessee.edu/utk_graddiss/8078">The Light from P ...
 [6] <a href="https://trace.tennessee.edu/utk_graddiss/9185">Image Deblurring ...
 [7] <a href="https://trace.tennessee.edu/utk_graddiss/8584">Clickable Lipid  ...
 [8] <a href="https://trace.tennessee.edu/utk_graddiss/8703">Retinoic Acid, I ...
 [9] <a href="https://trace.tennessee.edu/utk_graddiss/8987">Development and  ...
[10] <a href="https://trace.tennessee.edu/utk_graddiss/8734">Defining Systemi ...
[11] <a href="https://trace.tennessee.edu/utk_graddiss/8073">Investigating th ...
[12] <a href="https://trace.tennessee.edu/utk_graddiss/8088">The Disparate Ef ...
[13] <a href="https://trace.tennessee.edu/utk_graddiss/9077">Social Wellness  ...
[14] <a href="https://trace.tennessee.edu/utk_graddiss/8077">A HIERARCHICAL P ...
[15] <a href="https://trace.tennessee.edu/utk_graddiss/8094">Nurse Staffing a ...
[16] <a href="https://trace.tennessee.edu/utk_graddiss/8714">Ruinous Natures: ...
[17] <a href="https://trace.tennessee.edu/utk_graddiss/9050">Toward Accelerat ...
[18] <a href="https://trace.tennessee.edu/utk_graddiss/8737">Implementing the ...
[19] <a href="https://trace.tennessee.edu/utk_graddiss/8074">Riding the Wave: ...
[20] <a href="https://trace.tennessee.edu/utk_graddiss/9177">A TWO-DIAMETER H ...
...

Extract all URLs

We now have a list of lists

Before running for loops, it is important to initialize empty loops. It is much more efficient than growing the result at each iteration

So let’s initialize an empty list that we call list_urls of the appropriate size:

list_urls <- vector("list", length(dat))

Extract all URLs

Now we can run a loop to fill in our list:

for (i in seq_along(dat)) {
  list_urls[[i]] <- dat[[i]] %>% html_attr("href")
}

Let’s print again the first element of list_urls to make sure all looks good:

list_urls[[1]]
[1] "https://trace.tennessee.edu/utk_graddiss/8076"

We now have a list of URLs (in the form of character vectors) as we wanted

Extract data from each page

We will now extract the data (date, major, and advisor) for all URLs in our list.

Again, before running a for loop, we need to allocate memory first by creating an empty container (here a list):

list_data <- vector("list", length(list_urls))

for (i in seq_along(list_urls)) {
  html <- read_html(list_urls[[i]])
  date <- html %>%
    html_element("#publication_date p") %>%
    html_text2()
  major <- html %>%
    html_element("#department p") %>%
    html_text2()
  advisor <- html %>%
    html_element("#advisor1 p") %>%
    html_text2()
  Sys.sleep(0.1)  # Add a little delay
  list_data[[i]] <- cbind(date, major, advisor)
}

Store results in DataFrame

We can turn this big list into a dataframe:

result <- do.call(rbind.data.frame, list_data)

We can capitalize the headers:

names(result) <- c("Date", "Major", "Advisor")

Our final data

result is a long dataframe, so we will only print the first few elements:

head(result, 15)
      Date                          Major               Advisor
1   5-2023                  Life Sciences       Bode A. Olukolu
2  12-2023         Industrial Engineering            Hugh Medal
3   5-2023            Nuclear Engineering           Erik Lukosi
4   5-2023 Energy Science and Engineering   Kyle R. Gluesenkamp
5   5-2023                        English Margaret Lazarus Dean
6  12-2023         Industrial Engineering          Hoon Hwangbo
7   8-2023                      Chemistry       Michael D. Best
8   8-2023           Nutritional Sciences         Jiangang Chen
9  12-2023         Mechanical Engineering      Dustin L. Crouch
10  8-2023            Counselor Education    Melinda M. Gibbons
11  5-2023         Mechanical Engineering            Doug Aaron
12  5-2023        Business Administration           Linda Myers
13 12-2023            Counselor Education   Joel Foster Diambra
14  5-2023         Industrial Engineering         John E. Kobza
15  5-2023                        Nursing       Carole R. Myers

Save results to file

If we wanted, we could save our data to a CSV file:

write.csv(result, "dissertations_data.csv", row.names = FALSE)

Data reading and manipulation

  • Spatial vectors: great modern packages are sf or terra
  • Raster data: the package terra

I will skip the data preparation due to lack of time, but you can look at the code in this webinar or this workshop

Mapping data

Good options to create maps include ggplot2 (the package we already used for plotting) or tmap

Map of glaciers in western North America

tm_shape(states, bbox = nwa_bbox) +
  tm_polygons(col = "#f2f2f2", lwd = 0.2) +
  tm_shape(ak) +
  tm_borders(col = "#3399ff") +
  tm_fill(col = "#86baff") +
  tm_shape(wes) +
  tm_borders(col = "#3399ff") +
  tm_fill(col = "#86baff") +
  tm_layout(
    title = "Glaciers of Western North America",
    title.position = c("center", "top"),
    title.size = 1.1,
    bg.color = "#fcfcfc",
    inner.margins = c(0.06, 0.01, 0.09, 0.01),
    outer.margins = 0,
    frame.lwd = 0.2
  ) +
  tm_compass(
    type = "arrow",
    position = c("right", "top"),
    size = 1.2,
    text.size = 0.6
  ) +
  tm_scale_bar(
    breaks = c(0, 1000, 2000),
    position = c("right", "BOTTOM")
  )

Multi-layer map of the retreat of a glacier

tm_shape(ag) +
  tm_polygons("year", palette = "Blues") +
  tm_layout(
    title = "Agassiz Glacier",
    title.position = c("center", "top"),
    legend.position = c("left", "bottom"),
    legend.title.color = "#fcfcfc",
    legend.text.size = 1,
    bg.color = "#fcfcfc",
    inner.margins = c(0.07, 0.03, 0.07, 0.03),
    outer.margins = 0
  ) +
  tm_compass(
    type = "arrow",
    position = c("right", "top"),
    text.size = 0.7
  ) +
  tm_scale_bar(
    breaks = c(0, 0.5, 1),
    position = c("right", "BOTTOM"),
    text.size = 1
  )

Animated map of the retreat of a glacier

tmap_animation(tm_shape(ag) +
                 tm_polygons(col = "#86baff") +
                 tm_layout(
                   title = "Agassiz Glacier",
                   title.position = c("center", "top"),
                   legend.position = c("left", "bottom"),
                   legend.title.color = "#fcfcfc",
                   legend.text.size = 1,
                   bg.color = "#fcfcfc",
                   inner.margins = c(0.08, 0, 0.08, 0),
                   outer.margins = 0,
                   panel.label.bg.color = "#fcfcfc"
                 ) +
                 tm_compass(
                   type = "arrow",
                   position = c("right", "top"),
                   text.size = 0.7
                 ) +
                 tm_scale_bar(
                   breaks = c(0, 0.5, 1),
                   position = c("right", "BOTTOM"),
                   text.size = 1
                 ) +
                 tm_facets(
                   along = "year",
                   free.coords = F
                 )filename = "ag.gif",
               dpi = 300,
               inner.margins = c(0.08, 0, 0.08, 0),
               delay = 100
               )

So, how to get started in R?

Three-day introductory workshop for the HSS

As a follow-up to this year HSS Series, we will be offering a free three-day hands-on introduction to R for researchers in the humanities, arts, and social sciences

You can register here

Beyond the HSS series

Each region under the Alliance offers regular courses and workshops in R (and many other topics)

In the west, Alex Razoumov and myself offer regular free workshops, courses, and webinars for researchers in Canadian academic institutions

You can find our program here or join our mailing list here