Web scraping with R

Author

Marie-Hélène Burle

The internet is a trove of information. A lot of it is publicly available and thus suitable for use in research. Extracting that information and putting it in an organized format for analysis can, however, be extremely tedious. Web scraping tools make it possible to automate parts of that process, and R is a popular language for the task.

In this workshop, we will guide you through a simple example using the package rvest.

For this workshop, we will use a temporary RStudio server.

To access it, go to the website given during the workshop and sign in using the username and password you will be given (you can ignore the OTP entry).

This will take you to our JupyterHub. There, click on the “RStudio” button and our RStudio server will open in a new tab.

Our RStudio server already has the two packages that we will be using installed (rvest and tibble). If you want to run the code on your machine, you need to install them with install.packages() first.
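
For example:

install.packages(c("rvest", "tibble"))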

HTML and CSS

HyperText Markup Language (HTML) is the standard markup language for websites: it encodes the information related to the formatting and structure of webpages. Additionally, some of the customization can be stored in Cascading Style Sheets (CSS) files.

HTML uses tags of the form:

<some_tag>Your content</some_tag>

Some tags have attributes:

<some_tag attribute_name="attribute value">Your content</some_tag>

Examples:

Site structure:

  • <h2>This is a heading of level 2</h2>
  • <p>This is a paragraph</p>

Formatting:

  • <b>This is bold</b>
  • <a href="https://some.url">This is the text for a link</a>
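
The class attribute is what CSS selectors latch onto: in a selector, a tag is matched by its name, a class by a leading dot, and a combination such as .some_class some_tag matches some_tag elements nested inside an element of that class. As a hypothetical example, the link in the snippet below would be matched by the selector .article-listing a (we will use this very selector later in the workshop):

<div class="article-listing">
  <a href="https://some.url">This is the text for a link</a>
</div>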

Web scraping

Web scraping is a general term for a set of techniques and tools used to extract data from websites automatically.

While most of the data on the internet is publicly available, it is illegal to scrape some sites and you should always look into the policy of a site before attempting to scrape it. Some sites will also block you if you submit too many requests in a short amount of time, so if you plan on scraping sites at a fairly large scale, you should look into the polite package which will help you scrape responsibly.
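
For reference, here is a minimal sketch of what using polite could look like (the URL is the one we scrape later; bow() introduces you to the host and reads its robots.txt, and scrape() then fetches the page while respecting the host's rate limits):

library(polite)

# Declare who we are and check the site's scraping rules (robots.txt)
session <- bow("https://trace.tennessee.edu/utk_graddiss/index.html",
               user_agent = "R workshop example")

# Fetch the page while respecting the crawl delay
html <- scrape(session)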

Example for this workshop

We will use a website from the University of Tennessee containing a database of PhD theses from that university.

Our goal is to scrape data from this site to produce a dataframe with the date, major, and advisor for each dissertation.

We will only do this for the first page, which contains the links to the 100 most recent theses. If you really wanted to gather all the data, you would have to do this for all pages.

Let’s look at the sites

First of all, let’s have a close look at the websites we want to scrape. Before starting to write code, it is always a good idea to think carefully about what you are trying to achieve.

To create a dataframe with the data for all the dissertations on that first page, we need to do two things:

  • Step 1: from the dissertations database first page, we want to scrape the list of URLs for the dissertation pages.

  • Step 2: once we have the URLs, we want to scrape those pages too to get the date, major, and advisor for each dissertation.

Package

To do all this, we will use the package rvest, part of the tidyverse (a modern set of R packages). Influenced by the popular Python package Beautiful Soup, it makes scraping websites with R really easy.

Let’s load it:

library(rvest)

Read in HTML from main site

As mentioned above, our site is the database of PhD dissertations from the University of Tennessee.

Let’s create a character vector with the URL:

url <- "https://trace.tennessee.edu/utk_graddiss/index.html"

First, we read in the html data from that page:

html <- read_html(url)

Let’s have a look at the raw data:

html
{html_document}
<html lang="en">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body>\n<!-- FILE /srv/sequoia/main/data/trace.tennessee.edu/assets/heade ...

Test run

Identify the relevant HTML markers

The html code for this webpage contains the data we are interested in, but it is mixed in with a lot of HTML formatting and data we don’t care about. We need to extract the data relevant to us and turn it into a workable format.

The first step is to find the HTML markers that contain our data. One option is to use a web inspector or—even easier—the SelectorGadget, a JavaScript bookmarklet built by Andrew Cantino.

To use this tool, go to the SelectorGadget website and drag the link of the bookmarklet to your bookmarks bar.

Now, go to the dissertations database first page and click on the bookmarklet in your bookmarks bar. You will see a floating box at the bottom of your screen. As you move your mouse across the screen, an orange rectangle appears around each element over which you pass.

Click on one of the dissertation links: now, there is an a appearing in the box at the bottom as well as the number of elements selected. The selected elements are highlighted in yellow. Those elements are links (in HTML, a tags define hyperlinks).

As you can see, all the links we want are selected. However, there are many other links we don’t want that are also highlighted. In fact, all links in the document are selected. We need to remove the categories of links that we don’t want. To do this, hover above any of the links we don’t want. You will see a red rectangle around it. Click on it: now all similar links are gone. You might have to do this a few times until only the relevant links (i.e. those that lead to the dissertation information pages) remain highlighted.

As there are 100 such links per page, the count of selected elements in the bottom floating box should be down to 100.

In the main section of the floating box, you can now see: .article-listing a. This means that the data we want are under the HTML elements .article-listing a (the class .article-listing and the tag a).
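
You can sanity-check a selector from R before scraping anything in bulk. The quick check below uses html_elements(), a function we will properly introduce later; it counts the elements matched by our selector and should return 100:

# Count the elements matching our selector (should be 100)
length(html_elements(html, ".article-listing a"))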

Extract test URL

It is a good idea to test things out on a single element before doing a massive batch scraping of a site, so let’s test our method for the first dissertation.

To start, we need to extract the first URL. The function html_element() from the package rvest extracts the first element matching a CSS selector (given as a character string). Let’s pass to this function our html object and the string ".article-listing a" and assign the result to an object that we will call test:

test <- html %>% html_element(".article-listing a")

%>% is the pipe operator from the magrittr package (part of the tidyverse). It passes the output of the left-hand side expression as the first argument of the right-hand side expression. We could have written this as:

test <- html_element(html, ".article-listing a")
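
Since R 4.1, you can also use the native pipe |>, which behaves the same way in this case:

test <- html |> html_element(".article-listing a")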

Our new object is a list:

typeof(test)
[1] "list"

Let’s print it:

test
{html_node}
<a href="https://trace.tennessee.edu/utk_graddiss/10400">

The URL is in there, so we successfully extracted the correct element, but we need to do more cleaning.

As you can see when you print test, a is one of the HTML tags that take an attribute (here, href). It is actually the value of that attribute that we want. To extract an attribute value, we use the function html_attr():

url_test <- test %>% html_attr("href")
url_test
[1] "https://trace.tennessee.edu/utk_graddiss/10400"

This is our URL.

str(url_test)
 chr "https://trace.tennessee.edu/utk_graddiss/10400"

It is saved in a character vector, which is perfect.

Instead of creating the intermediate objects html and test, we could have chained the functions:

url_test <- read_html(url) %>%
  html_element(".article-listing a") %>%
  html_attr("href")

Read in HTML data for test URL

Now that we have the URL for the first dissertation information page, we want to extract the date, major, and advisor for that dissertation.

We just saw that url_test is a character vector representing a URL. We know how to deal with this.

The first thing to do—as we did earlier with the database site—is to read in the html data. Let’s assign it to a new object that we will call html_test:

html_test <- read_html(url_test)
html_test
{html_document}
<html lang="en">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body>\n<!-- FILE /srv/sequoia/main/data/trace.tennessee.edu/assets/heade ...

Get data for test URL

Now, we want to extract the publication date. Thanks to the SelectorGadget, following the method we saw earlier, we can see that we now need the element marked by #publication_date p.

We start by extracting the data as we did earlier by passing our object html_test and the character "#publication_date p" to html_element().

While earlier we wanted the value of a tag attribute (i.e. part of the metadata), here we want the actual text (i.e. part of the actual content). To extract text from a snippet of HTML, we pass it to html_text2().

Let’s run both operations at once to save the creation of an intermediate object:

date_test <- html_test %>%
  html_element("#publication_date p") %>%
  html_text2()

Note the difference with what we did earlier to extract the URL: if we had used html_text2() there, we would have gotten the text part of the link (the title of the dissertation) rather than the URL ("https://trace.tennessee.edu/utk_graddiss/10400").

Let’s verify that our date object indeed contains the date:

date_test
[1] "8-2024"

We also want the major for this thesis. The SelectorGadget allows us to find that this time, it is the #department p element that we need. Let’s extract it in the same fashion:

major_test <- html_test %>%
  html_element("#department p") %>%
  html_text2()
major_test
[1] "History"

And for the advisor, we need the #advisor1 p element:

advisor_test <- html_test %>%
  html_element("#advisor1 p") %>%
  html_text2()
advisor_test
[1] "Jay Rubenstein"

Your turn:

Try using the SelectorGadget to identify the element necessary to extract the abstract of this dissertation.

Now, write the code to extract it and make sure you actually get what you want.
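
One possible solution, assuming the SelectorGadget points you to the #abstract p element (this selector is our assumption; verify it with the tool):

# Assumed selector: #abstract p (check it with the SelectorGadget)
abstract_test <- html_test %>%
  html_element("#abstract p") %>%
  html_text2()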

We now have the date, major, and advisor for the first dissertation. We can create a matrix by passing them as arguments to cbind():

result_test <- cbind(date_test, major_test, advisor_test)
result_test
     date_test major_test advisor_test    
[1,] "8-2024"  "History"  "Jay Rubenstein"

Full run

Extract all URLs

Now that we have tested our code on the first dissertation, we can apply it to all 100 dissertations on the first page of the database.

Instead of using html_element(), this time we will use html_elements() which extracts all matching elements (instead of just the first one):

dat <- html %>% html_elements(".article-listing a")
dat
{xml_nodeset (100)}
 [1] <a href="https://trace.tennessee.edu/utk_graddiss/10400">The Sons of Mel ...
 [2] <a href="https://trace.tennessee.edu/utk_graddiss/10081">Checking the Bo ...
 [3] <a href="https://trace.tennessee.edu/utk_graddiss/10401">Data-Driven Mod ...
 [4] <a href="https://trace.tennessee.edu/utk_graddiss/10082">Computational S ...
 [5] <a href="https://trace.tennessee.edu/utk_graddiss/10424">Multi-Proxy Stu ...
 [6] <a href="https://trace.tennessee.edu/utk_graddiss/10402">Mechanisms of G ...
 [7] <a href="https://trace.tennessee.edu/utk_graddiss/10425">Applying Minori ...
 [8] <a href="https://trace.tennessee.edu/utk_graddiss/10426">Alchemy of the  ...
 [9] <a href="https://trace.tennessee.edu/utk_graddiss/10083">Topics in Energ ...
[10] <a href="https://trace.tennessee.edu/utk_graddiss/10427">Unraveling the  ...
[11] <a href="https://trace.tennessee.edu/utk_graddiss/10084">Someone to Keep ...
[12] <a href="https://trace.tennessee.edu/utk_graddiss/10085">Perceptions of  ...
[13] <a href="https://trace.tennessee.edu/utk_graddiss/10403">The Influence o ...
[14] <a href="https://trace.tennessee.edu/utk_graddiss/10086">Motor Control Q ...
[15] <a href="https://trace.tennessee.edu/utk_graddiss/10428">Laser Directed  ...
[16] <a href="https://trace.tennessee.edu/utk_graddiss/10087">Music to their  ...
[17] <a href="https://trace.tennessee.edu/utk_graddiss/10429">CodonT5: A Mult ...
[18] <a href="https://trace.tennessee.edu/utk_graddiss/10088">Who Benefits fr ...
[19] <a href="https://trace.tennessee.edu/utk_graddiss/10089">Applying a One  ...
[20] <a href="https://trace.tennessee.edu/utk_graddiss/10430">The World of Lé ...
...
typeof(dat)
[1] "list"
length(dat)
[1] 100
typeof(dat[[1]])
[1] "list"

We now have a list of lists.

As we did for a single URL in the test run, we now want to extract all the URLs. We will do this using a loop.

Before running a for loop, it is important to initialize an empty container of the final size: this is much more efficient than growing the result at each iteration.

So let’s initialize an empty list that we call list_urls of the appropriate size:

list_urls <- vector("list", length(dat))

Now we can run a loop to fill in our list:

for (i in seq_along(dat)) {
  list_urls[[i]] <- dat[[i]] %>% html_attr("href")
}

Let’s print the first element of list_urls to make sure all looks good:

list_urls[[1]]
[1] "https://trace.tennessee.edu/utk_graddiss/10400"

We now have a list of URLs (in the form of character vectors) as we wanted.
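
As an aside, html_attr() is vectorized over node sets, so the loop above could be replaced by a single expression. Note that this returns a character vector rather than a list:

# Same URLs, extracted without an explicit loop
urls <- html %>%
  html_elements(".article-listing a") %>%
  html_attr("href")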

Extract data from each page

We will now extract the data (date, major, and advisor) for all URLs in our list.

Again, before running a for loop, we need to allocate memory first by creating an empty container (here a list):

list_data <- vector("list", length(list_urls))

We move the code we tested on a single URL inside a loop, adding one result to the list_data list at each iteration until all 100 dissertation pages have been scraped. Because quite a few of us are running the code at the same time, we do not want the site to block our requests. To play it safe, we add a little delay (0.1 second) at each iteration (many sites will block requests if they come too frequently):

for (i in seq_along(list_urls)) {
  html <- read_html(list_urls[[i]])
  date <- html %>%
    html_element("#publication_date p") %>%
    html_text2()
  major <- html %>%
    html_element("#department p") %>%
    html_text2()
  advisor <- html %>%
    html_element("#advisor1 p") %>%
    html_text2()
  Sys.sleep(0.1)  # Add a little delay
  list_data[[i]] <- cbind(date, major, advisor)
}

Let’s make sure all looks good by printing the first element of list_data:

list_data[[1]]
     date     major     advisor         
[1,] "8-2024" "History" "Jay Rubenstein"

Store results in a dataframe

We can turn this big list into a dataframe. do.call() calls rbind.data.frame() with the elements of list_data as its arguments, stacking our 100 one-row matrices into a single dataframe:

result <- do.call(rbind.data.frame, list_data)

result is a long dataframe, so we will only print the first few rows:

head(result)
    date                                           major              advisor
1 8-2024                                         History       Jay Rubenstein
2 5-2024                         Business Administration      Larry A. Fauver
3 8-2024                          Electrical Engineering           Dan Wilson
4 5-2024                            Chemical Engineering       Steven M. Abel
5 8-2024                                       Geography         Sally P Horn
6 8-2024 Biochemistry and Cellular and Molecular Biology Dr. Mariano Labrador

If you like the tidyverse, you can turn it into a tibble:

result <- result %>% tibble::as_tibble()

The notation tibble::as_tibble() means that we are using the function as_tibble() from the package tibble. A tibble is the tidyverse version of a dataframe. One advantage is that it will only print the first 10 rows by default instead of printing the whole dataframe, so you don’t have to use head() when printing long dataframes:

result
# A tibble: 100 × 3
   date   major                                           advisor             
   <chr>  <chr>                                           <chr>               
 1 8-2024 History                                         Jay Rubenstein      
 2 5-2024 Business Administration                         Larry A. Fauver     
 3 8-2024 Electrical Engineering                          Dan Wilson          
 4 5-2024 Chemical Engineering                            Steven M. Abel      
 5 8-2024 Geography                                       Sally P Horn        
 6 8-2024 Biochemistry and Cellular and Molecular Biology Dr. Mariano Labrador
 7 8-2024 Psychology                                      Kristina C. Gordon  
 8 8-2024 Counselor Education                             Joel F. Diambra     
 9 5-2024 Economics                                       James Scott Holladay
10 8-2024 Chemistry                                       Thanh D. Do         
# ℹ 90 more rows

We can capitalize the headers:

names(result) <- c("Date", "Major", "Advisor")

This is what our final result looks like:

result
# A tibble: 100 × 3
   Date   Major                                           Advisor             
   <chr>  <chr>                                           <chr>               
 1 8-2024 History                                         Jay Rubenstein      
 2 5-2024 Business Administration                         Larry A. Fauver     
 3 8-2024 Electrical Engineering                          Dan Wilson          
 4 5-2024 Chemical Engineering                            Steven M. Abel      
 5 8-2024 Geography                                       Sally P Horn        
 6 8-2024 Biochemistry and Cellular and Molecular Biology Dr. Mariano Labrador
 7 8-2024 Psychology                                      Kristina C. Gordon  
 8 8-2024 Counselor Education                             Joel F. Diambra     
 9 5-2024 Economics                                       James Scott Holladay
10 8-2024 Chemistry                                       Thanh D. Do         
# ℹ 90 more rows

Save results to file

As a final step, we will save our data to a CSV file:

write.csv(result, "dissertations_data.csv", row.names = FALSE)

Functions recap

Below is a recapitulation of the rvest functions we have used today:

Function           Usage
read_html()        Read in HTML from a URL
html_element()     Extract the first matching element
html_elements()    Extract all matching elements
html_attr()        Extract the value of an attribute
html_text2()       Extract text
