Web scraping with R


Marie-Hélène Burle

The internet is a trove of information. A lot of it is publicly available and thus suitable for use in research. Extracting that information and putting it in an organized format for analysis can, however, be extremely tedious. Web scraping tools allow us to automate parts of that process, and R is a popular language for the task.

In this workshop, we will guide you through a simple example using the package rvest.

Running R

For this workshop, we will use a temporary RStudio server.

To access it, go to the website given during the workshop and sign in using the username and password you will be given (you can ignore the OTP entry).

This will take you to our JupyterHub. There, click on the “RStudio” button and our RStudio server will open in a new tab.

Our RStudio server already has the two packages that we will be using installed (rvest and tibble). If you want to run the code on your machine, you need to install them with install.packages() first.


HTML and CSS

HyperText Markup Language (HTML) is the standard markup language for websites: it encodes the information related to the formatting and structure of webpages. Additionally, some of the styling can be stored in Cascading Style Sheets (CSS) files.

HTML uses tags of the form:

<some_tag>Your content</some_tag>

Some tags have attributes:

<some_tag attribute_name="attribute value">Your content</some_tag>


Examples of tags marking the structure of a page:

  • <h2>This is a heading of level 2</h2>
  • <p>This is a paragraph</p>

Examples of tags marking formatting within the text:

  • <b>This is bold</b>
  • <a href="https://some.url">This is the text for a link</a>
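To see how such tags look once parsed, here is a minimal sketch using the rvest package that we will introduce below (the snippet itself is made up for illustration):

```r
library(rvest)

# A made-up snippet combining the tags shown above
snippet <- '<h2>This is a heading of level 2</h2>
            <p>This is <b>bold</b> and
            <a href="https://some.url">This is the text for a link</a></p>'

page <- read_html(snippet)                       # parse the HTML string

html_element(page, "h2") %>% html_text2()        # text content of the <h2> tag
html_element(page, "a") %>% html_attr("href")    # value of the href attribute
```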

Web scraping

Web scraping is a general term for a set of tools that automate the extraction of data from websites.

While most of the data on the internet is publicly available, it is illegal to scrape some sites and you should always look into the policy of a site before attempting to scrape it. Some sites will also block you if you submit too many requests in a short amount of time, so if you plan on scraping sites at a fairly large scale, you should look into the polite package which will help you scrape responsibly.



We will use a website from the University of Tennessee containing a database of PhD theses from that university.

Our goal is to scrape data from this site to produce a dataframe with the date, major, and principal investigator (PI) for each dissertation.

We will only do this for the first page, which contains the links to the 100 most recent theses. If you really wanted to gather all the data, you would have to repeat the process for every page.

Let’s look at the sites

First of all, let’s have a close look at the websites we want to scrape to think carefully about what we want to do. Before starting to write code, it is always a good idea to think about what you are trying to achieve with your code.

To create a dataframe with the data for all the dissertations on that first page, we need to do two things:

  • Step 1: from the dissertations database first page, we want to scrape the list of URLs for the dissertation pages.

  • Step 2: once we have the URLs, we want to scrape those pages too to get the date, major, and principal investigator (PI) for each dissertation.


To do all this, we will use the package rvest, part of the tidyverse (a modern set of R packages). It is a package influenced by the popular Python package Beautiful Soup and it makes scraping websites with R really easy.

Let’s load it:

library(rvest)

Read in HTML data from the main site

As mentioned above, our site is the database of PhD dissertations from the University of Tennessee.

Let’s create a character vector with the URL:

url <- "https://trace.tennessee.edu/utk_graddiss/index.html"

First, we read in the html data from that page:

html <- read_html(url)

Let’s have a look at the raw data:

{html_document}
<html lang="en">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body>\n<!-- FILE /srv/sequoia/main/data/trace.tennessee.edu/assets/heade ...

Test run

Identify the relevant HTML markers

The html code for this webpage contains the data we are interested in, but it is mixed in with a lot of HTML formatting information and data we aren’t interested in. We need to extract it and turn it into a workable format.

The first step is to find the CSS elements that contain the data we want. For this, you can use a web inspector or—even easier—the SelectorGadget, a JavaScript bookmarklet built by Andrew Cantino.

To use this tool, a simple option is to go to the SelectorGadget website and drag the link of the bookmarklet to your bookmarks bar.

Now, go to the dissertations database first page and click on the bookmarklet in your bookmarks bar. A floating box appears at the bottom of your screen, and an orange rectangle surrounds each element of the page you hover over with your mouse. Click on one of the dissertation links: an a now appears in the box at the bottom, along with the number of elements selected. The selected elements are highlighted in yellow.

As you can see, all the links we want are selected. However, there are many other links we don’t want that are also highlighted. To remove those, hover on any of the links we don’t want. You will see a red rectangle around that unwanted link. Click on it: now, only the links we want (those that lead to the dissertation information pages) are highlighted in yellow and the count of selected elements in the bottom floating box is down to 100 (which makes sense since this site has 100 entries per page).

In the main section of the floating box, you can now see: .article-listing a. This means that the data we want are under the HTML elements .article-listing a (the CSS class .article-listing and the tag a).
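The selector .article-listing a uses the CSS descendant combinator: it matches a tags nested inside an element of class article-listing. A minimal sketch on made-up HTML (the real page is more complex) shows why unrelated links drop out:

```r
library(rvest)

# Made-up HTML loosely mimicking the structure of the database page
page <- read_html('
  <div class="article-listing">
    <a href="https://example.com/diss1">Dissertation 1</a>
  </div>
  <div class="article-listing">
    <a href="https://example.com/diss2">Dissertation 2</a>
  </div>
  <a href="https://example.com/about">About this site</a>')

# Only the two links nested inside .article-listing elements match
html_elements(page, ".article-listing a")
```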

Extract a test URL

It is a good idea to test things out on a single element before doing a massive batch scraping of a site, so let’s test our method for the first dissertation.

To start, we need to extract the first URL. The function html_element() from the package rvest extracts the first element matching a CSS selector, passed as a character string. Let’s pass to this function our html object and the string ".article-listing a" and assign the result to an object that we will call test:

test <- html %>% html_element(".article-listing a")

%>% is the pipe from the tidyverse package magrittr. It passes the output of the left-hand side expression as the first argument of the right-hand side expression. We could have written this as:

test <- html_element(html, ".article-listing a")
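The same substitution works with any function; a couple of toy examples (base R values only) make the rule x %>% f(y) ≡ f(x, y) concrete:

```r
library(magrittr)  # provides %>% (it is also re-exported by rvest)

# x %>% f() is the same as f(x)
c(4, 9, 16) %>% sqrt()                  # same as sqrt(c(4, 9, 16))

# x %>% f(y) is f(x, y): x becomes the *first* argument of f
c(4, 9, 16) %>% paste(collapse = "-")   # same as paste(c(4, 9, 16), collapse = "-")
```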

Our new object is a list:

[1] "list"

Let’s print it:

<a href="https://trace.tennessee.edu/utk_graddiss/7600">

The URL is in there, so we successfully extracted the correct element, but we need to do more cleaning.

a is one of the HTML tags that have an attribute (href) as you can see when you print test. It is actually the value of that attribute that we want. To extract an attribute value, we use the function html_attr():

url_test <- test %>% html_attr("href")
[1] "https://trace.tennessee.edu/utk_graddiss/7600"

This is our URL.

 chr "https://trace.tennessee.edu/utk_graddiss/7600"

It is saved in a character vector, which is perfect.

Instead of creating the intermediate objects html and test, we could have chained the functions:

url_test <- read_html(url) %>%
  html_element(".article-listing a") %>%
  html_attr("href")

Read in HTML data for our test URL

Now that we have the URL for the first dissertation information page, we want to extract the date, major, and principal investigator (PI) for that dissertation.

We just saw that url_test is a character vector representing a URL. We know how to deal with this.

The first thing to do—as we did earlier with the database site—is to read in the html data. Let’s assign it to a new object that we will call html_test:

html_test <- read_html(url_test)
<html lang="en">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body>\n<!-- FILE /srv/sequoia/main/data/trace.tennessee.edu/assets/heade ...

Get data for our test URL

Now, we want to extract the publication date. Thanks to the SelectorGadget, following the method we saw earlier, we can see that we now need the element marked by #publication_date p.

We start by extracting the data as we did earlier, passing our object html_test and the string "#publication_date p" to html_element().

While earlier we wanted the value of a tag attribute (i.e. part of the metadata), here we want the actual text (i.e. part of the actual content). To extract text from a snippet of HTML, we pass it to html_text2().

Let’s run both operations at once to save the creation of an intermediate object:

date_test <- html_test %>%
  html_element("#publication_date p") %>%
  html_text2()

Note the difference with what we did earlier to extract the URL: if we had used html_text2() then we would have gotten the text part of the link ("The Novel Chlorination of Zirconium Metal and Its Application to a Recycling Protocol for Zircaloy Cladding from Spent Nuclear Fuel Rods") rather than the URL ("https://trace.tennessee.edu/utk_graddiss/7600").
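The contrast between the two extraction functions is easy to demonstrate on a made-up link (URL and text invented for illustration):

```r
library(rvest)

link <- read_html('<a href="https://some.url">This is the text for a link</a>') %>%
  html_element("a")

link %>% html_attr("href")   # the metadata: the value of the href attribute
link %>% html_text2()        # the content: the text between the tags
```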

Let’s verify that our date object indeed contains the date:

[1] "5-2023"

We also want the major for this thesis. The SelectorGadget allows us to find that this time, it is the #department p element that we need. Let’s extract it in the same fashion:

major_test <- html_test %>%
  html_element("#department p") %>%
  html_text2()
[1] "Chemistry"

And for the PI, we need the #advisor1 p element:

pi_test <- html_test %>%
  html_element("#advisor1 p") %>%
  html_text2()
[1] "Craig E. Barnes"

Your turn:

Try using the SelectorGadget to identify the element necessary to extract the abstract of this dissertation.

Now, write the code to extract it and make sure you actually get what you want.

We now have the date, major, and PI for the first dissertation. We can create a matrix by passing them as arguments to cbind():

result_test <- cbind(date_test, major_test, pi_test)
     date_test major_test  pi_test          
[1,] "5-2023"  "Chemistry" "Craig E. Barnes"

Full run

Extract all URLs

Now that we have tested our code on the first dissertation, we can apply it to all 100 dissertations listed on the first page of the database.

Instead of using html_element(), this time we will use html_elements() which extracts all matching elements (instead of just the first one):

dat <- html %>% html_elements(".article-listing a")
{xml_nodeset (100)}
 [1] <a href="https://trace.tennessee.edu/utk_graddiss/7600">The Novel Chlori ...
 [2] <a href="https://trace.tennessee.edu/utk_graddiss/7714">Tethered Axial C ...
 [3] <a href="https://trace.tennessee.edu/utk_graddiss/7592">Novel Mixed Inte ...
 [4] <a href="https://trace.tennessee.edu/utk_graddiss/7146">IMPROVED AND SUS ...
 [5] <a href="https://trace.tennessee.edu/utk_graddiss/7086">Model Based Forc ...
 [6] <a href="https://trace.tennessee.edu/utk_graddiss/7603">Oscillation Anal ...
 [7] <a href="https://trace.tennessee.edu/utk_graddiss/7091">A STUDY OF THE E ...
 [8] <a href="https://trace.tennessee.edu/utk_graddiss/7123">Troya Victa : Em ...
 [9] <a href="https://trace.tennessee.edu/utk_graddiss/7699">INTRA-SKELETAL V ...
[10] <a href="https://trace.tennessee.edu/utk_graddiss/7687">Investigation of ...
[11] <a href="https://trace.tennessee.edu/utk_graddiss/7237">Elucidating mamm ...
[12] <a href="https://trace.tennessee.edu/utk_graddiss/7651">ANALYSIS OF PHYS ...
[13] <a href="https://trace.tennessee.edu/utk_graddiss/7094">Bacterial Commun ...
[14] <a href="https://trace.tennessee.edu/utk_graddiss/7257">Resituando El Cu ...
[15] <a href="https://trace.tennessee.edu/utk_graddiss/7158">Disrupting the C ...
[16] <a href="https://trace.tennessee.edu/utk_graddiss/7416">Leader Type and  ...
[17] <a href="https://trace.tennessee.edu/utk_graddiss/7736">Content External ...
[18] <a href="https://trace.tennessee.edu/utk_graddiss/7204">ONE SIZE DOES NO ...
[19] <a href="https://trace.tennessee.edu/utk_graddiss/7139">Making Sense of  ...
[20] <a href="https://trace.tennessee.edu/utk_graddiss/7108">Nodulin 26 like  ...
[1] "list"
[1] 100
[1] "list"

These three outputs show the type of dat ("list"), its length (100), and the type of its first element (again a list): we now have a list of lists.

As we did for a single URL in the test run, we now want to extract all the URLs. We will do this using a loop.

Before running a for loop, it is important to initialize an empty container of the appropriate size: this is much more efficient than growing the result at each iteration.
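The two patterns give identical results; the difference is that growing a list forces R to enlarge (and often copy) it at every iteration, which becomes very slow for large results. A small sketch of both:

```r
n <- 5

# Growing the result: the list is enlarged at each iteration
grown <- list()
for (i in 1:n) grown[[i]] <- i^2

# Preallocating: the container is created once with its final length, then filled
prealloc <- vector("list", n)
for (i in 1:n) prealloc[[i]] <- i^2

identical(grown, prealloc)  # same result; preallocation scales much better
```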

So let’s initialize an empty list that we call list_urls of the appropriate size:

list_urls <- vector("list", length(dat))

Now we can run a loop to fill in our list:

for (i in seq_along(dat)) {
  list_urls[[i]] <- dat[[i]] %>% html_attr("href")
}

Let’s print the first element of list_urls to make sure all looks good:

[1] "https://trace.tennessee.edu/utk_graddiss/7600"

We now have a list of URLs (in the form of character vectors) as we wanted.

Get the data from the list of URLs

We will now extract the data (date, major, and PI) for all URLs in our list.

Again, before running a for loop, we need to allocate memory first by creating an empty container (here a list):

list_data <- vector("list", length(list_urls))

We move the code we tested for a single URL inside a loop, adding one result to the list_data list at each iteration until all 100 dissertation pages have been scraped. Because there are quite a few of us running the code at the same time, we don’t want the site to block our requests. To play it safe, we will add a little delay (0.1 second) at each iteration (many sites will block requests that come too frequently):

for (i in seq_along(list_urls)) {
  html <- read_html(list_urls[[i]])
  date <- html %>%
    html_element("#publication_date p") %>%
    html_text2()
  major <- html %>%
    html_element("#department p") %>%
    html_text2()
  pi <- html %>%
    html_element("#advisor1 p") %>%
    html_text2()
  Sys.sleep(0.1)  # add a little delay
  list_data[[i]] <- cbind(date, major, pi)
}

Let’s make sure all looks good by printing the first element of list_data:

     date     major       pi               
[1,] "5-2023" "Chemistry" "Craig E. Barnes"

We can turn this big list into a dataframe:

result <- do.call(rbind.data.frame, list_data)

result is a long dataframe, so we will only print the first few elements:

     date                                      major                  pi
1  5-2023                                  Chemistry     Craig E. Barnes
2 12-2022                                  Chemistry Dr. Ampofo K. Darko
3 12-2022                     Industrial Engineering     James Ostrowski
4  5-2022 Entomology, Plant Pathology and Nematology       Heather Kelly
5  5-2022                     Mechanical Engineering     Caleb D. Rucker
6 12-2022                     Electrical Engineering            Yilu Liu
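What do.call() does here can be seen on a toy list of two one-row matrices (the values are invented): do.call(rbind.data.frame, x) is equivalent to calling rbind.data.frame() with the elements of x as its arguments.

```r
# Toy stand-in for list_data: one-row character matrices, as returned by cbind()
toy_list <- list(
  cbind(date = "5-2023",  major = "Chemistry", pi = "A. Advisor"),
  cbind(date = "12-2022", major = "History",   pi = "B. Advisor")
)

# Equivalent to rbind.data.frame(toy_list[[1]], toy_list[[2]])
toy_result <- do.call(rbind.data.frame, toy_list)
toy_result
```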

If you like the tidyverse, you can turn it into a tibble:

result <- result %>% tibble::as_tibble()

The notation tibble::as_tibble() means that we are using the function as_tibble() from the package tibble. A tibble is the tidyverse version of a dataframe. One advantage is that it will only print the first 10 rows by default instead of printing the whole dataframe, so you don’t have to use head() when printing long dataframes:

# A tibble: 100 × 3
   date    major                                      pi                   
   <chr>   <chr>                                      <chr>                
 1 5-2023  Chemistry                                  Craig E. Barnes      
 2 12-2022 Chemistry                                  Dr. Ampofo K. Darko  
 3 12-2022 Industrial Engineering                     James Ostrowski      
 4 5-2022  Entomology, Plant Pathology and Nematology Heather Kelly        
 5 5-2022  Mechanical Engineering                     Caleb D. Rucker      
 6 12-2022 Electrical Engineering                     Yilu Liu             
 7 5-2022  Comparative and Experimental Medicine      Brian K. Whitlock    
 8 5-2022  History                                    Jay Rubenstein       
 9 12-2022 Anthropology                               Dawnie W. Steadman   
10 12-2022 Mechanical Engineering                     Stephanie C. TerMaath
# ℹ 90 more rows

We can capitalize the headers:

names(result) <- c("Date", "Major", "PI")

This is what our final result looks like:

# A tibble: 100 × 3
   Date    Major                                      PI                   
   <chr>   <chr>                                      <chr>                
 1 5-2023  Chemistry                                  Craig E. Barnes      
 2 12-2022 Chemistry                                  Dr. Ampofo K. Darko  
 3 12-2022 Industrial Engineering                     James Ostrowski      
 4 5-2022  Entomology, Plant Pathology and Nematology Heather Kelly        
 5 5-2022  Mechanical Engineering                     Caleb D. Rucker      
 6 12-2022 Electrical Engineering                     Yilu Liu             
 7 5-2022  Comparative and Experimental Medicine      Brian K. Whitlock    
 8 5-2022  History                                    Jay Rubenstein       
 9 12-2022 Anthropology                               Dawnie W. Steadman   
10 12-2022 Mechanical Engineering                     Stephanie C. TerMaath
# ℹ 90 more rows

Functions recap

Below is a recapitulation of the rvest functions we have used today:

Function          Usage
read_html()       Read in HTML from a URL
html_element()    Extract the first matching element
html_elements()   Extract all matching elements
html_attr()       Extract the value of an attribute
html_text2()      Extract text
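To tie these together, here is a self-contained sketch exercising all five functions on a made-up HTML string (no network access needed; the structure only loosely mimics the dissertation site):

```r
library(rvest)

page <- read_html('
  <div class="article-listing"><a href="https://example.com/1">Thesis 1</a></div>
  <div class="article-listing"><a href="https://example.com/2">Thesis 2</a></div>
  <div id="publication_date"><p>5-2023</p></div>')

html_elements(page, ".article-listing a")                          # all matching links
page %>% html_element(".article-listing a") %>% html_attr("href")  # first URL only
page %>% html_element("#publication_date p") %>% html_text2()      # the date text
```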