Web scraping with Python

Author

Marie-Hélène Burle

The internet is a trove of information. A lot of it is publicly available and thus suitable for use in research. Extracting that information and putting it in an organized format for analysis can however be extremely tedious.

Web scraping tools make it possible to automate parts of that process, and Python is a popular language for the task.

In this workshop, I will guide you through a simple example using the package Beautiful Soup.

HTML and CSS

HyperText Markup Language (HTML) is the standard markup language for websites: it encodes the information related to the formatting and structure of webpages. Additionally, some of the customization can be stored in Cascading Style Sheets (CSS) files.

HTML uses tags of the form:

<some_tag>Your content</some_tag>

Some tags have attributes:

<some_tag attribute_name="attribute value">Your content</some_tag>

Examples:

Site structure:

  • <h2>This is a heading of level 2</h2>
  • <p>This is a paragraph</p>

Formatting:

  • <b>This is bold</b>
  • <a href="https://some.url">This is the text for a link</a>

Web scraping

Web scraping is a general term for a set of tools that extract data from the web automatically.

While most of the data on the internet is publicly available, it is illegal to scrape some sites and you should always look into the policy of a site before attempting to scrape it. Some sites will also block you if you submit too many requests in a short amount of time, so remember to scrape responsibly.

Example for this workshop

We will use a website from the University of Tennessee containing a database of PhD theses from that university.

Our goal is to scrape data from this site to produce a dataframe with the date, major, and advisor for each dissertation.

We will only do this for the first page which contains the links to the 100 most recent theses. If you really wanted to gather all the data, you would have to do this for all pages.

Let’s look at the sites

First of all, let’s have a close look at the websites we want to scrape. Before writing any code, it is always a good idea to think carefully about what you are trying to achieve.

To create a dataframe with the data for all the dissertations on that first page, we need to do two things:

  • Step 1: from the dissertations database first page, we want to scrape the list of URLs for the dissertation pages.

  • Step 2: once we have the URLs, we want to scrape those pages too to get the date, major, and advisor for each dissertation.

Load packages

Let’s load the packages that will make scraping websites with Python easier:

import requests                 # To download the html data from a site
from bs4 import BeautifulSoup   # To parse the html data
import time                     # To add a delay between requests
import pandas as pd             # To store our data in a DataFrame

Send request to the main site

As mentioned above, our site is the database of PhD dissertations from the University of Tennessee.

Let’s create a string with the URL:

url = "https://trace.tennessee.edu/utk_graddiss/index.html"

First, we send a request to that URL and save the response in a variable called r:

r = requests.get(url)

Let’s see what our response looks like:

r
<Response [200]>

If you look in the list of HTTP status codes, you can see that a response with a code of 200 means that the request was successful.
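You can also check the status programmatically rather than by eye. For example (an optional check, not part of the original workflow):

r.status_code           # the numeric status code, e.g. 200
r.ok                    # True if the status code is below 400
r.raise_for_status()    # raises an exception if the request failed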

Explore the raw data

To get the actual content of the response as unicode (text), we can use the text property of the response. This will give us the raw HTML markup from the webpage.

Let’s print the first 200 characters:

print(r.text[:200])

<!DOCTYPE html>
<html lang="en">
<head><!-- inj yui3-seed: --><script type='text/javascript' src='//cdnjs.cloudflare.com/ajax/libs/yui/3.6.0/yui/yui-min.js'></script><script type='text/javascript' sr

Parse the data

The package Beautiful Soup transforms (parses) such HTML data into a parse tree, which will make extracting information easier.

Let’s create an object called mainpage with the parse tree:

mainpage = BeautifulSoup(r.text, "html.parser")

html.parser is the name of the parser we are using here. Specifying a parser explicitly gives consistent results across environments.
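Other parsers exist. For instance, if the third-party lxml package is installed, you could use it instead (shown here only as an alternative; we keep html.parser in this workshop):

mainpage = BeautifulSoup(r.text, "lxml")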

We can print the beginning of the parsed result:

print(mainpage.prettify()[:200])
<!DOCTYPE html>
<html lang="en">
 <head>
  <!-- inj yui3-seed: -->
  <script src="//cdnjs.cloudflare.com/ajax/libs/yui/3.6.0/yui/yui-min.js" type="text/javascript">
  </script>
  <script src="//ajax.g

The prettify method turns the BeautifulSoup object we created into a string (which is needed for slicing).

It doesn’t look any clearer to us, but it is now in a format that Beautiful Soup can work with.

For instance, we can get the HTML segment containing the title with three methods:

  • using the title tag name:
mainpage.title
<title>
Doctoral Dissertations | Graduate School | University of Tennessee, Knoxville
</title>
  • using find to look for HTML markers (tags, attributes, etc.):
mainpage.find("title")
<title>
Doctoral Dissertations | Graduate School | University of Tennessee, Knoxville
</title>
  • using select which accepts CSS selectors:
mainpage.select("title")
[<title>
 Doctoral Dissertations | Graduate School | University of Tennessee, Knoxville
 </title>]

find only returns the first matching element, while find_all returns all of them. select also returns all matching elements. Which one you choose depends on what you need to extract; there are often several ways to get there.

Here are other examples of data extraction:

mainpage.head
<head><!-- inj yui3-seed: --><script src="//cdnjs.cloudflare.com/ajax/libs/yui/3.6.0/yui/yui-min.js" type="text/javascript"></script><script src="//ajax.googleapis.com/ajax/libs/jquery/1.10.2/jquery.min.js" type="text/javascript"></script><!-- Adobe Analytics --><script src="https://assets.adobedtm.com/4a848ae9611a/d0e96722185b/launch-d525bb0064d8.min.js" type="text/javascript"></script><!-- Cookies --><link href="//cdnjs.cloudflare.com/ajax/libs/cookieconsent2/3.0.3/cookieconsent.min.css" rel="stylesheet" type="text/css"/><script src="//cdnjs.cloudflare.com/ajax/libs/cookieconsent2/3.0.3/cookieconsent.min.js" type="text/javascript"></script><script src="/assets/nr_browser_production.js" type="text/javascript"></script>
<!-- def.1 -->
<meta charset="utf-8"/>
<meta content="width=device-width" name="viewport"/>
<title>
Doctoral Dissertations | Graduate School | University of Tennessee, Knoxville
</title>
<!-- FILE meta-tags.inc --><!-- FILE: /srv/sequoia/main/data/assets/site/meta-tags.inc -->
<!-- FILE: meta-tags.inc (cont) -->
<!-- sh.1 -->
<link href="/ir-style.css" media="screen" rel="stylesheet" type="text/css"/>
<link href="/ir-custom.css" media="screen" rel="stylesheet" type="text/css"/>
<link href="ir-custom.css" media="screen" rel="stylesheet" type="text/css"/>
<link href="/ir-local.css" media="screen" rel="stylesheet" type="text/css"/>
<link href="ir-local.css" media="screen" rel="stylesheet" type="text/css"/>
<link href="/ir-print.css" media="print" rel="stylesheet" type="text/css"/>
<link href="/assets/floatbox/floatbox.css" rel="stylesheet" type="text/css"/>
<link href="/recent.rss" rel="alternate" title="Site Feed" type="application/rss+xml"/>
<link href="/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
<!--[if IE]>
<link rel="stylesheet" href="/ir-ie.css" type="text/css" media="screen">
<![endif]-->
<!-- JS -->
<script src="/assets/jsUtilities.js" type="text/javascript"></script>
<script src="/assets/footnoteLinks.js" type="text/javascript"></script>
<script src="/assets/scripts/yui-init.pack.js" type="text/javascript"></script>
<script src="/assets/scripts/bepress-init.pack.js" type="text/javascript"></script>
<script src="/assets/scripts/JumpListYUI.pack.js" type="text/javascript"></script>
<!-- end sh.1 -->
<script type="text/javascript">var pageData = {"page":{"environment":"prod","productName":"bpdg","language":"en","name":"ir_etd","businessUnit":"els:rp:st"},"visitor":{}};</script>
</head>
mainpage.a
<a data-scroll="" href="https://trace.tennessee.edu" title="Home">Home</a>
mainpage.find_all("a")[:5]
[<a data-scroll="" href="https://trace.tennessee.edu" title="Home">Home</a>,
 <a data-scroll="" href="https://trace.tennessee.edu/do/search/advanced/" title="Search"><i class="icon-search"></i> Search</a>,
 <a data-scroll="" href="https://trace.tennessee.edu/communities.html" title="Browse">Browse Collections</a>,
 <a data-scroll="" href="/cgi/myaccount.cgi?context=utk_graddiss" title="My Account">My Account</a>,
 <a data-scroll="" href="https://trace.tennessee.edu/about.html" title="About">About</a>]
mainpage.select("a")[:5]
[<a data-scroll="" href="https://trace.tennessee.edu" title="Home">Home</a>,
 <a data-scroll="" href="https://trace.tennessee.edu/do/search/advanced/" title="Search"><i class="icon-search"></i> Search</a>,
 <a data-scroll="" href="https://trace.tennessee.edu/communities.html" title="Browse">Browse Collections</a>,
 <a data-scroll="" href="/cgi/myaccount.cgi?context=utk_graddiss" title="My Account">My Account</a>,
 <a data-scroll="" href="https://trace.tennessee.edu/about.html" title="About">About</a>]

Test run

Identify relevant markers

The HTML code for this webpage contains the data we are interested in, but it is mixed in with a lot of markup and data we don’t care about. We need to extract the data relevant to us and turn it into a workable format.

The first step is to find the HTML markers that contain our data. One option is to use a web inspector or—even easier—the SelectorGadget, a JavaScript bookmarklet built by Andrew Cantino.

To use this tool, go to the SelectorGadget website and drag the link of the bookmarklet to your bookmarks bar.

Now, go to the dissertations database first page and click on the bookmarklet in your bookmarks bar. You will see a floating box at the bottom of your screen. As you move your mouse across the screen, an orange rectangle appears around each element over which you pass.

Click on one of the dissertation links: now, there is an a appearing in the box at the bottom as well as the number of elements selected. The selected elements are highlighted in yellow. Those elements are links (in HTML, a tags define hyperlinks).

As you can see, all the links we want are selected. However, there are many other links we don’t want that are also highlighted. In fact, all links in the document are selected. We need to remove the categories of links that we don’t want. To do this, hover above any of the links we don’t want. You will see a red rectangle around it. Click on it: now all similar links are gone. You might have to do this a few times until only the relevant links (i.e. those that lead to the dissertation information pages) remain highlighted.

As there are 100 such links per page, the count of selected elements in the bottom floating box should be down to 100.

In the main section of the floating box, you can now see: .article-listing a. This means that the data we want are under the HTML elements .article-listing a (the class .article-listing and the tag a).

Extract test URL

It is a good idea to test things out on a single element before doing a massive batch scraping of a site, so let’s test our method for the first dissertation.

To start, we need to extract the first URL. Here, we will use CSS selectors (we could get there with find too). mainpage.select(".article-listing a") would give us all the results (100 links):

len(mainpage.select(".article-listing a"))
100

To get the first one, we index it:

mainpage.select(".article-listing a")[0]
<a href="https://trace.tennessee.edu/utk_graddiss/8076">Understanding host-microbe interactions in maize kernel and sweetpotato leaf metagenomic profiles.</a>

The actual URL is contained in the href attribute. Attributes can be extracted with the get method:

mainpage.select(".article-listing a")[0].get("href")
'https://trace.tennessee.edu/utk_graddiss/8076'

We now have our URL as a string. We can double-check that it is indeed a string:

type(mainpage.select(".article-listing a")[0].get("href"))
str

This is exactly what we need to send a request to that site, so let’s create an object url_test with it:

url_test = mainpage.select(".article-listing a")[0].get("href")

We have our first thesis URL:

print(url_test)
https://trace.tennessee.edu/utk_graddiss/8076

Send request to test URL

Now that we have the URL for the first dissertation information page, we want to extract the date, major, and advisor for that dissertation.

The first thing to do—as we did earlier with the database site—is to send a request to that page. Let’s assign it to a new object that we will call r_test:

r_test = requests.get(url_test)

Then we can parse it with Beautiful Soup (as we did before). Let’s create a dissertpage_test object:

dissertpage_test = BeautifulSoup(r_test.text, "html.parser")

Get data for test URL

It is time to extract the publication date, major, and advisor for our test URL.

Let’s start with the date. Thanks to the SelectorGadget, following the method we saw earlier, we can see that we now need elements marked by #publication_date p.

We can use select as we did earlier:

dissertpage_test.select("#publication_date p")
[<p>5-2023</p>]

Notice the square brackets around our result: this is important. It shows us that we have a ResultSet (a list of results specific to Beautiful Soup). This is because select returns all the results. Here, we have a single result, but the format is still list-like. Before we can go further, we need to index the value out of it:

dissertpage_test.select("#publication_date p")[0]
<p>5-2023</p>

We can now get the text out of this paragraph with the text attribute:

dissertpage_test.select("#publication_date p")[0].text
'5-2023'

We could save it in a variable date_test:

date_test = dissertpage_test.select("#publication_date p")[0].text

Your turn:

Get the major and advisor for our test URL.

Full run

Once everything is working for a test site, we can do some bulk scraping.

Extract all URLs

We already know how to get the 100 dissertations links from the main page: mainpage.select(".article-listing a"). Let’s assign it to a variable:

dissertlinks = mainpage.select(".article-listing a")

This ResultSet is an iterable, meaning that it can be used in a loop.

Let’s write a loop to extract all the URLs from this ResultSet of links:

# Create an empty list before filling it during the loop
urls = []

for link in dissertlinks:
    urls.append(link.get("href"))

Let’s see our first 5 URLs:

urls[:5]
['https://trace.tennessee.edu/utk_graddiss/8076',
 'https://trace.tennessee.edu/utk_graddiss/9158',
 'https://trace.tennessee.edu/utk_graddiss/8080',
 'https://trace.tennessee.edu/utk_graddiss/8086',
 'https://trace.tennessee.edu/utk_graddiss/8078']
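As an aside, a list comprehension would build the same list more concisely:

urls = [link.get("href") for link in dissertlinks]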

Extract data from each page

For each element of urls (i.e. for each dissertation URL), we can now get our information.

# Create an empty list
ls = []

# For each element of our list of sites
for url in urls:
    # Send a request to the site
    r = requests.get(url)
    # Parse the result
    dissertpage = BeautifulSoup(r.text, "html.parser")
    # Get the date
    date = dissertpage.select("#publication_date p")[0].text
    # Get the major
    major = dissertpage.select("#department p")[0].text
    # Get the advisor
    advisor = dissertpage.select("#advisor1 p")[0].text
    # Store the results in the list
    ls.append((date, major, advisor))
    # Add a delay at each iteration
    time.sleep(0.1)

Some sites will block requests if they are too frequent. Adding a little delay between requests is often a good idea.
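If you wanted to make the loop more robust, you could also check the response status and skip pages where a selector does not match anything. Here is a sketch of one possible approach (the extra guards are additions, not part of the original workflow):

# Create an empty list
ls = []

for url in urls:
    r = requests.get(url)
    # Skip pages that did not load correctly
    if not r.ok:
        continue
    # Parse the result
    dissertpage = BeautifulSoup(r.text, "html.parser")
    date = dissertpage.select("#publication_date p")
    major = dissertpage.select("#department p")
    advisor = dissertpage.select("#advisor1 p")
    # Only keep the result if all three selectors matched
    if date and major and advisor:
        ls.append((date[0].text, major[0].text, advisor[0].text))
    # Add a delay at each iteration
    time.sleep(0.1)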

Store results in DataFrame

A DataFrame would be a lot more convenient than a list to hold our results.

First, we create a list with the column names for our future DataFrame:

cols = ["Date", "Major", "Advisor"]

Then we create our DataFrame:

df = pd.DataFrame(ls, columns=cols)
df
Date Major Advisor
0 5-2023 Life Sciences Bode A. Olukolu
1 12-2023 Industrial Engineering Hugh Medal
2 5-2023 Nuclear Engineering Erik Lukosi
3 5-2023 Energy Science and Engineering Kyle R. Gluesenkamp
4 5-2023 English Margaret Lazarus Dean
... ... ... ...
95 8-2023 Educational Psychology and Research Qi Sun
96 12-2023 Nuclear Engineering Lawrence H. Heilbronn
97 5-2023 Geology Bradley Thomson
98 5-2023 Natural Resources Sharon R. Jean-Philippe
99 12-2023 Psychology Greg Stuart

100 rows × 3 columns
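With the data in a DataFrame, quick summaries become easy. For instance (just one possible exploration, not part of the original workflow), we could count the number of dissertations per major:

df["Major"].value_counts().head()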

Save results to file

As a final step, we will save our data to a CSV file:

df.to_csv('dissertations_data.csv', index=False)

The default index=True writes the row numbers. By setting this argument to False, we avoid writing these indices to our file.

If you are using a Jupyter notebook or the IPython shell, you can type !ls to see that the file is there and !cat dissertations_data.csv to print its content.

The exclamation mark ! lets you run Unix shell commands from a notebook or the IPython shell.
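You could also read the file back with pandas to double-check its content:

pd.read_csv("dissertations_data.csv").head()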