import requests  # To download the HTML data from a site
from bs4 import BeautifulSoup  # To parse the HTML data
import time  # To add a delay between requests
import pandas as pd  # To store our data in a DataFrame
Web scraping with Python
The internet is a trove of information. A lot of it is publicly available and thus suitable for use in research. Extracting that information and putting it in an organized format for analysis can, however, be extremely tedious.
Web scraping tools make it possible to automate parts of that process, and Python is a popular language for the task.
In this workshop, I will guide you through a simple example using the package Beautiful Soup.
HTML and CSS
HyperText Markup Language (HTML) is the standard markup language for websites: it encodes the information related to the formatting and structure of webpages. Additionally, some of the customization can be stored in Cascading Style Sheets (CSS) files.
HTML uses tags of the form:
<some_tag>Your content</some_tag>
Some tags have attributes:
<some_tag attribute_name="attribute value">Your content</some_tag>
Examples:
Site structure:
<h2>This is a heading of level 2</h2>
<p>This is a paragraph</p>
Formatting:
<b>This is bold</b>
<a href="https://some.url">This is the text for a link</a>
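To make the idea of tags and attributes concrete, here is a minimal sketch that uses only Python's standard library (`html.parser`) to list the tag, attributes, and text of each element in a small snippet. The `TagLister` class is a hypothetical helper written for this illustration; it is not part of Beautiful Soup and is not used in the rest of the workshop.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Collect (tag, attributes, text) for each element that contains text."""

    def __init__(self):
        super().__init__()
        self.current = None  # most recently opened tag and its attributes
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.current = (tag, dict(attrs))

    def handle_data(self, data):
        if self.current and data.strip():
            tag, attrs = self.current
            self.found.append((tag, attrs, data.strip()))

    def handle_endtag(self, tag):
        self.current = None


snippet = (
    "<h2>This is a heading of level 2</h2>"
    '<a href="https://some.url">This is the text for a link</a>'
)

lister = TagLister()
lister.feed(snippet)
lister.close()

for tag, attrs, text in lister.found:
    print(tag, attrs, text)
```

The `h2` element has no attributes, while the `a` element carries its URL in the `href` attribute.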
Web scraping
Web scraping is a general term for a set of tools that extract data from the web automatically.
While most of the data on the internet is publicly available, it is illegal to scrape some sites and you should always look into the policy of a site before attempting to scrape it. Some sites will also block you if you submit too many requests in a short amount of time, so remember to scrape responsibly.
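One place where many sites state their crawling rules is a `robots.txt` file. As a rough sketch, Python's standard `urllib.robotparser` module can check whether a given URL is allowed under such rules. The rules and the `example.org` URLs below are made up for illustration, and a `robots.txt` file does not replace reading a site's terms of use.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, made up for this example
sample_rules = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
""".splitlines()

parser = RobotFileParser()
parser.parse(sample_rules)

# Is a generic crawler ("*") allowed to fetch these pages?
public_ok = parser.can_fetch("*", "https://example.org/public/page.html")
private_ok = parser.can_fetch("*", "https://example.org/private/data.html")

print(public_ok)
print(private_ok)
# Delay requested by the site between requests, if any
print(parser.crawl_delay("*"))
```

For a real site, you would call `parser.set_url(...)` with the address of its `robots.txt` and then `parser.read()` instead of parsing a hard-coded string.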
Example for this workshop
We will use a website from the University of Tennessee containing a database of PhD theses from that university.
Our goal is to scrape data from this site to produce a dataframe with the date, major, and advisor for each dissertation.
We will only do this for the first page which contains the links to the 100 most recent theses. If you really wanted to gather all the data, you would have to do this for all pages.
Let’s look at the sites
First of all, let’s have a close look at the websites we want to scrape to think carefully about what we want to do. Before starting to write code, it is always a good idea to think about what you are trying to achieve with your code.
To create a dataframe with the data for all the dissertations on that first page, we need to do two things:
Step 1: from the dissertations database first page, we want to scrape the list of URLs for the dissertation pages.
Step 2: once we have the URLs, we want to scrape those pages too to get the date, major, and advisor for each dissertation.
Load packages
Let’s load the packages that will make scraping websites with Python easier:
Send request to the main site
As mentioned above, our site is the database of PhD dissertations from the University of Tennessee.
Let’s create a string with the URL:
url = "https://trace.tennessee.edu/utk_graddiss/index.html"
First, we send a request to that URL and save the response in a variable called r:
r = requests.get(url)
Let’s see what our response looks like:
r
<Response [200]>
If you look in the list of HTTP status codes, you can see that a response with a code of 200 means that the request was successful.
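If you would rather look up a status code from Python, the standard library's `http.HTTPStatus` enumeration knows the official phrase for each code. The `describe_status` function below is a hypothetical helper written for this illustration. With requests itself, `r.status_code` holds the numeric code and `r.raise_for_status()` raises an exception for error codes.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description such as '200 OK' for a numeric status code."""
    status = HTTPStatus(code)
    return f"{status.value} {status.phrase}"


for code in (200, 404, 429):
    print(describe_status(code))
```

Code 429 ("Too Many Requests") is the one a site may send back if you scrape it too aggressively.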
Explore the raw data
To get the actual content of the response as Unicode (text), we can use the text property of the response. This will give us the raw HTML markup from the webpage.
Let’s print the first 200 characters:
print(r.text[:200])
<!DOCTYPE html>
<html lang="en">
<head><!-- inj yui3-seed: --><script type='text/javascript' src='//cdnjs.cloudflare.com/ajax/libs/yui/3.6.0/yui/yui-min.js'></script><script type='text/javascript' sr
Parse the data
The package Beautiful Soup transforms (parses) such HTML data into a parse tree, which will make extracting information easier.
Let’s create an object called mainpage with the parse tree:
mainpage = BeautifulSoup(r.text, "html.parser")
html.parser is the name of the parser that we are using here. It is better to use a specific parser to get consistent results across environments.
We can print the beginning of the parsed result:
print(mainpage.prettify()[:200])
<!DOCTYPE html>
<html lang="en">
<head>
<!-- inj yui3-seed: -->
<script src="//cdnjs.cloudflare.com/ajax/libs/yui/3.6.0/yui/yui-min.js" type="text/javascript">
</script>
<script src="//ajax.g
The prettify method turns the BeautifulSoup object we created into a string (which is needed for slicing).
It doesn’t look any clearer to us, but it is now in a format the Beautiful Soup package can work with.
For instance, we can get the HTML segment containing the title with three methods:
- using the title tag name:
mainpage.title
<title>
Doctoral Dissertations | Graduate School | University of Tennessee, Knoxville
</title>
- using find to look for HTML markers (tags, attributes, etc.):
mainpage.find("title")
<title>
Doctoral Dissertations | Graduate School | University of Tennessee, Knoxville
</title>
- using select, which accepts CSS selectors:
mainpage.select("title")
[<title>
Doctoral Dissertations | Graduate School | University of Tennessee, Knoxville
</title>]
find will only return the first element. find_all will return all elements. select will also return all elements. Which one you choose depends on what you need to extract. There are often several ways to get there.
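The difference between these methods is easiest to see on a tiny document. The HTML below is made up for illustration. `select_one`, which is not used elsewhere in this workshop, is the CSS-selector counterpart of `find`.

```python
from bs4 import BeautifulSoup

# Made-up HTML, for illustration only
html = """
<ul class="menu">
  <li><a href="/home">Home</a></li>
  <li><a href="/about">About</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")                # a single Tag, or None if nothing matches
all_links = soup.find_all("a")             # a list-like ResultSet of Tags
css_links = soup.select(".menu a")         # also list-like, but takes a CSS selector
one_css_link = soup.select_one(".menu a")  # first match only, like find

print(first_link.text)
print(len(all_links), len(css_links))
print(soup.find("table"))  # no <table> in this document
```

Because `find` and `select_one` return a single element while `find_all` and `select` return a list-like object, only the latter need indexing before you can read an attribute or the text.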
Here are other examples of data extraction:
mainpage.head
<head><!-- inj yui3-seed: --><script src="//cdnjs.cloudflare.com/ajax/libs/yui/3.6.0/yui/yui-min.js" type="text/javascript"></script><script src="//ajax.googleapis.com/ajax/libs/jquery/1.10.2/jquery.min.js" type="text/javascript"></script><!-- Adobe Analytics --><script src="https://assets.adobedtm.com/4a848ae9611a/d0e96722185b/launch-d525bb0064d8.min.js" type="text/javascript"></script><script src="/assets/nr_browser_production.js" type="text/javascript"></script>
<!-- def.1 -->
<meta charset="utf-8"/>
<meta content="width=device-width" name="viewport"/>
<title>
Doctoral Dissertations | Graduate School | University of Tennessee, Knoxville
</title>
<!-- FILE meta-tags.inc --><!-- FILE: /srv/sequoia/main/data/assets/site/meta-tags.inc -->
<!-- FILE: meta-tags.inc (cont) -->
<!-- sh.1 -->
<link href="/ir-style.css" media="screen" rel="stylesheet" type="text/css"/>
<link href="/ir-custom.css" media="screen" rel="stylesheet" type="text/css"/>
<link href="ir-custom.css" media="screen" rel="stylesheet" type="text/css"/>
<link href="/ir-local.css" media="screen" rel="stylesheet" type="text/css"/>
<link href="ir-local.css" media="screen" rel="stylesheet" type="text/css"/>
<link href="/ir-print.css" media="print" rel="stylesheet" type="text/css"/>
<link href="/assets/floatbox/floatbox.css" rel="stylesheet" type="text/css"/>
<link href="/recent.rss" rel="alternate" title="Site Feed" type="application/rss+xml"/>
<link href="/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
<!--[if IE]>
<link rel="stylesheet" href="/ir-ie.css" type="text/css" media="screen">
<![endif]-->
<!-- JS -->
<script src="/assets/jsUtilities.js" type="text/javascript"></script>
<script src="/assets/footnoteLinks.js" type="text/javascript"></script>
<script src="/assets/scripts/yui-init.pack.js" type="text/javascript"></script>
<script src="/assets/scripts/bepress-init.pack.js" type="text/javascript"></script>
<script src="/assets/scripts/JumpListYUI.pack.js" type="text/javascript"></script>
<!-- end sh.1 -->
<script type="text/javascript">var pageData = {"page":{"environment":"prod","productName":"bpdg","language":"en","name":"ir_etd","businessUnit":"els:rp:st"},"visitor":{}};</script>
</head>
mainpage.a
<a data-scroll="" href="https://trace.tennessee.edu" title="Home">Home</a>
mainpage.find_all("a")[:5]
[<a data-scroll="" href="https://trace.tennessee.edu" title="Home">Home</a>,
<a data-scroll="" href="https://trace.tennessee.edu/do/search/advanced/" title="Search"><i class="icon-search"></i> Search</a>,
<a data-scroll="" href="https://trace.tennessee.edu/communities.html" title="Browse">Browse Collections</a>,
<a data-scroll="" href="/cgi/myaccount.cgi?context=utk_graddiss" title="My Account">My Account</a>,
<a data-scroll="" href="https://trace.tennessee.edu/about.html" title="About">About</a>]
mainpage.select("a")[:5]
[<a data-scroll="" href="https://trace.tennessee.edu" title="Home">Home</a>,
<a data-scroll="" href="https://trace.tennessee.edu/do/search/advanced/" title="Search"><i class="icon-search"></i> Search</a>,
<a data-scroll="" href="https://trace.tennessee.edu/communities.html" title="Browse">Browse Collections</a>,
<a data-scroll="" href="/cgi/myaccount.cgi?context=utk_graddiss" title="My Account">My Account</a>,
<a data-scroll="" href="https://trace.tennessee.edu/about.html" title="About">About</a>]
Test run
Identify relevant markers
The HTML code for this webpage contains the data we are interested in, but it is mixed in with a lot of HTML formatting and data we don’t care about. We need to extract the data relevant to us and turn it into a workable format.
The first step is to find the HTML markers that contain our data. One option is to use a web inspector or, even easier, the SelectorGadget, a JavaScript bookmarklet built by Andrew Cantino.
To use this tool, go to the SelectorGadget website and drag the link of the bookmarklet to your bookmarks bar.
Now, go to the dissertations database first page and click on the bookmarklet in your bookmarks bar. You will see a floating box at the bottom of your screen. As you move your mouse across the screen, an orange rectangle appears around each element over which you pass.
Click on one of the dissertation links: now, there is an a appearing in the box at the bottom as well as the number of elements selected. The selected elements are highlighted in yellow. Those elements are links (in HTML, a tags define hyperlinks).
As you can see, all the links we want are selected. However, there are many other links we don’t want that are also highlighted. In fact, all links in the document are selected. We need to remove the categories of links that we don’t want. To do this, hover over any of the links we don’t want. You will see a red rectangle around it. Click on it: now all similar links are gone. You might have to do this a few times until only the relevant links (i.e. those that lead to the dissertation information pages) remain highlighted.
As there are 100 such links per page, the count of selected elements in the bottom floating box should be down to 100.
In the main section of the floating box, you can now see: .article-listing a. This means that the data we want are in the a elements (links) nested inside elements of class .article-listing.
Extract test URL
It is a good idea to test things out on a single element before doing a massive batch scraping of a site, so let’s test our method for the first dissertation.
To start, we need to extract the first URL. Here, we will use CSS selectors (we could get there using find too). mainpage.select(".article-listing a") would give us all the results (100 links):
len(mainpage.select(".article-listing a"))
100
To get the first one, we index it:
mainpage.select(".article-listing a")[0]
<a href="https://trace.tennessee.edu/utk_graddiss/10400">The Sons of Melisende: Baldwin III, Amalric, and Kingship in the Kingdom of Jerusalem, 1143-1174 CE</a>
The actual URL is contained in the href attribute. Attributes can be extracted with the get method:
mainpage.select(".article-listing a")[0].get("href")
'https://trace.tennessee.edu/utk_graddiss/10400'
We now have our URL as a string. We can double-check that it is indeed a string:
type(mainpage.select(".article-listing a")[0].get("href"))
str
This is exactly what we need to send a request to that site, so let’s create an object url_test with it:
url_test = mainpage.select(".article-listing a")[0].get("href")
We have our first thesis URL:
print(url_test)
https://trace.tennessee.edu/utk_graddiss/10400
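If you want to experiment with this kind of selector without sending any request, you can parse a small string instead. The markup and `example.org` URLs below are hypothetical and only loosely modelled on a listing page; the point is that `.article-listing a` keeps the links inside the listing and ignores the others.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on a listing page
html = """
<div id="nav"><a href="/about.html">About</a></div>
<div class="article-listing"><a href="https://example.org/item/1">First title</a></div>
<div class="article-listing"><a href="https://example.org/item/2">Second title</a></div>
"""
page = BeautifulSoup(html, "html.parser")

# Only <a> tags nested inside an element of class "article-listing"
links = page.select(".article-listing a")
first_url = links[0].get("href")

print(len(links))
print(first_url)
# get returns None for an attribute the tag does not have
print(links[0].get("title"))
```

Using `get` rather than square brackets (`links[0]["title"]`) avoids a `KeyError` when an attribute is missing.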
Send request to test URL
Now that we have the URL for the first dissertation information page, we want to extract the date, major, and advisor for that dissertation.
The first thing to do, as we did earlier with the database site, is to send a request to that page. Let’s assign it to a new object that we will call r_test:
r_test = requests.get(url_test)
Then we can parse it with Beautiful Soup (as we did before). Let’s create a dissertpage_test object:
dissertpage_test = BeautifulSoup(r_test.text, "html.parser")
Get data for test URL
It is time to extract the publication date, major, and advisor for our test URL.
Let’s start with the date. Thanks to the SelectorGadget, following the method we saw earlier, we can see that we now need elements marked by #publication_date p.
We can use select as we did earlier:
dissertpage_test.select("#publication_date p")
[<p>8-2024</p>]
Notice the square brackets around our result: this is important. It shows us that we have a ResultSet (a list of results specific to Beautiful Soup). This is because select returns all the results. Here, we have a single result, but the format is still list-like. Before we can go further, we need to index the value out of it:
dissertpage_test.select("#publication_date p")[0]
<p>8-2024</p>
We can now get the text out of this paragraph with the text attribute:
dissertpage_test.select("#publication_date p")[0].text
'8-2024'
We could save it in a variable date_test:
date_test = dissertpage_test.select("#publication_date p")[0].text
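Indexing with `[0]` raises an `IndexError` when a page does not contain the element we are looking for. One possible way to guard against that is sketched below, using `select_one`, which returns `None` when nothing matches. The `text_or_none` helper and the page fragment are assumptions made for this illustration and are not part of the workshop code.

```python
from bs4 import BeautifulSoup


def text_or_none(soup, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    element = soup.select_one(selector)
    return element.get_text(strip=True) if element is not None else None


# Made-up page fragment
html = '<div id="publication_date"><h4>Date of Award</h4><p> 8-2024 </p></div>'
page = BeautifulSoup(html, "html.parser")

print(text_or_none(page, "#publication_date p"))
print(text_or_none(page, "#no_such_id p"))
```

The second call returns `None` instead of stopping the script, which can be handy when scraping many pages that may not all share exactly the same structure.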
Your turn:
Get the major and advisor for our test URL.
Full run
Once everything is working for a test site, we can do some bulk scraping.
Extract all URLs
We already know how to get the 100 dissertation links from the main page: mainpage.select(".article-listing a"). Let’s assign it to a variable:
dissertlinks = mainpage.select(".article-listing a")
This ResultSet is an iterable, meaning that it can be used in a loop.
Let’s write a loop to extract all the URLs from this ResultSet of links:
# Create an empty list before filling it during the loop
urls = []

for link in dissertlinks:
    urls.append(link.get("href"))
Let’s see our first 5 URLs:
urls[:5]
['https://trace.tennessee.edu/utk_graddiss/10400',
'https://trace.tennessee.edu/utk_graddiss/10081',
'https://trace.tennessee.edu/utk_graddiss/10401',
'https://trace.tennessee.edu/utk_graddiss/10082',
'https://trace.tennessee.edu/utk_graddiss/10424']
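The same loop can also be written as a list comprehension, for instance `[link.get("href") for link in dissertlinks]`. The dissertation links here are already absolute URLs, but some links on the page (such as the "My Account" one) are relative. As a small sketch, `urljoin` from the standard library turns either kind into an absolute URL; the `hrefs` list below simply reuses two `href` values seen earlier on this page.

```python
from urllib.parse import urljoin

base_url = "https://trace.tennessee.edu/utk_graddiss/index.html"

# Two href values seen on the page: one absolute, one relative
hrefs = [
    "https://trace.tennessee.edu/utk_graddiss/10400",
    "/cgi/myaccount.cgi?context=utk_graddiss",
]

# Absolute URLs are left unchanged; relative ones are resolved against base_url
absolute_urls = [urljoin(base_url, href) for href in hrefs]

for u in absolute_urls:
    print(u)
```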
Extract data from each page
For each element of urls (i.e. for each dissertation URL), we can now get our information.
# Create an empty list
ls = []

# For each element of our list of sites
for url in urls:
    # Send a request to the site
    r = requests.get(url)
    # Parse the result
    dissertpage = BeautifulSoup(r.text, "html.parser")
    # Get the date
    date = dissertpage.select("#publication_date p")[0].text
    # Get the major
    major = dissertpage.select("#department p")[0].text
    # Get the advisor
    advisor = dissertpage.select("#advisor1 p")[0].text
    # Store the results in the list
    ls.append((date, major, advisor))
    # Add a delay at each iteration
    time.sleep(0.1)
Some sites will block requests if they are too frequent. Adding a little delay between requests is often a good idea.
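If you reuse this pattern in several loops, one option is to move the delay into a small generator so that the pause cannot be forgotten. The `throttled` function below is a hypothetical helper sketched for this purpose. It sleeps between items rather than after each one, so there is no needless wait after the last request. The strings stand in for real URLs and no request is sent.

```python
import time


def throttled(items, delay):
    """Yield each item, sleeping `delay` seconds between consecutive items."""
    for index, item in enumerate(items):
        if index > 0:
            time.sleep(delay)
        yield item


start = time.monotonic()
for url in throttled(["page-1", "page-2", "page-3"], delay=0.1):
    # A real loop would send the request and parse the page here
    print(url)
elapsed = time.monotonic() - start

print(f"Elapsed: {elapsed:.2f} s")
```

What counts as a suitable delay depends on the site; 0.1 second is simply the value used in the loop above.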
Store results in DataFrame
A DataFrame would be a lot more convenient than a list to hold our results.
First, we create a list with the column names for our future DataFrame:
cols = ["Date", "Major", "Advisor"]
Then we create our DataFrame:
df = pd.DataFrame(ls, columns=cols)
df
| | Date | Major | Advisor |
|---|---|---|---|
| 0 | 8-2024 | History | Jay Rubenstein |
| 1 | 5-2024 | Business Administration | Larry A. Fauver |
| 2 | 8-2024 | Electrical Engineering | Dan Wilson |
| 3 | 5-2024 | Chemical Engineering | Steven M. Abel |
| 4 | 8-2024 | Geography | Sally P Horn |
| ... | ... | ... | ... |
| 95 | 5-2024 | Data Science and Engineering | Debangshu Mukherjee |
| 96 | 8-2024 | Environmental Engineering | Terry C. Hazen |
| 97 | 8-2024 | Sociology | Dr. Jon D. Shefner |
| 98 | 5-2024 | Mechanical Engineering | Seungha Shin |
| 99 | 5-2024 | Experimental Psychology | Garriy Shteynberg |

100 rows × 3 columns
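The dates are stored as plain strings such as '8-2024'. If you want to sort or filter by date, one possible next step is to convert them with `pd.to_datetime`. This is a minimal sketch on two rows taken from the results shown above; the name `df_small` is made up for the example.

```python
import pandas as pd

# Two rows taken from the scraped results
rows = [
    ("8-2024", "History", "Jay Rubenstein"),
    ("5-2024", "Business Administration", "Larry A. Fauver"),
]
df_small = pd.DataFrame(rows, columns=["Date", "Major", "Advisor"])

# Parse "month-year" strings; the day defaults to the first of the month
df_small["Date"] = pd.to_datetime(df_small["Date"], format="%m-%Y")

print(df_small.sort_values("Date"))
```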
Save results to file
As a final step, we will save our data to a CSV file:
df.to_csv('dissertations_data.csv', index=False)
The default index=True writes the row numbers. Setting this argument to False keeps these indices out of our file.
If you are using a Jupyter notebook or the IPython shell, you can type !ls to see that the file is there and !cat dissertations_data.csv to print its content.
The ! prefix runs Unix shell commands from a notebook or the IPython shell.
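Outside of a notebook, or on a system without Unix commands, you can run the same checks from Python. The sketch below writes a one-row DataFrame to a temporary directory (so that it does not touch your real file), confirms that the file exists, and reads it back. The names `df_demo` and `df_back` are made up for the example.

```python
import tempfile
from pathlib import Path

import pandas as pd

df_demo = pd.DataFrame(
    [("8-2024", "History", "Jay Rubenstein")],
    columns=["Date", "Major", "Advisor"],
)

with tempfile.TemporaryDirectory() as tmpdir:
    csv_path = Path(tmpdir) / "dissertations_data.csv"
    df_demo.to_csv(csv_path, index=False)

    # Equivalent of !ls: check that the file is there
    file_exists = csv_path.exists()
    # Equivalent of !cat: look at the raw content (here, the header line)
    first_line = csv_path.read_text().splitlines()[0]
    # Load the file back into a DataFrame
    df_back = pd.read_csv(csv_path)

print(file_exists)
print(first_line)
print(df_back)
```

Because the file was written with `index=False`, the header line contains only the three column names.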