The internet is a trove of information. A lot of it is publicly available and thus suitable for use in research. Extracting that information and putting it in an organized format for analysis can, however, be extremely tedious.
Some websites provide an API that makes it easy to extract information programmatically. These are sites that were built with data access in mind (e.g. sites that contain databases, museums, art collections, etc.). When an API exists, it is definitely the way to go. Most websites, however, do not provide an API that can be queried.
Web scraping tools make it possible to automate parts of that process, and Python is a popular language for the task.
Of note, an increasing number of websites use JavaScript to add cookies, interactivity, etc. This makes them much harder to scrape and requires more sophisticated tools.
In this section, we will use the Beautiful Soup package to scrape a simple site that does not contain any JavaScript.
We will use an LLM to help us in this process.
Background information
HTML and CSS
HyperText Markup Language (HTML) is the standard markup language for websites: it encodes the information related to the formatting and structure of webpages. Additionally, some of the customization can be stored in Cascading Style Sheets (CSS) files. For example, a hyperlink is encoded in HTML with the <a> tag:
<a href="https://some.url">This is the text for a link</a>
Web scraping
Web scraping is a general term for a set of tools that allow data to be extracted from the web automatically.
While most of the data on the internet is publicly available, it is illegal to scrape some sites and you should always look into the policy of a site before attempting to scrape it. Some sites will also block you if you submit too many requests in a short amount of time, so remember to scrape responsibly.
Our goal is to scrape data from this site (https://trace.tennessee.edu/utk_graddiss/index.html) to produce a dataframe with the date, major, and advisor for each dissertation. Scraping any other site that uses the Digital Commons Network can be done following a similar workflow.
Site exploration
First of all, let’s have a close look at the website we want to scrape. Before starting to write code, it is always a good idea to think carefully about what you are trying to achieve.
To see how the website is built, you can open the web inspector. In most browsers, you can right-click on an element and select “Inspect” in the menu. Alternatively, Chrome opens the web inspector with F12, and Firefox and Edge with Ctrl+Shift+I (Windows, Linux) or Command+Option+I (macOS).
To create a dataframe with the data for all the dissertations on the first page, we need to do two things:
Step 1: scrape the main listing page to get the URL of each individual dissertation.
Step 2: once we have the URLs, we want to scrape those pages too to get the date, major, and advisor for each dissertation.
Note that we are only scraping the first page (with the 100 most recent theses) to keep things simple. If you wanted to get all the data from the site, you would have to create a loop and repeat this over all 117 pages of the site.
LLM code
Let’s use an LLM to get us started.
Here is the prompt I used:
Write Python code to scrape the following website: https://trace.tennessee.edu/utk_graddiss/index.html.
As the end result, I want to have a polars dataframe with the following columns:
Date
Major of the thesis
Advisor (PI)
import requests
from bs4 import BeautifulSoup
import polars as pl
from urllib.parse import urljoin

base_url = "https://trace.tennessee.edu/utk_graddiss/index.html"

def get_dissertation_data():
    # Get main page and extract dissertation URLs
    main_page = requests.get(base_url)
    soup = BeautifulSoup(main_page.content, "html.parser")

    # Extract individual dissertation links
    links = [urljoin(base_url, a["href"]) for a in soup.select(".article-listing a")]

    data = []
    for link in links[:10]:  # Reduced to 10 for testing; remove slice for full 100
        try:
            # Get individual dissertation page
            page = requests.get(link)
            page_soup = BeautifulSoup(page.content, "html.parser")

            # Extract required fields
            date = page_soup.select_one("#publication_date p").text.strip()
            major = page_soup.select_one("#department p").text.strip()
            advisor = page_soup.select_one("#advisor1 p").text.strip()

            data.append({
                "Date": date,
                "Major": major,
                "Advisor": advisor
            })
        except Exception as e:
            print(f"Error processing {link}: {str(e)}")

    return pl.DataFrame(data)

# Usage
df = get_dissertation_data()
print(df)
shape: (10, 3)
┌─────────┬─────────────────────────────────┬────────────────────────┐
│ Date ┆ Major ┆ Advisor │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═════════╪═════════════════════════════════╪════════════════════════╡
│ 8-2024 ┆ History ┆ Jay Rubenstein │
│ 12-2024 ┆ Biochemistry and Cellular and … ┆ Dr. Rebecca A. Prosser │
│ 5-2024 ┆ Business Administration ┆ Larry A. Fauver │
│ 8-2024 ┆ Electrical Engineering ┆ Dan Wilson │
│ 12-2024 ┆ Mechanical Engineering ┆ Prashant Singh │
│ 5-2024 ┆ Chemical Engineering ┆ Steven M. Abel │
│ 12-2024 ┆ Industrial Engineering ┆ Anahita Khojandi │
│ 12-2024 ┆ Education ┆ Clara Lee Brown │
│ 8-2024 ┆ Geography ┆ Sally P Horn │
│ 12-2024 ┆ Education ┆ Enlida J Romero-Hall │
└─────────┴─────────────────────────────────┴────────────────────────┘
The package Beautiful Soup—loaded in Python as bs4—transforms (parses) HTML data into a parse tree, which makes extracting information easier.
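As a minimal illustration (using a made-up HTML snippet rather than the actual site), here is how Beautiful Soup parses HTML and lets you query the resulting tree with CSS selectors like the ones used in the code above:

from bs4 import BeautifulSoup

# A small, made-up HTML snippet for illustration
html = """
<div class="article-listing">
  <a href="https://some.url/dissertation1">First dissertation</a>
  <a href="https://some.url/dissertation2">Second dissertation</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Select all links nested inside elements with the class "article-listing"
for a in soup.select(".article-listing a"):
    print(a["href"], a.text)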
Your turn:
Did it work? Go to the site and verify the data.
What is the problem? Can you fix it?
Code improvements
LLMs can be very helpful in getting you started, but you will often have to tweak the code to improve it—even when it works.
We now have code that works, but it has one downside: the function only works on one webpage, so it is of limited use. If, for instance, we wanted to apply it to the second page of the site (https://trace.tennessee.edu/utk_graddiss/index.2.html), we couldn’t, because the function doesn’t accept any argument. The URL of the site is written inside the function. This is called hard coding and it isn’t a good coding practice.
A better approach would be to create a function that accepts the URL of the page we want to scrape as an argument. It is actually really easy to modify the code to do this:
def get_dissertation_data(base_url):
    # Get main page and extract dissertation URLs
    main_page = requests.get(base_url)
    soup = BeautifulSoup(main_page.content, "html.parser")

    # Extract individual dissertation links
    links = [urljoin(base_url, a["href"]) for a in soup.select(".article-listing a")]

    data = []
    for link in links:
        try:
            # Get individual dissertation page
            page = requests.get(link)
            page_soup = BeautifulSoup(page.content, "html.parser")

            # Extract required fields
            date = page_soup.select_one("#publication_date p").text.strip()
            major = page_soup.select_one("#department p").text.strip()
            advisor = page_soup.select_one("#advisor1 p").text.strip()

            data.append({
                "Date": date,
                "Major": major,
                "Advisor": advisor
            })
        except Exception as e:
            print(f"Error processing {link}: {str(e)}")

    return pl.DataFrame(data)
Now, if we want to use the function on that first page, we need to pass the URL as an argument:
df = get_dissertation_data(base_url)
You can verify that the code still works:
print(df)
shape: (100, 3)
┌─────────┬─────────────────────────────────┬────────────────────────┐
│ Date ┆ Major ┆ Advisor │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═════════╪═════════════════════════════════╪════════════════════════╡
│ 8-2024 ┆ History ┆ Jay Rubenstein │
│ 12-2024 ┆ Biochemistry and Cellular and … ┆ Dr. Rebecca A. Prosser │
│ 5-2024 ┆ Business Administration ┆ Larry A. Fauver │
│ 8-2024 ┆ Electrical Engineering ┆ Dan Wilson │
│ 12-2024 ┆ Mechanical Engineering ┆ Prashant Singh │
│ … ┆ … ┆ … │
│ 5-2024 ┆ Mechanical Engineering ┆ Feng-Yuan Zhang │
│ 8-2024 ┆ Computer Science ┆ Scott Ruoti │
│ 8-2024 ┆ Mechanical Engineering ┆ Tony L. Schmitz │
│ 8-2024 ┆ Microbiology ┆ Shigetoshi Eda │
│ 8-2024 ┆ Energy Science and Engineering ┆ David C. Donovan │
└─────────┴─────────────────────────────────┴────────────────────────┘
The code looks very similar, but it now allows us to scrape the data from any page of the website.
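For example, if you wanted the data for the whole collection, you could loop over all the listing pages and combine the results. The sketch below assumes that the pagination follows the pattern index.html, index.2.html, index.3.html, etc. and that there are 117 pages, as mentioned above; verify both on the site before running it.

# Sketch only: the page count and URL pattern are assumptions to verify on the site
listing_pages = ["https://trace.tennessee.edu/utk_graddiss/index.html"] + [
    f"https://trace.tennessee.edu/utk_graddiss/index.{i}.html" for i in range(2, 118)
]

# Scrape each listing page and stack the resulting dataframes
df_all = pl.concat([get_dissertation_data(url) for url in listing_pages])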
Another improvement that we can make to the code is to add a little delay between requests because some sites will block requests if they are too frequent.
For this we need to load the time module:
import time
Then add time.sleep(0.1) in the loop:
def get_dissertation_data(base_url):
    # Get main page and extract dissertation URLs
    main_page = requests.get(base_url)
    soup = BeautifulSoup(main_page.content, "html.parser")

    # Extract individual dissertation links
    links = [urljoin(base_url, a["href"]) for a in soup.select(".article-listing a")]

    data = []
    for link in links:
        try:
            # Get individual dissertation page
            page = requests.get(link)
            page_soup = BeautifulSoup(page.content, "html.parser")

            # Extract required fields
            date = page_soup.select_one("#publication_date p").text.strip()
            major = page_soup.select_one("#department p").text.strip()
            advisor = page_soup.select_one("#advisor1 p").text.strip()

            data.append({
                "Date": date,
                "Major": major,
                "Advisor": advisor
            })

            # Add 0.1 s between each request to the site
            time.sleep(0.1)
        except Exception as e:
            print(f"Error processing {link}: {str(e)}")

    return pl.DataFrame(data)
Save data to file
If you want to export the data and save it to a CSV file, you can do this:
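# Write the dataframe to a CSV file (the file name here is just an example)
df.write_csv("dissertations.csv")

This uses polars’ write_csv method; replace the file name with whatever path you want.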