Webscraping with an LLM

Author

Marie-Hélène Burle

The internet is a trove of information. A lot of it is publicly available and thus suitable for use in research. Extracting that information and putting it in an organized format for analysis can, however, be extremely tedious.

Some websites provide an API that makes it easy to extract information. These are sites built with programmatic access in mind (e.g. sites hosting databases, museum or art collections, etc.). When an API exists, it is definitely the way to go. Most websites, however, do not provide an API that can be queried.
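
For comparison, querying an API usually amounts to a single HTTP request that returns structured data (typically JSON) rather than HTML to parse. Here is a minimal sketch, assuming a hypothetical endpoint:

import requests

# Hypothetical endpoint, used for illustration only (not a real URL)
url = "https://example.org/api/v1/collection"

response = requests.get(url, params={"page": 1})
response.raise_for_status()   # stop if the request failed
records = response.json()     # structured data, ready to use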

Web scraping tools automate parts of that process, and Python is a popular language for the task.

Of note, an increasing number of websites use JavaScript to add cookies, interactivity, etc. This makes them a lot harder to scrape and requires more sophisticated tools.

In this section, we will use the package Beautiful Soup to scrape a simple site that does not contain any JavaScript.

We will use an LLM to help us in this process.

Background information

HTML and CSS

HyperText Markup Language (HTML) is the standard markup language for websites: it encodes the information related to the formatting and structure of webpages. Additionally, some of the customization can be stored in Cascading Style Sheets (CSS) files.

HTML uses tags of the form:

<some_tag>Your content</some_tag>

Some tags have attributes:

<some_tag attribute_name="attribute value">Your content</some_tag>

Examples:

Site structure:

  • <h2>This is a heading of level 2</h2>
  • <p>This is a paragraph</p>

Formatting:

  • <b>This is bold</b>
  • <a href="https://some.url">This is the text for a link</a>
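
Tags nest inside one another, which gives a page its tree structure. A minimal (simplified) page could look like this:

<html>
  <body>
    <h2>Dissertations</h2>
    <p>This is a paragraph with a <a href="https://some.url">link</a>.</p>
  </body>
</html>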

Web scraping

Web scraping is a general term for a set of tools that allow data to be extracted from websites automatically.

While most of the data on the internet is publicly available, it is illegal to scrape some sites and you should always look into the policy of a site before attempting to scrape it. Some sites will also block you if you submit too many requests in a short amount of time, so remember to scrape responsibly.

Web scraping example

We will use a website from the University of Tennessee, built on the Digital Commons Network, that contains a database of the university's PhD theses.

Our goal is to scrape data from this site to produce a dataframe with the date, major, and advisor for each dissertation. Scraping any other site which uses the Digital Commons Network can be done following a similar workflow.

Site exploration

First of all, let’s have a close look at the website we want to scrape. Before starting to write code, it is always a good idea to think carefully about what you are trying to achieve.

To see how the website is built, you can open the web inspector. Most browsers let you right-click on an element and select “Inspect” in the menu. Alternatively, Chrome opens the web inspector with F12, and Firefox and Edge with Ctrl+Shift+I (Windows, Linux) or Command+Option+I (macOS).

To create a dataframe with the data for all the dissertations on the first page, we need to do two things:

  • Step 1: from the first page of the dissertations database, we want to scrape the list of URLs of the individual dissertation pages.

  • Step 2: once we have the URLs, we want to scrape those pages too to get the date, major, and advisor for each dissertation.

Note that we are only scraping the first page (with the 100 most recent theses) to keep things simple. If you wanted to get all the data from the site, you would have to create a loop and repeat this over all 117 pages of the site.

LLM code

Let’s use an LLM to get us started.

Here is the prompt I used:

Write Python code to scrape the following website: https://trace.tennessee.edu/utk_graddiss/index.html.

As the end result, I want to have a polars dataframe with the following columns:

  • Date
  • Major of the thesis
  • Advisor (PI)

And here is the code it gave me:
import requests
from bs4 import BeautifulSoup
import polars as pl
from urllib.parse import urljoin

base_url = "https://trace.tennessee.edu/utk_graddiss/index.html"

def get_dissertation_data():
    # Get main page and extract dissertation URLs
    main_page = requests.get(base_url)
    soup = BeautifulSoup(main_page.content, "html.parser")

    # Extract individual dissertation links
    links = [urljoin(base_url, a["href"])
             for a in soup.select(".article-listing a")]

    data = []

    for link in links[:10]:  # Reduced to 10 for testing; remove slice for full 100
        try:
            # Get individual dissertation page
            page = requests.get(link)
            page_soup = BeautifulSoup(page.content, "html.parser")

            # Extract required fields
            date = page_soup.select_one("#publication_date p").text.strip()
            major = page_soup.select_one("#department p").text.strip()
            advisor = page_soup.select_one("#advisor1 p").text.strip()

            data.append({
                "Date": date,
                "Major": major,
                "Advisor": advisor
            })

        except Exception as e:
            print(f"Error processing {link}: {str(e)}")

    return pl.DataFrame(data)

# Usage
df = get_dissertation_data()
print(df)
shape: (10, 3)
┌─────────┬─────────────────────────────────┬────────────────────────┐
│ Date    ┆ Major                           ┆ Advisor                │
│ ---     ┆ ---                             ┆ ---                    │
│ str     ┆ str                             ┆ str                    │
╞═════════╪═════════════════════════════════╪════════════════════════╡
│ 8-2024  ┆ History                         ┆ Jay Rubenstein         │
│ 12-2024 ┆ Biochemistry and Cellular and … ┆ Dr. Rebecca A. Prosser │
│ 5-2024  ┆ Business Administration         ┆ Larry A. Fauver        │
│ 8-2024  ┆ Electrical Engineering          ┆ Dan Wilson             │
│ 12-2024 ┆ Mechanical Engineering          ┆ Prashant Singh         │
│ 5-2024  ┆ Chemical Engineering            ┆ Steven M. Abel         │
│ 12-2024 ┆ Industrial Engineering          ┆ Anahita Khojandi       │
│ 12-2024 ┆ Education                       ┆ Clara Lee Brown        │
│ 8-2024  ┆ Geography                       ┆ Sally P Horn           │
│ 12-2024 ┆ Education                       ┆ Enlida J Romero-Hall   │
└─────────┴─────────────────────────────────┴────────────────────────┘

The package Beautiful Soup—loaded in Python as bs4—transforms (parses) HTML data into a parse tree, which makes extracting information easier.
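
To see what that means on a small example, here is a minimal sketch parsing a made-up HTML fragment shaped like the elements the selectors above target:

from bs4 import BeautifulSoup

html = """
<div id="advisor1">
  <h4>Advisor</h4>
  <p>Some Name</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select_one() takes a CSS selector: "#advisor1 p" means
# "the <p> element inside the element whose id is advisor1"
print(soup.select_one("#advisor1 p").text.strip())  # Some Name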

Your turn:

  • Did it work? Go to the site and verify the data.

  • What is the problem? Can you fix it?

Code improvements

LLMs can be very helpful in getting you started, but you will often have to tweak the code to improve it—even when it works.

We now have code that works, but its downside is that the function only works on one webpage, so it is of limited use. If, for instance, we wanted to apply it to the second page of the site (https://trace.tennessee.edu/utk_graddiss/index.2.html), we couldn't: the function doesn't accept any argument and the URL of the site is written inside it. This is called hard-coding and it isn't a good coding practice.

A better approach is to have the function accept the URL of the page we want to scrape as an argument. It is actually really easy to modify the code to do this:

def get_dissertation_data(base_url):
    # Get main page and extract dissertation URLs
    main_page = requests.get(base_url)
    soup = BeautifulSoup(main_page.content, "html.parser")

    # Extract individual dissertation links
    links = [urljoin(base_url, a["href"])
             for a in soup.select(".article-listing a")]

    data = []

    for link in links:
        try:
            # Get individual dissertation page
            page = requests.get(link)
            page_soup = BeautifulSoup(page.content, "html.parser")

            # Extract required fields
            date = page_soup.select_one("#publication_date p").text.strip()
            major = page_soup.select_one("#department p").text.strip()
            advisor = page_soup.select_one("#advisor1 p").text.strip()

            data.append({
                "Date": date,
                "Major": major,
                "Advisor": advisor
            })

        except Exception as e:
            print(f"Error processing {link}: {str(e)}")

    return pl.DataFrame(data)

Now, if we want to use the function on that first page, we need to pass the URL as an argument:

df = get_dissertation_data(base_url)

You can verify that the code still works:

print(df)
shape: (100, 3)
┌─────────┬─────────────────────────────────┬────────────────────────┐
│ Date    ┆ Major                           ┆ Advisor                │
│ ---     ┆ ---                             ┆ ---                    │
│ str     ┆ str                             ┆ str                    │
╞═════════╪═════════════════════════════════╪════════════════════════╡
│ 8-2024  ┆ History                         ┆ Jay Rubenstein         │
│ 12-2024 ┆ Biochemistry and Cellular and … ┆ Dr. Rebecca A. Prosser │
│ 5-2024  ┆ Business Administration         ┆ Larry A. Fauver        │
│ 8-2024  ┆ Electrical Engineering          ┆ Dan Wilson             │
│ 12-2024 ┆ Mechanical Engineering          ┆ Prashant Singh         │
│ …       ┆ …                               ┆ …                      │
│ 5-2024  ┆ Mechanical Engineering          ┆ Feng-Yuan Zhang        │
│ 8-2024  ┆ Computer Science                ┆ Scott Ruoti            │
│ 8-2024  ┆ Mechanical Engineering          ┆ Tony L. Schmitz        │
│ 8-2024  ┆ Microbiology                    ┆ Shigetoshi Eda         │
│ 8-2024  ┆ Energy Science and Engineering  ┆ David C. Donovan       │
└─────────┴─────────────────────────────────┴────────────────────────┘

The code looks very similar, but it now allows us to scrape the data from any page of the website.

Your turn:

Scrape the data from the sixth page (https://trace.tennessee.edu/utk_graddiss/index.6.html).
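
Since the function now takes the URL as an argument, we could even loop over all the index pages mentioned earlier. Here is a sketch, assuming the remaining pages follow the same pattern as pages 2 and 6 (index.2.html through index.117.html):

# Build the list of the 117 index page URLs (sketch only; be mindful of
# the site's policies before sending this many requests)
page_urls = ["https://trace.tennessee.edu/utk_graddiss/index.html"] + [
    f"https://trace.tennessee.edu/utk_graddiss/index.{n}.html"
    for n in range(2, 118)
]

# Scrape each index page and stack the results into one dataframe
df_all = pl.concat([get_dissertation_data(url) for url in page_urls])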

Another improvement that we can make to the code is to add a little delay between requests because some sites will block requests if they are too frequent.

For this we need to load the time module:

import time

Then add time.sleep(0.1) in the loop:

def get_dissertation_data(base_url):
    # Get main page and extract dissertation URLs
    main_page = requests.get(base_url)
    soup = BeautifulSoup(main_page.content, "html.parser")

    # Extract individual dissertation links
    links = [urljoin(base_url, a["href"])
             for a in soup.select(".article-listing a")]

    data = []

    for link in links:
        try:
            # Get individual dissertation page
            page = requests.get(link)
            page_soup = BeautifulSoup(page.content, "html.parser")

            # Extract required fields
            date = page_soup.select_one("#publication_date p").text.strip()
            major = page_soup.select_one("#department p").text.strip()
            advisor = page_soup.select_one("#advisor1 p").text.strip()

            data.append({
                "Date": date,
                "Major": major,
                "Advisor": advisor
            })

            # Add 0.1 s between each request to the site
            time.sleep(0.1)

        except Exception as e:
            print(f"Error processing {link}: {str(e)}")

    return pl.DataFrame(data)

Save data to file

If you want to export the data and save it to a CSV file, you can do this:

df.write_csv("dissertations_data.csv")
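
Polars can also read the file back, or save the data in other formats. For instance:

# Read the CSV back into a dataframe
df = pl.read_csv("dissertations_data.csv")

# Or save it in the more compact Parquet format
df.write_parquet("dissertations_data.parquet")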