Web scraping with an LLM

Author

Marie-Hélène Burle

The internet is a trove of information. A lot of it is publicly available and thus suitable for use in research. Extracting that information and putting it into an organized format for analysis can, however, be extremely tedious.

Some websites have an API that makes it easy to extract information. These are websites that were built with the intention of being scraped (e.g. sites that contain databases, museums, art collections, etc.). When such an API exists, it is definitely the way to go. Most websites, however, do not provide an API that can be queried.

Web scraping tools allow you to automate parts of that process, and Python is a popular language for the task.

Of note, an increasing number of websites use JavaScript to add cookies, interactivity, etc. This makes them a lot harder to scrape and requires more sophisticated tools.

In this section, we will use the package Beautiful Soup to scrape a simple site that does not rely on JavaScript.

We will use an LLM to help us in this process.

Background information

HTML and CSS

HyperText Markup Language (HTML) is the standard markup language for websites: it encodes the information related to the formatting and structure of webpages. Additionally, some of the customization can be stored in Cascading Style Sheets (CSS) files.

HTML uses tags of the form:

<some_tag>Your content</some_tag>

Some tags have attributes:

<some_tag attribute_name="attribute value">Your content</some_tag>

Examples:

Site structure:

  • <h2>This is a heading of level 2</h2>
  • <p>This is a paragraph</p>

Formatting:

  • <b>This is bold</b>
  • <a href="https://some.url">This is the text for a link</a>
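
Beautiful Soup turns tags like these into Python objects from which you can extract the text and the attributes. A minimal sketch (the HTML snippet is made up for illustration):

from bs4 import BeautifulSoup

# A made-up snippet using the kinds of tags shown above
html = '<h2>This is a heading of level 2</h2><a href="https://some.url">This is the text for a link</a>'
soup = BeautifulSoup(html, "html.parser")

heading = soup.find("h2")   # first <h2> tag
link = soup.find("a")       # first <a> tag

print(heading.text)    # This is a heading of level 2
print(link.text)       # This is the text for a link
print(link["href"])    # https://some.url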

Web scraping

Web scraping is a general term for a set of tools that automatically extract data from the web.

While most of the data on the internet is publicly available, it is illegal to scrape some sites, and you should always look into a site’s policy before attempting to scrape it. Some sites will also block you if you submit too many requests in a short amount of time, so remember to scrape responsibly.
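
As a minimal sketch of what scraping responsibly can look like, you can check the site’s robots.txt file (assumed to live at the standard location) and pause between requests:

import time
from urllib.robotparser import RobotFileParser

# Check the site's robots.txt (assumed to be at the standard location)
rp = RobotFileParser()
rp.set_url("https://trace.tennessee.edu/robots.txt")
rp.read()

url = "https://trace.tennessee.edu/utk_graddiss/index.html"
if rp.can_fetch("*", url):
    print("Generic crawlers are allowed to fetch this URL")
    # ... request the page here, then pause before the next request:
    time.sleep(1)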

Example for this workshop

We will use a website from the University of Tennessee containing a database of PhD theses from that university.

Our goal is to scrape data from this site to produce a dataframe with the date, major, and advisor for each dissertation.

Let’s look at the site

First of all, let’s have a close look at the website we want to scrape and think carefully about what we want to do. Before writing any code, it is always a good idea to be clear about what you are trying to achieve.
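
You can explore the page with your browser’s developer tools, or fetch it from Python and look at the parsed HTML directly. A quick sketch of the latter:

import requests
from bs4 import BeautifulSoup

# Fetch the first page of the dissertations database
page = requests.get("https://trace.tennessee.edu/utk_graddiss/index.html")
soup = BeautifulSoup(page.content, "html.parser")

# Print the beginning of the prettified HTML to see how the entries are marked up
print(soup.prettify()[:2000])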

To create a dataframe with the data for all the dissertations on that first page, we need to do two things:

  • Step 1: from the first page of the dissertations database, we want to scrape the list of URLs of the individual dissertation pages.

  • Step 2: once we have the URLs, we want to scrape those pages too to get the date, major, and advisor for each dissertation.

Let’s try to scrape the data

Let’s ask the LLM:

“Write Python code to scrape the following website: https://trace.tennessee.edu/utk_graddiss/index.html.”

import requests
from bs4 import BeautifulSoup
import polars as pl
from urllib.parse import urljoin

BASE_URL = "https://trace.tennessee.edu/utk_graddiss/index.html"

def get_dissertation_data():
    # Get main page and extract dissertation URLs
    main_page = requests.get(BASE_URL)
    soup = BeautifulSoup(main_page.content, "html.parser")

    # Extract individual dissertation links
    links = [urljoin(BASE_URL, a["href"])
             for a in soup.select(".article-listing a")]

    data = []

    for link in links[:10]:  # Reduced to 10 for testing; remove slice for full 100
        try:
            # Get individual dissertation page
            page = requests.get(link)
            page_soup = BeautifulSoup(page.content, "html.parser")

            # Extract required fields
            date = page_soup.select_one("#publication_date p").text.strip()
            major = page_soup.select_one("#department p").text.strip()
            advisor = page_soup.select_one("#advisor1 p").text.strip()

            data.append({
                "Date": date,
                "Major": major,
                "Advisor": advisor
            })

        except Exception as e:
            print(f"Error processing {link}: {str(e)}")

    return pl.DataFrame(data)

# Usage
df = get_dissertation_data()
print(df)
shape: (10, 3)
┌─────────┬─────────────────────────────────┬────────────────────────┐
│ Date    ┆ Major                           ┆ Advisor                │
│ ---     ┆ ---                             ┆ ---                    │
│ str     ┆ str                             ┆ str                    │
╞═════════╪═════════════════════════════════╪════════════════════════╡
│ 8-2024  ┆ History                         ┆ Jay Rubenstein         │
│ 12-2024 ┆ Biochemistry and Cellular and … ┆ Dr. Rebecca A. Prosser │
│ 5-2024  ┆ Business Administration         ┆ Larry A. Fauver        │
│ 8-2024  ┆ Electrical Engineering          ┆ Dan Wilson             │
│ 12-2024 ┆ Mechanical Engineering          ┆ Prashant Singh         │
│ 5-2024  ┆ Chemical Engineering            ┆ Steven M. Abel         │
│ 12-2024 ┆ Industrial Engineering          ┆ Anahita Khojandi       │
│ 12-2024 ┆ Education                       ┆ Clara Lee Brown        │
│ 8-2024  ┆ Geography                       ┆ Sally P Horn           │
│ 12-2024 ┆ Education                       ┆ Enlida J Romero-Hall   │
└─────────┴─────────────────────────────────┴────────────────────────┘

Your turn:

  • Did it work? Go to the site and verify the data.

  • What is the problem?

We follow up with: “This works for the first 10 dissertations, but I want all of them.”

import requests
from bs4 import BeautifulSoup
import polars as pl
from urllib.parse import urljoin
import time

BASE_URL = "https://trace.tennessee.edu/utk_graddiss/"

def get_all_dissertation_links():
    """Handles pagination to collect all dissertation links"""
    all_links = []
    page_num = 1

    while True:
        page_url = f"{BASE_URL}index{page_num}.html" if page_num > 1 else BASE_URL
        try:
            response = requests.get(page_url)
            response.raise_for_status()

            soup = BeautifulSoup(response.content, "html.parser")
            links = [urljoin(BASE_URL, a["href"])
                    for a in soup.select(".article-listing a")]

            if not links:
                break

            all_links.extend(links)
            page_num += 1
            time.sleep(1)  # Be polite to the server

        except Exception as e:
            print(f"Stopped at page {page_num}: {str(e)}")
            break

    return all_links

def get_dissertation_data(link):
    """Scrapes individual dissertation page with error handling"""
    try:
        page = requests.get(link, timeout=10)
        page.raise_for_status()
        soup = BeautifulSoup(page.content, "html.parser")

        return {
            "Date": soup.select_one("#publication_date p").text.strip(),
            "Major": soup.select_one("#department p").text.strip(),
            "Advisor": soup.select_one("#advisor1 p").text.strip(),
            "URL": link
        }

    except Exception as e:
        print(f"Failed on {link}: {str(e)}")
        return None

def build_full_dataset():
    """Main function to create complete dataset"""
    links = get_all_dissertation_links()
    print(f"Found {len(links)} dissertations to process")

    data = []
    for link in links:
        result = get_dissertation_data(link)
        if result:
            data.append(result)
        time.sleep(0.5)  # Throttle requests

    return pl.DataFrame(data).unique()  # Remove duplicates

# Usage
full_df = build_full_dataset()
print(f"Final dataset contains {len(full_df)} records")
full_df.head()
Stopped at page 2: 404 Client Error: Not Found for url: https://trace.tennessee.edu/utk_graddiss/index2.html
Found 100 dissertations to process
Final dataset contains 100 records
shape: (5, 4)
┌─────────┬─────────────────────────────────┬──────────────────┬─────────────────────────────────┐
│ Date    ┆ Major                           ┆ Advisor          ┆ URL                             │
│ ---     ┆ ---                             ┆ ---              ┆ ---                             │
│ str     ┆ str                             ┆ str              ┆ str                             │
╞═════════╪═════════════════════════════════╪══════════════════╪═════════════════════════════════╡
│ 5-2024  ┆ Business Administration         ┆ Matthew Serfling ┆ https://trace.tennessee.edu/ut… │
│ 8-2024  ┆ History                         ┆ Jay Rubenstein   ┆ https://trace.tennessee.edu/ut… │
│ 8-2024  ┆ Educational Psychology and Res… ┆ Mitsunori Misawa ┆ https://trace.tennessee.edu/ut… │
│ 8-2024  ┆ Physics                         ┆ George Siopsis   ┆ https://trace.tennessee.edu/ut… │
│ 5-2024  ┆ Industrial Engineering          ┆ Hoon Hwangbo     ┆ https://trace.tennessee.edu/ut… │
└─────────┴─────────────────────────────────┴──────────────────┴─────────────────────────────────┘

Here, things fall apart. The first code worked, but didn’t scrape all the data. This one is just a mess: it guesses a pagination URL that doesn’t exist (index2.html returns a 404), so it still only collects the 100 dissertations listed on the first page.

So let’s go back to the first code and try to work through it. As you can see, LLMs are impressive for easy tasks but, at this point, far from perfect when things become more elaborate. They can still truly help you get there, though.

The first code contained the comment “Reduced to 10 for testing; remove slice for full 100”. If you don’t know what a slice is, you can ask the LLM:

“What is a slice in Python?” “How do I remove the slice in the following Python code: for link in links[:10]:”
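
For reference, a slice selects a subset of a sequence. A quick illustration with a made-up list:

letters = ["a", "b", "c", "d", "e"]

print(letters[:3])   # slice: the first three elements -> ['a', 'b', 'c']
print(letters)       # no slice: the whole list

for letter in letters[:3]:   # loops over the first three elements only
    print(letter)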

import requests
from bs4 import BeautifulSoup
import polars as pl
from urllib.parse import urljoin

BASE_URL = "https://trace.tennessee.edu/utk_graddiss/index.html"

def get_dissertation_data():
    # Get main page and extract dissertation URLs
    main_page = requests.get(BASE_URL)
    soup = BeautifulSoup(main_page.content, "html.parser")

    # Extract individual dissertation links
    links = [urljoin(BASE_URL, a["href"])
             for a in soup.select(".article-listing a")]

    data = []

    for link in links:
        try:
            # Get individual dissertation page
            page = requests.get(link)
            page_soup = BeautifulSoup(page.content, "html.parser")

            # Extract required fields
            date = page_soup.select_one("#publication_date p").text.strip()
            major = page_soup.select_one("#department p").text.strip()
            advisor = page_soup.select_one("#advisor1 p").text.strip()

            data.append({
                "Date": date,
                "Major": major,
                "Advisor": advisor
            })

        except Exception as e:
            print(f"Error processing {link}: {str(e)}")

    return pl.DataFrame(data)

# Usage
df = get_dissertation_data()
print(df)
shape: (100, 3)
┌─────────┬─────────────────────────────────┬────────────────────────┐
│ Date    ┆ Major                           ┆ Advisor                │
│ ---     ┆ ---                             ┆ ---                    │
│ str     ┆ str                             ┆ str                    │
╞═════════╪═════════════════════════════════╪════════════════════════╡
│ 8-2024  ┆ History                         ┆ Jay Rubenstein         │
│ 12-2024 ┆ Biochemistry and Cellular and … ┆ Dr. Rebecca A. Prosser │
│ 5-2024  ┆ Business Administration         ┆ Larry A. Fauver        │
│ 8-2024  ┆ Electrical Engineering          ┆ Dan Wilson             │
│ 12-2024 ┆ Mechanical Engineering          ┆ Prashant Singh         │
│ …       ┆ …                               ┆ …                      │
│ 5-2024  ┆ Mechanical Engineering          ┆ Feng-Yuan Zhang        │
│ 8-2024  ┆ Computer Science                ┆ Scott Ruoti            │
│ 8-2024  ┆ Mechanical Engineering          ┆ Tony L. Schmitz        │
│ 8-2024  ┆ Microbiology                    ┆ Shigetoshi Eda         │
│ 8-2024  ┆ Energy Science and Engineering  ┆ David C. Donovan       │
└─────────┴─────────────────────────────────┴────────────────────────┘
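
This gives us all 100 dissertations listed on the first index page. Building on this working version, we can keep iterating with the LLM, for instance to follow the pagination to the other index pages and to handle missing fields gracefully: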
import requests
from bs4 import BeautifulSoup
import polars as pl
from urllib.parse import urljoin
import time

BASE_URL = "https://trace.tennessee.edu/utk_graddiss/"
FIRST_PAGE = BASE_URL + "index.html"

def get_all_dissertation_links():
    links = []
    page_num = 1

    while True:
        if page_num == 1:
            page_url = FIRST_PAGE
        else:
            page_url = f"{BASE_URL}index{page_num}.html"

        print(f"Scraping: {page_url}")
        response = requests.get(page_url)
        if response.status_code != 200:
            break  # No more pages

        soup = BeautifulSoup(response.content, "html.parser")
        page_links = [urljoin(page_url, a["href"])
                      for a in soup.select(".article-listing a")]

        if not page_links:
            break  # No more dissertation links found

        links.extend(page_links)
        page_num += 1
        time.sleep(0.5)  # Be polite to the server

    return links

def get_dissertation_data():
    links = get_all_dissertation_links()
    data = []

    for link in links:
        try:
            page = requests.get(link)
            page_soup = BeautifulSoup(page.content, "html.parser")

            date = page_soup.select_one("#publication_date p")
            major = page_soup.select_one("#department p")
            advisor = page_soup.select_one("#advisor1 p")

            # Handle missing fields gracefully
            date = date.text.strip() if date else ""
            major = major.text.strip() if major else ""
            advisor = advisor.text.strip() if advisor else ""

            data.append({
                "Date": date,
                "Major": major,
                "Advisor": advisor
            })

        except Exception as e:
            print(f"Error processing {link}: {str(e)}")
        time.sleep(0.2)  # Be polite to the server

    return pl.DataFrame(data)

# Usage
df = get_dissertation_data()
print(df)
Scraping: https://trace.tennessee.edu/utk_graddiss/index.html
Scraping: https://trace.tennessee.edu/utk_graddiss/index2.html
shape: (100, 3)
┌─────────┬─────────────────────────────────┬────────────────────────┐
│ Date    ┆ Major                           ┆ Advisor                │
│ ---     ┆ ---                             ┆ ---                    │
│ str     ┆ str                             ┆ str                    │
╞═════════╪═════════════════════════════════╪════════════════════════╡
│ 8-2024  ┆ History                         ┆ Jay Rubenstein         │
│ 12-2024 ┆ Biochemistry and Cellular and … ┆ Dr. Rebecca A. Prosser │
│ 5-2024  ┆ Business Administration         ┆ Larry A. Fauver        │
│ 8-2024  ┆ Electrical Engineering          ┆ Dan Wilson             │
│ 12-2024 ┆ Mechanical Engineering          ┆ Prashant Singh         │
│ …       ┆ …                               ┆ …                      │
│ 5-2024  ┆ Mechanical Engineering          ┆ Feng-Yuan Zhang        │
│ 8-2024  ┆ Computer Science                ┆ Scott Ruoti            │
│ 8-2024  ┆ Mechanical Engineering          ┆ Tony L. Schmitz        │
│ 8-2024  ┆ Microbiology                    ┆ Shigetoshi Eda         │
│ 8-2024  ┆ Energy Science and Engineering  ┆ David C. Donovan       │
└─────────┴─────────────────────────────────┴────────────────────────┘
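
Once we are happy with the result, we can write the dataframe to disk for later analysis, for instance as a CSV file (the file name here is arbitrary):

# Save the scraped data to a CSV file with polars
df.write_csv("dissertations.csv")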