The internet is a trove of information. A lot of it is publicly available and thus suitable for use in research. Extracting that information and putting it into an organized format for analysis can, however, be extremely tedious.
Some websites have an API that makes it easy to extract information. These are websites that were built with the intention of sharing their data programmatically (e.g. sites hosting databases, museum collections, art collections, etc.). When an API is available, it is definitely the way to go. Most websites, however, do not provide an API that can be queried.
Web scraping tools make it possible to automate parts of that process, and Python is a popular language for the task.
Of note, an increasing number of websites use JavaScript to add cookies, interactivity, etc. This makes them a lot harder to scrape and requires more sophisticated tools.
In this section, we will use the package Beautiful Soup to scrape a simple site that does not contain any JavaScript.
We will use an LLM to help us in this process.
Background information
HTML and CSS
HyperText Markup Language (HTML) is the standard markup language for websites: it encodes the information related to the formatting and structure of webpages. Additionally, some of the customization can be stored in Cascading Style Sheets (CSS) files. For example, a hyperlink is encoded in HTML with the <a> tag:
<a href="https://some.url">This is the text for a link</a>
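Beautiful Soup parses this kind of markup into a tree of tags that can then be queried. As a minimal sketch using only the example link above:

from bs4 import BeautifulSoup

html = '<a href="https://some.url">This is the text for a link</a>'
soup = BeautifulSoup(html, "html.parser")

link = soup.find("a")     # the first <a> tag in the document
print(link["href"])       # https://some.url
print(link.get_text())    # This is the text for a link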
Web scraping
Web scraping is a general term for a set of tools that allow data to be extracted from the web automatically.
While most of the data on the internet is publicly available, it is illegal to scrape some sites and you should always look into the policy of a site before attempting to scrape it. Some sites will also block you if you submit too many requests in a short amount of time, so remember to scrape responsibly.
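One quick way to check part of a site's policy programmatically is to read its robots.txt file with Python's standard library. This is only a sketch using the site we will scrape below, and it does not replace reading the site's terms of use:

from urllib import robotparser

# Check whether generic crawlers may fetch the dissertations index page
rp = robotparser.RobotFileParser()
rp.set_url("https://trace.tennessee.edu/robots.txt")
rp.read()
print(rp.can_fetch("*", "https://trace.tennessee.edu/utk_graddiss/index.html"))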
Our goal is to scrape data from this site (https://trace.tennessee.edu/utk_graddiss/index.html) to produce a dataframe with the date, major, and advisor for each dissertation.
Let’s look at the site
First of all, let’s have a close look at the website we want to scrape to think carefully about what we want to do. Before starting to write code, it is always a good idea to think about what you are trying to achieve with your code.
To create a dataframe with the data for all the dissertations on that first page, we need to do two things:
Step 1: scrape the listing page to collect the URL of each dissertation's individual page.
Step 2: once we have the URLs, we want to scrape those pages too to get the date, major, and advisor for each dissertation.
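A quick, exploratory way to have that close look in code is to fetch the listing page and print a few of the links it contains, just to see how the dissertations are listed. This is only a sketch (it assumes requests and Beautiful Soup are installed):

import requests
from bs4 import BeautifulSoup

# Fetch the listing page and parse it
url = "https://trace.tennessee.edu/utk_graddiss/index.html"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")

# Print the first few links to get a feel for the page structure
for a in soup.find_all("a")[:5]:
    print(a.get("href"), "->", a.get_text(strip=True)[:60])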
Let’s try to scrape the data
Write Python code to scrape the following website: https://trace.tennessee.edu/utk_graddiss/index.html.
import requests
from bs4 import BeautifulSoup
import polars as pl
from urllib.parse import urljoin

BASE_URL = "https://trace.tennessee.edu/utk_graddiss/index.html"

def get_dissertation_data():
    # Get main page and extract dissertation URLs
    main_page = requests.get(BASE_URL)
    soup = BeautifulSoup(main_page.content, "html.parser")

    # Extract individual dissertation links
    links = [urljoin(BASE_URL, a["href"]) for a in soup.select(".article-listing a")]

    data = []
    for link in links[:10]:  # Reduced to 10 for testing; remove slice for full 100
        try:
            # Get individual dissertation page
            page = requests.get(link)
            page_soup = BeautifulSoup(page.content, "html.parser")

            # Extract required fields
            date = page_soup.select_one("#publication_date p").text.strip()
            major = page_soup.select_one("#department p").text.strip()
            advisor = page_soup.select_one("#advisor1 p").text.strip()

            data.append({
                "Date": date,
                "Major": major,
                "Advisor": advisor
            })
        except Exception as e:
            print(f"Error processing {link}: {str(e)}")

    return pl.DataFrame(data)

# Usage
df = get_dissertation_data()
print(df)
shape: (10, 3)
┌─────────┬─────────────────────────────────┬────────────────────────┐
│ Date ┆ Major ┆ Advisor │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═════════╪═════════════════════════════════╪════════════════════════╡
│ 8-2024 ┆ History ┆ Jay Rubenstein │
│ 12-2024 ┆ Biochemistry and Cellular and … ┆ Dr. Rebecca A. Prosser │
│ 5-2024 ┆ Business Administration ┆ Larry A. Fauver │
│ 8-2024 ┆ Electrical Engineering ┆ Dan Wilson │
│ 12-2024 ┆ Mechanical Engineering ┆ Prashant Singh │
│ 5-2024 ┆ Chemical Engineering ┆ Steven M. Abel │
│ 12-2024 ┆ Industrial Engineering ┆ Anahita Khojandi │
│ 12-2024 ┆ Education ┆ Clara Lee Brown │
│ 8-2024 ┆ Geography ┆ Sally P Horn │
│ 12-2024 ┆ Education ┆ Enlida J Romero-Hall │
└─────────┴─────────────────────────────────┴────────────────────────┘
Your turn:
Did it work? Go to the site and verify the data.
What is the problem?
This works for the first 10 dissertations, but I want all of them.
import requests
from bs4 import BeautifulSoup
import polars as pl
from urllib.parse import urljoin
import time

BASE_URL = "https://trace.tennessee.edu/utk_graddiss/"

def get_all_dissertation_links():
    """Handles pagination to collect all dissertation links"""
    all_links = []
    page_num = 1
    while True:
        page_url = f"{BASE_URL}index{page_num}.html" if page_num > 1 else BASE_URL
        try:
            response = requests.get(page_url)
            response.raise_for_status()
            soup = BeautifulSoup(response.content, "html.parser")
            links = [urljoin(BASE_URL, a["href"]) for a in soup.select(".article-listing a")]
            if not links:
                break
            all_links.extend(links)
            page_num += 1
            time.sleep(1)  # Be polite to the server
        except Exception as e:
            print(f"Stopped at page {page_num}: {str(e)}")
            break
    return all_links

def get_dissertation_data(link):
    """Scrapes individual dissertation page with error handling"""
    try:
        page = requests.get(link, timeout=10)
        page.raise_for_status()
        soup = BeautifulSoup(page.content, "html.parser")
        return {
            "Date": soup.select_one("#publication_date p").text.strip(),
            "Major": soup.select_one("#department p").text.strip(),
            "Advisor": soup.select_one("#advisor1 p").text.strip(),
            "URL": link
        }
    except Exception as e:
        print(f"Failed on {link}: {str(e)}")
        return None

def build_full_dataset():
    """Main function to create complete dataset"""
    links = get_all_dissertation_links()
    print(f"Found {len(links)} dissertations to process")
    data = []
    for link in links:
        result = get_dissertation_data(link)
        if result:
            data.append(result)
        time.sleep(0.5)  # Throttle requests
    return pl.DataFrame(data).unique()  # Remove duplicates

# Usage
full_df = build_full_dataset()
print(f"Final dataset contains {len(full_df)} records")
full_df.head()
Stopped at page 2: 404 Client Error: Not Found for url: https://trace.tennessee.edu/utk_graddiss/index2.html
Found 100 dissertations to process
Final dataset contains 100 records
shape: (5, 4)
┌─────────┬─────────────────────────────────┬───────────────────┬─────────────────────────────────┐
│ Date    ┆ Major                           ┆ Advisor           ┆ URL                             │
│ ---     ┆ ---                             ┆ ---               ┆ ---                             │
│ str     ┆ str                             ┆ str               ┆ str                             │
╞═════════╪═════════════════════════════════╪═══════════════════╪═════════════════════════════════╡
│ 5-2024  ┆ Business Administration         ┆ Matthew Serfling  ┆ https://trace.tennessee.edu/ut… │
│ 8-2024  ┆ History                         ┆ Jay Rubenstein    ┆ https://trace.tennessee.edu/ut… │
│ 8-2024  ┆ Educational Psychology and Res… ┆ Mitsunori Misawa  ┆ https://trace.tennessee.edu/ut… │
│ 8-2024  ┆ Physics                         ┆ George Siopsis    ┆ https://trace.tennessee.edu/ut… │
│ 5-2024  ┆ Industrial Engineering          ┆ Hoon Hwangbo      ┆ https://trace.tennessee.edu/ut… │
└─────────┴─────────────────────────────────┴───────────────────┴─────────────────────────────────┘
Here, things fall apart. The first code worked but didn't scrape all the data, and this one is just a mess.
So let's go back to the first code and try to work through it ourselves. As you can see, LLMs are impressive for easy tasks, but at this point they are far from perfect when things become more elaborate. They can still truly help you get there, though.
The first code contained this comment: “Reduced to 10 for testing; remove slice for full 100”. Let's ask the LLM about it:
“What is a slice in Python?” “How do I remove the slice in the following Python code: for link in links[:10]:”
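For reference, a slice such as links[:10] selects only part of a list. A short illustration:

# A slice selects a subset of a list
links = ["a", "b", "c", "d", "e"]
print(links[:3])    # ['a', 'b', 'c']: only the first three elements
print(links[:10])   # the whole list: a slice never goes past the end

# Looping over `links` instead of `links[:10]` therefore visits every element
for link in links:
    print(link)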
import requests
from bs4 import BeautifulSoup
import polars as pl
from urllib.parse import urljoin

BASE_URL = "https://trace.tennessee.edu/utk_graddiss/index.html"

def get_dissertation_data():
    # Get main page and extract dissertation URLs
    main_page = requests.get(BASE_URL)
    soup = BeautifulSoup(main_page.content, "html.parser")

    # Extract individual dissertation links
    links = [urljoin(BASE_URL, a["href"]) for a in soup.select(".article-listing a")]

    data = []
    for link in links:
        try:
            # Get individual dissertation page
            page = requests.get(link)
            page_soup = BeautifulSoup(page.content, "html.parser")

            # Extract required fields
            date = page_soup.select_one("#publication_date p").text.strip()
            major = page_soup.select_one("#department p").text.strip()
            advisor = page_soup.select_one("#advisor1 p").text.strip()

            data.append({
                "Date": date,
                "Major": major,
                "Advisor": advisor
            })
        except Exception as e:
            print(f"Error processing {link}: {str(e)}")

    return pl.DataFrame(data)

# Usage
df = get_dissertation_data()
print(df)
shape: (100, 3)
┌─────────┬─────────────────────────────────┬────────────────────────┐
│ Date ┆ Major ┆ Advisor │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═════════╪═════════════════════════════════╪════════════════════════╡
│ 8-2024 ┆ History ┆ Jay Rubenstein │
│ 12-2024 ┆ Biochemistry and Cellular and … ┆ Dr. Rebecca A. Prosser │
│ 5-2024 ┆ Business Administration ┆ Larry A. Fauver │
│ 8-2024 ┆ Electrical Engineering ┆ Dan Wilson │
│ 12-2024 ┆ Mechanical Engineering ┆ Prashant Singh │
│ … ┆ … ┆ … │
│ 5-2024 ┆ Mechanical Engineering ┆ Feng-Yuan Zhang │
│ 8-2024 ┆ Computer Science ┆ Scott Ruoti │
│ 8-2024 ┆ Mechanical Engineering ┆ Tony L. Schmitz │
│ 8-2024 ┆ Microbiology ┆ Shigetoshi Eda │
│ 8-2024 ┆ Energy Science and Engineering ┆ David C. Donovan │
└─────────┴─────────────────────────────────┴────────────────────────┘
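Removing the slice gives us all 100 dissertations from the first listing page. The last step is to handle pagination, in case the dissertations are spread over several index pages: keep requesting the next index page until the server stops returning results, and handle missing fields gracefully along the way. Here is a version that does that: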
import requests
from bs4 import BeautifulSoup
import polars as pl
from urllib.parse import urljoin
import time

BASE_URL = "https://trace.tennessee.edu/utk_graddiss/"
FIRST_PAGE = BASE_URL + "index.html"

def get_all_dissertation_links():
    links = []
    page_num = 1
    while True:
        if page_num == 1:
            page_url = FIRST_PAGE
        else:
            page_url = f"{BASE_URL}index{page_num}.html"
        print(f"Scraping: {page_url}")
        response = requests.get(page_url)
        if response.status_code != 200:
            break  # No more pages
        soup = BeautifulSoup(response.content, "html.parser")
        page_links = [urljoin(page_url, a["href"]) for a in soup.select(".article-listing a")]
        if not page_links:
            break  # No more dissertation links found
        links.extend(page_links)
        page_num += 1
        time.sleep(0.5)  # Be polite to the server
    return links

def get_dissertation_data():
    links = get_all_dissertation_links()
    data = []
    for link in links:
        try:
            page = requests.get(link)
            page_soup = BeautifulSoup(page.content, "html.parser")
            date = page_soup.select_one("#publication_date p")
            major = page_soup.select_one("#department p")
            advisor = page_soup.select_one("#advisor1 p")

            # Handle missing fields gracefully
            date = date.text.strip() if date else ""
            major = major.text.strip() if major else ""
            advisor = advisor.text.strip() if advisor else ""

            data.append({
                "Date": date,
                "Major": major,
                "Advisor": advisor
            })
        except Exception as e:
            print(f"Error processing {link}: {str(e)}")
        time.sleep(0.2)  # Be polite to the server
    return pl.DataFrame(data)

# Usage
df = get_dissertation_data()
print(df)
Scraping: https://trace.tennessee.edu/utk_graddiss/index.html
Scraping: https://trace.tennessee.edu/utk_graddiss/index2.html
shape: (100, 3)
┌─────────┬─────────────────────────────────┬────────────────────────┐
│ Date ┆ Major ┆ Advisor │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═════════╪═════════════════════════════════╪════════════════════════╡
│ 8-2024 ┆ History ┆ Jay Rubenstein │
│ 12-2024 ┆ Biochemistry and Cellular and … ┆ Dr. Rebecca A. Prosser │
│ 5-2024 ┆ Business Administration ┆ Larry A. Fauver │
│ 8-2024 ┆ Electrical Engineering ┆ Dan Wilson │
│ 12-2024 ┆ Mechanical Engineering ┆ Prashant Singh │
│ … ┆ … ┆ … │
│ 5-2024 ┆ Mechanical Engineering ┆ Feng-Yuan Zhang │
│ 8-2024 ┆ Computer Science ┆ Scott Ruoti │
│ 8-2024 ┆ Mechanical Engineering ┆ Tony L. Schmitz │
│ 8-2024 ┆ Microbiology ┆ Shigetoshi Eda │
│ 8-2024 ┆ Energy Science and Engineering ┆ David C. Donovan │
└─────────┴─────────────────────────────────┴────────────────────────┘
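Once the dataframe looks correct, you will probably want to save it for later analysis. As a minimal sketch (the file name is just an example), polars can write it straight to a CSV file:

# Save the scraped data to a CSV file for later analysis (example file name)
df.write_csv("dissertations.csv")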