def get_dissertation_data(BASE_URL):
    # Get main page and extract dissertation URLs
    main_page = requests.get(BASE_URL)
    soup = BeautifulSoup(main_page.content, "html.parser")

    # Extract individual dissertation links
    links = [urljoin(BASE_URL, a["href"])
             for a in soup.select(".article-listing a")]

    data = []
    for link in links[:10]:  # Reduced to 10 for testing; remove slice for full 100
        try:
            # Get individual dissertation page
            page = requests.get(link)
            page_soup = BeautifulSoup(page.content, "html.parser")

            # Extract required fields
            date = page_soup.select_one("#publication_date p").text.strip()
            major = page_soup.select_one("#department p").text.strip()
            advisor = page_soup.select_one("#advisor1 p").text.strip()

            data.append({
                "Date": date,
                "Major": major,
                "Advisor": advisor
            })
        except Exception as e:
            print(f"Error processing {link}: {e}")

    return pl.DataFrame(data)

Getting the full data
We can verify that it still works:
df = get_dissertation_data(BASE_URL)
print(df)

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[2], line 1
----> 1 df = get_dissertation_data(BASE_URL)
      2 print(df)

NameError: name 'BASE_URL' is not defined

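Note that we now have to pass the argument BASE_URL to the function. BASE_URL must therefore be defined before the call; otherwise Python raises a NameError, as in the traceback above.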
Let’s look at the range function to understand how it works:
for i in range(5):
    print(i)

0
1
2
3
4

range(5) is the same as range(0, 5). It starts at 0 (Python starts indexing at 0, and the left boundary is included) and stops at 4, because the right boundary (5 here) is excluded.
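As a quick check of these boundary rules, the short sketch below converts range objects to lists so that their contents can be compared directly. It uses only built-in functions; the variable names same and numbers are illustrative.

```python
# range(5) and range(0, 5) produce the same numbers
same = list(range(5)) == list(range(0, 5))
print(same)  # True

# The left boundary (0) is included, the right boundary (5) is excluded
numbers = list(range(5))
print(numbers[0], numbers[-1])  # 0 4

# len() gives the number of elements without printing them all
print(len(range(5)))  # 5
```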
So range(116) would go from 0 to 115. You could verify it with:
for i in range(116):
    print(i)

Your turn:
- We want numbers from 2 to 117, so what arguments do we need to pass to the range function?
- How can you test it?
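One possible way to test a candidate answer without reading through every printed number is to check the first element, the last element, and the length. The sketch below assumes the candidate arguments are 2 and 118; the variable name pages is only illustrative.

```python
# A range object supports indexing and len(), so there is no need to loop
pages = range(2, 118)

print(pages[0])    # first number: 2
print(pages[-1])   # last number: 117
print(len(pages))  # how many numbers: 116
```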
Applied to the series of webpages, that would be:
for i in range(2, 118):
    print(f"https://trace.tennessee.edu/utk_graddiss/index.{i}.html")

This is good, so let’s create a list with those webpages.
First, we initialize an empty list of the proper length: 116 elements, one for each page from 2 to 117. Preallocating spares Python from growing the list during the loop, although for a list this small the gain is negligible:

url_list = [None] * 116

Now we can fill in the list with the URLs with a loop:
for i in range(2, 118):
    # Page numbers start at 2, but list indices start at 0
    url_list[i - 2] = f"https://trace.tennessee.edu/utk_graddiss/index.{i}.html"

Note the i - 2 offset: the page numbers run from 2 to 117, but the indices of a list of 116 elements run from 0 to 115. Using i directly as the index leaves the first two elements as None and raises "IndexError: list assignment index out of range" once i reaches 116.

Let’s print our list to make sure that all is good (the output is shortened here; the full list contains 116 URLs):

print(url_list)

['https://trace.tennessee.edu/utk_graddiss/index.2.html', 'https://trace.tennessee.edu/utk_graddiss/index.3.html', ..., 'https://trace.tennessee.edu/utk_graddiss/index.117.html']
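As an alternative sketch, the same list can be built with a list comprehension. This avoids index arithmetic altogether, because no position in the list is ever assigned by hand. The names base and url_list_alt are illustrative and not part of the code above.

```python
# URL template with a placeholder for the page number
base = "https://trace.tennessee.edu/utk_graddiss/index.{}.html"

# One URL per page number, from 2 to 117 inclusive
url_list_alt = [base.format(i) for i in range(2, 118)]

# Check the length and both ends rather than printing everything
print(len(url_list_alt))
print(url_list_alt[0])
print(url_list_alt[-1])
```

Checking the length and the two ends of the list is usually enough to catch an off-by-one mistake in the range boundaries.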