def get_dissertation_data(BASE_URL):
    # Get main page and extract dissertation URLs
    main_page = requests.get(BASE_URL)
    soup = BeautifulSoup(main_page.content, "html.parser")

    # Extract individual dissertation links
    links = [urljoin(BASE_URL, a["href"])
             for a in soup.select(".article-listing a")]

    data = []

    for link in links[:10]:  # Reduced to 10 for testing; remove slice for full 100
        try:
            # Get individual dissertation page
            page = requests.get(link)
            page_soup = BeautifulSoup(page.content, "html.parser")

            # Extract required fields
            date = page_soup.select_one("#publication_date p").text.strip()
            major = page_soup.select_one("#department p").text.strip()
            advisor = page_soup.select_one("#advisor1 p").text.strip()

            data.append({"Date": date,
                         "Major": major,
                         "Advisor": advisor
                         })
        except Exception as e:
            print(f"Error processing {link}: {str(e)}")

    return pl.DataFrame(data)
Getting the full data
We can verify that it still works:
df = get_dissertation_data(BASE_URL)
print(df)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[2], line 1
----> 1 df = get_dissertation_data(BASE_URL)
      2 print(df)

NameError: name 'BASE_URL' is not defined
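Note that we now have to pass the argument BASE_URL to the function, so BASE_URL must be defined before the call. The NameError above is what Python raises when it is not.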
Let’s look at the range function to understand how it works:
for i in range(5):
    print(i)
0
1
2
3
4
range(5) is the same as range(0, 5). It goes from 0 (Python starts indexing at 0, and the left boundary is included) to 4, because the right boundary (5 here) is excluded.
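The boundary rules can also be checked directly, without a loop. This is a small sketch using only built-in Python; the variable name r is just an illustration:

```python
# range(stop) is shorthand for range(0, stop)
assert list(range(5)) == list(range(0, 5))

r = range(5)
print(list(r))  # [0, 1, 2, 3, 4]
print(r.start)  # 0: the left boundary is included
print(r.stop)   # 5: the right boundary is excluded
print(len(r))   # 5
print(5 in r)   # False
```

Converting the range to a list is handy for small ranges because it shows every value at once.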
So range(116) would go from 0 to 115. You could verify it with:
for i in range(116):
    print(i)
Your turn:
- We want numbers from 2 to 117, so what arguments do we need to pass to the range function?
- How can you test it?
Applied to the series of webpages, that would be:
for i in range(2, 118):
    print(f"https://trace.tennessee.edu/utk_graddiss/index.{i}.html")
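Printing all 116 lines is a lot to read through. One possible shortcut, sketched here, is to look only at the first element, the last element, and the length of the range. The names pages, first_url, and last_url are made up for this illustration:

```python
pages = range(2, 118)

print(pages[0])    # 2
print(pages[-1])   # 117
print(len(pages))  # 116

# Build only the first and last URLs to check both ends of the series
first_url = f"https://trace.tennessee.edu/utk_graddiss/index.{pages[0]}.html"
last_url = f"https://trace.tennessee.edu/utk_graddiss/index.{pages[-1]}.html"
print(first_url)
print(last_url)
```

If both ends and the length are right, the values in between are right too, since a range increases by 1 at each step by default.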
This is good, so let’s create a list with those webpages.
First, we initialize a list of the proper length, filled with None placeholders (preallocating spares Python from having to grow the list while the loop runs):
url_list = [None] * 116
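As an aside, how much preallocation helps is something you can measure rather than assume. The sketch below uses the standard timeit module to compare filling a preallocated list with appending to an empty one. The function names and the number of repetitions are arbitrary choices for this illustration, and the timings will vary with your machine and Python version:

```python
import timeit

N = 116  # same length as our list of pages

def fill_preallocated():
    out = [None] * N       # allocate all slots up front
    for i in range(N):
        out[i] = i         # overwrite each placeholder
    return out

def fill_append():
    out = []               # start empty
    for i in range(N):
        out.append(i)      # let the list grow as needed
    return out

# Both approaches must produce the same list
assert fill_preallocated() == fill_append()

t_pre = timeit.timeit(fill_preallocated, number=10_000)
t_app = timeit.timeit(fill_append, number=10_000)
print(f"preallocated: {t_pre:.4f} s")
print(f"append:       {t_app:.4f} s")
```

For a list this short the difference is likely to be small, because Python lists reserve some spare capacity when they grow instead of reallocating on every append.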
Now we can fill in the list with the URLs with a loop. The page numbers start at 2 while list indices start at 0, so we subtract 2 from i to get the position in the list:
for i in range(2, 118):
    url_list[i - 2] = f"https://trace.tennessee.edu/utk_graddiss/index.{i}.html"
Let’s print our list to make sure that all is good:
print(url_list)
['https://trace.tennessee.edu/utk_graddiss/index.2.html', 'https://trace.tennessee.edu/utk_graddiss/index.3.html', 'https://trace.tennessee.edu/utk_graddiss/index.4.html', 'https://trace.tennessee.edu/utk_graddiss/index.5.html', 'https://trace.tennessee.edu/utk_graddiss/index.6.html', 'https://trace.tennessee.edu/utk_graddiss/index.7.html', 'https://trace.tennessee.edu/utk_graddiss/index.8.html', 'https://trace.tennessee.edu/utk_graddiss/index.9.html', 'https://trace.tennessee.edu/utk_graddiss/index.10.html', 'https://trace.tennessee.edu/utk_graddiss/index.11.html', 'https://trace.tennessee.edu/utk_graddiss/index.12.html', 'https://trace.tennessee.edu/utk_graddiss/index.13.html', 'https://trace.tennessee.edu/utk_graddiss/index.14.html', 'https://trace.tennessee.edu/utk_graddiss/index.15.html', 'https://trace.tennessee.edu/utk_graddiss/index.16.html', 'https://trace.tennessee.edu/utk_graddiss/index.17.html', 'https://trace.tennessee.edu/utk_graddiss/index.18.html', 'https://trace.tennessee.edu/utk_graddiss/index.19.html', 'https://trace.tennessee.edu/utk_graddiss/index.20.html', 'https://trace.tennessee.edu/utk_graddiss/index.21.html', 'https://trace.tennessee.edu/utk_graddiss/index.22.html', 'https://trace.tennessee.edu/utk_graddiss/index.23.html', 'https://trace.tennessee.edu/utk_graddiss/index.24.html', 'https://trace.tennessee.edu/utk_graddiss/index.25.html', 'https://trace.tennessee.edu/utk_graddiss/index.26.html', 'https://trace.tennessee.edu/utk_graddiss/index.27.html', 'https://trace.tennessee.edu/utk_graddiss/index.28.html', 'https://trace.tennessee.edu/utk_graddiss/index.29.html', 'https://trace.tennessee.edu/utk_graddiss/index.30.html', 'https://trace.tennessee.edu/utk_graddiss/index.31.html', 'https://trace.tennessee.edu/utk_graddiss/index.32.html', 'https://trace.tennessee.edu/utk_graddiss/index.33.html', 'https://trace.tennessee.edu/utk_graddiss/index.34.html', 'https://trace.tennessee.edu/utk_graddiss/index.35.html',
'https://trace.tennessee.edu/utk_graddiss/index.36.html', 'https://trace.tennessee.edu/utk_graddiss/index.37.html', 'https://trace.tennessee.edu/utk_graddiss/index.38.html', 'https://trace.tennessee.edu/utk_graddiss/index.39.html', 'https://trace.tennessee.edu/utk_graddiss/index.40.html', 'https://trace.tennessee.edu/utk_graddiss/index.41.html', 'https://trace.tennessee.edu/utk_graddiss/index.42.html', 'https://trace.tennessee.edu/utk_graddiss/index.43.html', 'https://trace.tennessee.edu/utk_graddiss/index.44.html', 'https://trace.tennessee.edu/utk_graddiss/index.45.html', 'https://trace.tennessee.edu/utk_graddiss/index.46.html', 'https://trace.tennessee.edu/utk_graddiss/index.47.html', 'https://trace.tennessee.edu/utk_graddiss/index.48.html', 'https://trace.tennessee.edu/utk_graddiss/index.49.html', 'https://trace.tennessee.edu/utk_graddiss/index.50.html', 'https://trace.tennessee.edu/utk_graddiss/index.51.html', 'https://trace.tennessee.edu/utk_graddiss/index.52.html', 'https://trace.tennessee.edu/utk_graddiss/index.53.html', 'https://trace.tennessee.edu/utk_graddiss/index.54.html', 'https://trace.tennessee.edu/utk_graddiss/index.55.html', 'https://trace.tennessee.edu/utk_graddiss/index.56.html', 'https://trace.tennessee.edu/utk_graddiss/index.57.html', 'https://trace.tennessee.edu/utk_graddiss/index.58.html', 'https://trace.tennessee.edu/utk_graddiss/index.59.html', 'https://trace.tennessee.edu/utk_graddiss/index.60.html', 'https://trace.tennessee.edu/utk_graddiss/index.61.html', 'https://trace.tennessee.edu/utk_graddiss/index.62.html', 'https://trace.tennessee.edu/utk_graddiss/index.63.html', 'https://trace.tennessee.edu/utk_graddiss/index.64.html', 'https://trace.tennessee.edu/utk_graddiss/index.65.html', 'https://trace.tennessee.edu/utk_graddiss/index.66.html', 'https://trace.tennessee.edu/utk_graddiss/index.67.html', 'https://trace.tennessee.edu/utk_graddiss/index.68.html', 'https://trace.tennessee.edu/utk_graddiss/index.69.html',
'https://trace.tennessee.edu/utk_graddiss/index.70.html', 'https://trace.tennessee.edu/utk_graddiss/index.71.html', 'https://trace.tennessee.edu/utk_graddiss/index.72.html', 'https://trace.tennessee.edu/utk_graddiss/index.73.html', 'https://trace.tennessee.edu/utk_graddiss/index.74.html', 'https://trace.tennessee.edu/utk_graddiss/index.75.html', 'https://trace.tennessee.edu/utk_graddiss/index.76.html', 'https://trace.tennessee.edu/utk_graddiss/index.77.html', 'https://trace.tennessee.edu/utk_graddiss/index.78.html', 'https://trace.tennessee.edu/utk_graddiss/index.79.html', 'https://trace.tennessee.edu/utk_graddiss/index.80.html', 'https://trace.tennessee.edu/utk_graddiss/index.81.html', 'https://trace.tennessee.edu/utk_graddiss/index.82.html', 'https://trace.tennessee.edu/utk_graddiss/index.83.html', 'https://trace.tennessee.edu/utk_graddiss/index.84.html', 'https://trace.tennessee.edu/utk_graddiss/index.85.html', 'https://trace.tennessee.edu/utk_graddiss/index.86.html', 'https://trace.tennessee.edu/utk_graddiss/index.87.html', 'https://trace.tennessee.edu/utk_graddiss/index.88.html', 'https://trace.tennessee.edu/utk_graddiss/index.89.html', 'https://trace.tennessee.edu/utk_graddiss/index.90.html', 'https://trace.tennessee.edu/utk_graddiss/index.91.html', 'https://trace.tennessee.edu/utk_graddiss/index.92.html', 'https://trace.tennessee.edu/utk_graddiss/index.93.html', 'https://trace.tennessee.edu/utk_graddiss/index.94.html', 'https://trace.tennessee.edu/utk_graddiss/index.95.html', 'https://trace.tennessee.edu/utk_graddiss/index.96.html', 'https://trace.tennessee.edu/utk_graddiss/index.97.html', 'https://trace.tennessee.edu/utk_graddiss/index.98.html', 'https://trace.tennessee.edu/utk_graddiss/index.99.html', 'https://trace.tennessee.edu/utk_graddiss/index.100.html', 'https://trace.tennessee.edu/utk_graddiss/index.101.html', 'https://trace.tennessee.edu/utk_graddiss/index.102.html', 'https://trace.tennessee.edu/utk_graddiss/index.103.html',
'https://trace.tennessee.edu/utk_graddiss/index.104.html', 'https://trace.tennessee.edu/utk_graddiss/index.105.html', 'https://trace.tennessee.edu/utk_graddiss/index.106.html', 'https://trace.tennessee.edu/utk_graddiss/index.107.html', 'https://trace.tennessee.edu/utk_graddiss/index.108.html', 'https://trace.tennessee.edu/utk_graddiss/index.109.html', 'https://trace.tennessee.edu/utk_graddiss/index.110.html', 'https://trace.tennessee.edu/utk_graddiss/index.111.html', 'https://trace.tennessee.edu/utk_graddiss/index.112.html', 'https://trace.tennessee.edu/utk_graddiss/index.113.html', 'https://trace.tennessee.edu/utk_graddiss/index.114.html', 'https://trace.tennessee.edu/utk_graddiss/index.115.html', 'https://trace.tennessee.edu/utk_graddiss/index.116.html', 'https://trace.tennessee.edu/utk_graddiss/index.117.html']
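For comparison, a list comprehension is another common way to build such a list in Python. The sketch below produces the URLs for pages 2 through 117 in one expression, with no preallocation and no index arithmetic. The name urls is a hypothetical one chosen so as not to overwrite url_list:

```python
urls = [
    f"https://trace.tennessee.edu/utk_graddiss/index.{i}.html"
    for i in range(2, 118)
]

print(len(urls))  # 116
print(urls[0])    # https://trace.tennessee.edu/utk_graddiss/index.2.html
print(urls[-1])   # https://trace.tennessee.edu/utk_graddiss/index.117.html
```

Checking the length and both ends, as done here, is usually easier than reading through the full printed list.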