Getting the full data

Author

Marie-Hélène Burle

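# Imports needed by the function (pl is assumed to be the usual alias for polars)
import requests
import polars as pl
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def get_dissertation_data(BASE_URL):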
    # Get main page and extract dissertation URLs
    main_page = requests.get(BASE_URL)
    soup = BeautifulSoup(main_page.content, "html.parser")

    # Extract individual dissertation links
    links = [urljoin(BASE_URL, a["href"])
             for a in soup.select(".article-listing a")]

    data = []

    for link in links[:10]:  # Reduced to 10 for testing; remove slice for full 100
        try:
            # Get individual dissertation page
            page = requests.get(link)
            page_soup = BeautifulSoup(page.content, "html.parser")

            # Extract required fields
            date = page_soup.select_one("#publication_date p").text.strip()
            major = page_soup.select_one("#department p").text.strip()
            advisor = page_soup.select_one("#advisor1 p").text.strip()

            data.append({
                "Date": date,
                "Major": major,
                "Advisor": advisor
            })

        except Exception as e:
            # Report the failing link and keep going with the next one
            print(f"Error processing {link}: {e}")

    return pl.DataFrame(data)

We can verify that it still works. BASE_URL must already be defined in the session; otherwise the call fails with NameError: name 'BASE_URL' is not defined:

df = get_dissertation_data(BASE_URL)
print(df)

Note that we now have to pass the argument BASE_URL to the function.

Let’s look at the range function to understand how it works:

for i in range(5):
    print(i)
0
1
2
3
4

range(5) is the same as range(0, 5). It starts at 0 (Python counts from 0, and the left boundary is included) and ends at 4, because the right boundary (5 here) is excluded.
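
A quick way to check this equivalence, as a minimal sketch, is to turn both range objects into lists and compare them (the variable name same is only for illustration):

```python
# range objects are lazy; list() materializes their values
print(list(range(5)))
print(list(range(0, 5)))

# Both forms produce the same sequence of numbers
same = list(range(5)) == list(range(0, 5))
print(same)
```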

So range(116) would go from 0 to 115. You could verify it with:

for i in range(116):
    print(i)

Your turn:

  • We want numbers from 2 to 117, so what arguments do we need to pass to the range function?
  • How can you test it?

Applied to the series of webpages, that would be:

for i in range(2, 118):
    print(f"https://trace.tennessee.edu/utk_graddiss/index.{i}.html")
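
Printing every value and reading through the output is one way to test a range. A less tedious check, sketched here with illustrative variable names, looks only at the first value, the last value, and the number of values:

```python
pages = range(2, 118)

# range objects support indexing and len() without building a list
first = pages[0]
last = pages[-1]
count = len(pages)

print(first, last, count)
```

If the arguments are right, the first value is 2, the last is 117, and there are 116 values in total.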

This is good, so let’s create a list with those webpages.

First, we initialize a list of the proper length, filled with None placeholders. Preallocating avoids growing the list at each iteration of the loop, although for a list of 116 elements the difference is negligible:

url_list = [None] * 116

Now we can fill in the list with a loop. The page numbers start at 2 but list indices start at 0, so we subtract 2 from i to get the position in the list (using url_list[i] directly would leave the first two elements empty and raise an IndexError once i reaches 116):

for i in range(2, 118):
    url_list[i - 2] = f"https://trace.tennessee.edu/utk_graddiss/index.{i}.html"

Let’s check the first two elements, the last element, and the length of our list to make sure that all is good:

print(url_list[:2])
print(url_list[-1])
print(len(url_list))
['https://trace.tennessee.edu/utk_graddiss/index.2.html', 'https://trace.tennessee.edu/utk_graddiss/index.3.html']
https://trace.tennessee.edu/utk_graddiss/index.117.html
116
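
The same list of URLs can also be built in a single step with a list comprehension, which avoids both the preallocation and the index arithmetic. This is only a sketch of an alternative: the names url_template and url_list_alt are made up for illustration and are not part of the original code.

```python
# Pattern of the listing pages; {} is replaced by the page number
url_template = "https://trace.tennessee.edu/utk_graddiss/index.{}.html"

# One URL per page number from 2 to 117 (the right boundary 118 is excluded)
url_list_alt = [url_template.format(i) for i in range(2, 118)]

print(len(url_list_alt))
print(url_list_alt[0])
print(url_list_alt[-1])
```

Because the comprehension appends elements in order, there is no index to compute and therefore no risk of writing past the end of the list.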