Getting the data

Author

Marie-Hélène Burle

In this section, we will download the pdf of a book from a URL and load it into Python.

The text

Wyrd Sisters, the sixth Discworld novel by Terry Pratchett, published in 1988, contains countless references to Macbeth (including, obviously, the title), other Shakespeare plays, the Marx Brothers, Charlie Chaplin, and Laurel and Hardy.

The book is available as a pdf at this URL and this is the text we will use for this course.

 

Art by Josh Kirby used for the cover of Wyrd Sisters

Packages needed

First off, we need to load two of the packages that you installed in the previous section:

  • Requests: this package sends requests to websites to download information. We will use it to download the pdf.
  • PyMuPDF: this package will allow us to extract the content from the pdf.

Let’s load the packages into our session to make them available:

import requests
import pymupdf
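
If an import fails with a ModuleNotFoundError, the package is most likely missing from the active environment. As a minimal sketch (the helper name missing_packages is ours, not part of either package), we can check which modules Python can find before importing them. Older releases of PyMuPDF were imported under the name fitz, so the name to check may depend on the installed version:

```python
import importlib.util

def missing_packages(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# The two packages used in this section
missing = missing_packages(["requests", "pymupdf"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All packages are available")
```

Any package reported as missing needs to be installed in the environment from which Python is launched.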

Download the data

First, let’s create a string with the URL of the online pdf:

url = "https://funnyengwish.wordpress.com/wp-content/uploads/2017/05/pratchett_terry_wyrd_sisters_-_royallib_ru.pdf"

Now we can send a request to that URL to download the data and create a response object:

response = requests.get(url)

Let’s print the value of our response to ensure that it was successful:

print(response)
<Response [200]>

On the list of HTTP status codes, you can see that 200 means OK. So our request was successful.
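
Python's standard library knows these status codes too. The sketch below uses http.HTTPStatus to turn a numeric code into its standard phrase; describe_status is a hypothetical helper name. With a real response, the code is available as response.status_code, and response.raise_for_status() raises an exception for 4xx and 5xx codes.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a status code together with its standard reason phrase."""
    status = HTTPStatus(code)
    return f"{status.value} {status.phrase}"

print(describe_status(200))  # 200 OK
print(describe_status(404))  # 404 Not Found
```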

Then we take the raw bytes of the response and open them as a pdf document:

data = response.content
doc = pymupdf.Document(stream=data)
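
A 200 status only tells us that the server answered; it does not guarantee that what we received is a pdf. A pdf file normally starts with the bytes %PDF-, so a quick sanity check before opening the data could look like the sketch below. looks_like_pdf is a hypothetical helper, and the sample byte strings stand in for response.content:

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes start with the pdf header marker."""
    return data.startswith(b"%PDF-")

# Made-up sample values standing in for response.content
print(looks_like_pdf(b"%PDF-1.4\n..."))            # True
print(looks_like_pdf(b"<html>Not found</html>"))   # False
```

This is a deliberately strict check: some pdf readers tolerate a few stray bytes before the header, which this sketch would reject.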

Let’s explore this doc object that we created.

It is a Document object from the pymupdf package:

type(doc)

The first element corresponds to the first page of the pdf:

doc[0]

Remember that indexing in Python starts at 0.

type(doc[0])
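
To make zero-based indexing concrete, here is a small sketch with a plain list standing in for the pages of a document (the list content is made up for illustration):

```python
pages = ["first page", "second page", "third page"]

print(pages[0])               # first page
print(pages[len(pages) - 1])  # third page (the last index is the length minus 1)
print(pages[-1])              # third page (negative indices count from the end)
```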

The pdf has 139 pages:

len(doc)

We can get the text of the first page with the get_text method. Let’s create a string that we call page1 with this text:

page1 = doc[0].get_text()

We can now print the text of the first page of the pdf:

print(page1)
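
To get the text of the whole book rather than a single page, we can loop over the pages and join their text. A Document can be iterated over page by page, so the sketch below should work when given doc; to keep it runnable on its own, it is demonstrated here with a stand-in class. Both all_text and FakePage are hypothetical names introduced for this example:

```python
def all_text(pages):
    """Join the text of every page, in order, separated by newlines."""
    return "\n".join(page.get_text() for page in pages)

class FakePage:
    """Stand-in for a pdf page: it only provides get_text()."""
    def __init__(self, text):
        self.text = text

    def get_text(self):
        return self.text

# Made-up pages standing in for the pages of doc
sample = [FakePage("Page one."), FakePage("Page two.")]
print(all_text(sample))
```

With the real document, this would be called as all_text(doc).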