The book is available as a pdf at this URL; it is the text we will use throughout this course.
Art by Josh Kirby used for the cover of Wyrd Sisters
Packages needed
First off, we need to load two of the packages that you installed in the previous section:
Requests: this package sends requests to websites to download information. We will use it to download the pdf.
PyMuPDF: this package will allow us to extract the content from the pdf.
Let’s load the packages into our session to make them available:
import requests
import pymupdf
Download the data
First, let’s create a string with the URL of the online pdf:
Now we can send a request to that URL to download the data and create a response object:
response = requests.get(url)
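In a script, the same request is often wrapped in a small function that sets a timeout and fails loudly on an HTTP error. The following is only a sketch: the name `download_pdf` and the 30-second timeout are assumptions made for illustration, not part of the course material.

```python
import requests


def download_pdf(url, timeout=30):
    """Download the file at `url` and return its raw bytes.

    `download_pdf` and the default timeout are hypothetical choices
    for this sketch.
    """
    # Send the GET request; give up if the server does not answer in time
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx and 5xx status codes
    response.raise_for_status()
    # The body of the response as bytes
    return response.content
```

Calling `data = download_pdf(url)` would then return the same bytes as `response.content`.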
Let’s print the value of our response to ensure that it was successful:
print(response)
<Response [200]>
On the list of HTTP status codes, you can see that 200 means OK. So our request was successful.
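The meaning of a status code can also be looked up from Python itself, using the `http` module of the standard library. A minimal sketch, in which the helper name `describe_status` is an assumption made for illustration:

```python
from http import HTTPStatus


def describe_status(code):
    """Return the status code followed by its standard reason phrase."""
    status = HTTPStatus(code)
    return f"{code} {status.phrase}"


print(describe_status(200))  # 200 OK
print(describe_status(404))  # 404 Not Found
```

For a response object from requests, the numeric code is available as `response.status_code`.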
Then we load the content of the response into a PyMuPDF document:
data = response.content
doc = pymupdf.Document(stream=data)
Let’s explore this doc object that we created.
It is a Document object from the pymupdf package:
type(doc)
The first element corresponds to the first page of the pdf:
doc[0]
Remember that indexing in Python starts at 0. Let’s check the type of this first page:
type(doc[0])
The pdf has 139 pages:
len(doc)
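Because indexing starts at 0, a sequence of 139 items has valid indices from 0 to 138. The sketch below uses a plain Python list as a stand-in for the document (an assumption made so that the example runs without the pdf):

```python
# A plain list stands in for the pdf document (assumption for illustration)
pages = [f"text of page {n}" for n in range(1, 140)]

n_pages = len(pages)
first_page = pages[0]            # index 0 is the first item
last_page = pages[n_pages - 1]   # the last valid index is len - 1

print(n_pages)     # 139
print(first_page)  # text of page 1
print(last_page)   # text of page 139
```

Asking for `pages[139]` would raise an `IndexError`, since there is no 140th item.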
We can get the text of the first page with the get_text method. Let’s create a string that we call page1 with this text:
page1 = doc[0].get_text()
We can now print the text of the first page of the pdf:
print(page1)