import requests
import pymupdf
Getting the data
In this section, we will import the pdf of a book from an online URL into Python.
The text
Wyrd Sisters, the sixth Discworld novel by Terry Pratchett, published in 1988, has countless references to Macbeth (including, obviously, the title), other Shakespeare plays, the Marx Brothers, Charlie Chaplin, and Laurel and Hardy.
The book is available as a pdf at this URL and this is the text we will use for this course.
Packages needed
First off, we need to load two of the packages that you installed in the previous section:
- Requests: this package sends requests to websites to download information. We will use it to download the pdf.
- PyMuPDF: this package will allow us to extract the content from the pdf.
Let’s load the packages into our session to make them available:
import requests
import pymupdf
Download the data
First, let’s create a string with the URL of the online pdf:
url = "https://funnyengwish.wordpress.com/wp-content/uploads/2017/05/pratchett_terry_wyrd_sisters_-_royallib_ru.pdf"
Now we can send a request to that URL to download the data and create a response object:
response = requests.get(url)
Let’s print the value of our response to ensure that it was successful:
print(response)
<Response [200]>
On the list of HTTP status codes, you can see that 200 means OK, so our request was successful.
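As a small aside, the meaning of a status code can also be looked up from Python itself. The sketch below uses the standard library's http.HTTPStatus; the helper name describe_status is hypothetical and only serves as an illustration.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a status code together with its standard reason phrase."""
    status = HTTPStatus(code)  # raises ValueError for unknown codes
    return f"{status.value} {status.phrase}"


print(describe_status(200))  # 200 OK
print(describe_status(404))  # 404 Not Found
```

With the response object created above, the same helper could be called as describe_status(response.status_code).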
Then we read the downloaded content into a PyMuPDF document, from which we will extract the text of the pdf:
data = response.content
doc = pymupdf.Document(stream=data)
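Before handing the downloaded bytes to PyMuPDF, it can be worth checking that the server really returned a pdf and not, say, an HTML error page. PDF files normally begin with the marker %PDF-, so a minimal check might look like the sketch below. The function name looks_like_pdf is hypothetical, and the sample byte strings are made up for the demonstration.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes begin with the usual PDF header marker."""
    return data.startswith(b"%PDF-")


# Made-up sample inputs, standing in for downloaded content
print(looks_like_pdf(b"%PDF-1.7\n%..."))    # True
print(looks_like_pdf(b"<!DOCTYPE html>"))  # False
```

In this section, the check would be applied to the data variable (the content of the response).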
Let’s explore this doc object that we created.
It is a Document object from the pymupdf package:
type(doc)
pymupdf.Document
The first element corresponds to the first page of the pdf:
doc[0]
page 0 of <memory, doc# 1>
Remember that indexing in Python starts at 0.
type(doc[0])
pymupdf.Page
The pdf has 139 pages:
len(doc)
139
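Because indexing starts at 0, the pages of a 139-page document have indices 0 to 138. The following sketch converts a page number as a human would count it (starting at 1) into the matching index. The helper page_index is hypothetical and not part of PyMuPDF.

```python
def page_index(page_number: int, page_count: int) -> int:
    """Convert a 1-based page number into a 0-based index."""
    if not 1 <= page_number <= page_count:
        raise ValueError(f"page_number must be between 1 and {page_count}")
    return page_number - 1


print(page_index(1, 139))    # 0
print(page_index(139, 139))  # 138
```

With the document loaded above, the last page could then be reached with doc[page_index(len(doc), len(doc))].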
We can get the text of the first page with the get_text method. Let’s create a string that we call page1 with this text:
page1 = doc[0].get_text()
We can now print the text of the first page of the pdf:
print(page1)
Terry Pratchett
Wyrd Sisters
(Starring Three Witches, also kings, daggers, crowns, storms, dwarfs, cats, ghosts, spectres,
apes, bandits, demons, forests, heirs, jesters, tortures, trolls, turntables, general rejoicing and
drivers alarums.)
The wind howled. Lightning stabbed at the earth erratically, like an inefficient assassin.
Thunder rolled back and forth across the dark, rain-lashed hills.
The night was as black as the inside of a cat. It was the kind of night, you could believe, on
which gods moved men as though they were pawns on the chessboard of fate. In the middle of this
elemental storm a fire gleamed among the dripping furze bushes like the madness in a weasel's eye.
It illuminated three hunched figures. As the cauldron bubbled an eldritch voice shrieked: 'When
shall we three meet again?'
There was a pause.
Finally another voice said, in far more ordinary tones: 'Well, I can do next Tuesday.'
Through the fathomless deeps of space swims the star turtle Great A'Tuin, bearing on its back
the four giant elephants who carry on their shoulders the mass of the Discworld. A tiny sun and
moon spin around them, on a complicated orbit to induce seasons, so probably nowhere else in the
multiverse is it sometimes necessary for an elephant to cock a leg to allow the sun to go past.
Exactly why this should be may never be known. Possibly the Creator of the universe got
bored with all the usual business of axial inclination, albedos and rotational velocities, and decided
to have a bit of fun for once.
It would be a pretty good bet that the gods of a world like this probably do not play chess and
indeed this is the case. In fact no gods anywhere play chess. They haven't got the imagination. Gods
prefer simple, vicious games, where you Do Not Achieve Transcendence but Go Straight To
Oblivion; a key to the understanding of all religion is that a god's idea of amusement is Snakes and
Ladders with greased rungs.
Magic glues the Discworld together – magic generated by the turning of the world itself,
magic wound like silk out of the underlying structure of existence to suture the wounds of reality.
A lot of it ends up in the Ramtop Mountains, which stretch from the frozen lands near the Hub
all the way, via a lengthy archipelago, to the warm seas which flow endlessly into space over the
Rim.
Raw magic crackles invisibly from peak to peak and earths itself in the mountains. It is the
Ramtops that supply the world with most of its witches and wizards. In the Ramtops the leaves on
the trees move even when there is no breeze. Rocks go for a stroll of an evening.
Even the land, at times, seems alive . . .
At times, so does the sky.
The storm was really giving it everything it had. This was its big chance. It had spent years
hanging around the provinces, putting in some useful work as a squall, building up experience,
making contacts, occasionally leaping out on unsuspecting shepherds or blasting quite small oak
trees. Now an opening in the weather had given it an opportunity to strut its hour, and it was
building up its role in the hope of being spotted by one of the big climates.
It was a good storm. There was quite effective projection and passion there, and critics
agreed that if it would only learn to control its thunder it would be, in years to come, a storm to
watch.
The woods roared their applause and were full of mists and flying leaves.
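The same get_text method can be applied to every page to collect the text of the whole book. The sketch below assumes only that the document can be iterated over page by page and that each page has a get_text method, as is the case for the doc object above. The name extract_all_text and the FakePage stand-in class are hypothetical; the stand-in is there so that the example runs without downloading anything.

```python
def extract_all_text(doc) -> str:
    """Join the text of all pages, with a newline between pages."""
    return "\n".join(page.get_text() for page in doc)


class FakePage:
    """Hypothetical stand-in for a pymupdf page, used for the demo only."""

    def __init__(self, text: str):
        self._text = text

    def get_text(self) -> str:
        return self._text


# A list of fake pages stands in for the real document
fake_doc = [FakePage("Page one.\n"), FakePage("Page two.\n")]
full_text = extract_all_text(fake_doc)
print(full_text)
```

With the real document, the call would be full_text = extract_all_text(doc).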