# Load packages
import requests
import pymupdf
# Download the data
url = "https://funnyengwish.wordpress.com/wp-content/uploads/2017/05/pratchett_terry_wyrd_sisters_-_royallib_ru.pdf"
response = requests.get(url)
# Extract data from pdf
data = response.content
doc = pymupdf.Document(stream=data)
# Create text from first pdf page
page1 = doc[0].get_text()
Text processing
In this section, we will use the TextBlob package for part-of-speech tagging and basic tokenization.
TextBlob
TextBlob is the NLP package that we will use in this course for tagging, tokenization, normalization, and sentiment analysis.
We first need to load it in our session:
from textblob import TextBlob
Before we can use TextBlob on our text, we need to convert the page1 string into a TextBlob object:
text = TextBlob(page1)
type(text)
textblob.blob.TextBlob
Part of speech tagging
Part-of-speech (POS) tagging assigns a part-of-speech tag to each word of a text.
You can do this simply by using the tags property on a TextBlob object: text.tags. Because there are a lot of words in the first pdf page, printing it in full would create a very long output.
The result is a list:
type(text.tags)
list
And each element of the list is a tuple:
type(text.tags[0])
tuple
We don’t have to print the full list. Let’s only print the first 20 tuples:
text.tags[:20]
[('Terry', 'NNP'),
('Pratchett', 'NNP'),
('Wyrd', 'NNP'),
('Sisters', 'NNP'),
('Starring', 'VBG'),
('Three', 'NNP'),
('Witches', 'NNP'),
('also', 'RB'),
('kings', 'NNS'),
('daggers', 'NNS'),
('crowns', 'NNS'),
('storms', 'NNS'),
('dwarfs', 'NN'),
('cats', 'NNS'),
('ghosts', 'NNS'),
('spectres', 'NNS'),
('apes', 'NNS'),
('bandits', 'NNS'),
('demons', 'NNS'),
('forests', 'NNS')]
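Because the result is a plain list of (word, tag) tuples, ordinary Python tools apply to it. As a minimal sketch, the snippet below hard-codes the 20 tuples printed above (so that it runs without TextBlob) and tallies the tags with collections.Counter. The variable names are only illustrative; with a real TextBlob object you would start from text.tags[:20] instead.

```python
from collections import Counter

# The first 20 (word, tag) tuples, copied from the output above.
tags = [
    ("Terry", "NNP"), ("Pratchett", "NNP"), ("Wyrd", "NNP"), ("Sisters", "NNP"),
    ("Starring", "VBG"), ("Three", "NNP"), ("Witches", "NNP"), ("also", "RB"),
    ("kings", "NNS"), ("daggers", "NNS"), ("crowns", "NNS"), ("storms", "NNS"),
    ("dwarfs", "NN"), ("cats", "NNS"), ("ghosts", "NNS"), ("spectres", "NNS"),
    ("apes", "NNS"), ("bandits", "NNS"), ("demons", "NNS"), ("forests", "NNS"),
]

# Count how often each tag occurs (tuple unpacking drops the word).
tag_counts = Counter(tag for _, tag in tags)

# Keep only the words tagged as plural common nouns (NNS).
plural_nouns = [word for word, tag in tags if tag == "NNS"]

print(tag_counts.most_common())
print(plural_nouns)
```

Note that 'dwarfs' is tagged NN rather than NNS in the output above, so it is not part of plural_nouns: automatic taggers are not always right.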
Noun phrases extraction
Noun phrases can be extracted with the noun_phrases property:
print(text.noun_phrases)
['terry pratchett wyrd sisters', 'starring', 'witches', 'drivers alarums', 'lightning', 'inefficient assassin', 'thunder', 'elemental storm', 'furze bushes', "weasel 's eye", 'eldritch voice', 'ordinary tones', 'fathomless deeps', 'space swims', 'star turtle', "a'tuin", 'giant elephants', 'discworld', 'tiny sun', 'moon spin', 'exactly', 'possibly', 'creator', 'usual business', 'axial inclination', 'rotational velocities', 'vicious games', 'achieve transcendence', 'straight', 'oblivion', "god 's idea", 'snakes', 'ladders', 'magic', 'discworld', '– magic', 'ramtop', 'frozen lands', 'hub', 'lengthy archipelago', 'warm seas', 'rim', 'raw', 'magic crackles', 'ramtops', 'ramtops', 'rocks', 'big chance', 'useful work', 'small oak trees', 'big climates', 'good storm', 'effective projection']
The output is a WordList object:
type(text.noun_phrases)
textblob.blob.WordList
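A WordList behaves much like a regular Python list, so the usual list idioms should carry over. The sketch below uses a plain list holding a few of the noun phrases printed above (an excerpt, not the full output) to find the phrases that occur more than once and to drop duplicates while keeping the original order. The variable names are illustrative.

```python
from collections import Counter

# A few noun phrases copied from the output above (an excerpt only).
noun_phrases = [
    "discworld", "magic", "discworld", "ramtop", "ramtops", "ramtops", "rocks",
]

# Count each phrase and keep those that appear more than once.
phrase_counts = Counter(noun_phrases)
repeated = [phrase for phrase, n in phrase_counts.items() if n > 1]

# Remove duplicates while preserving first-seen order
# (dict keys keep insertion order in Python 3.7+).
unique_in_order = list(dict.fromkeys(noun_phrases))

print(repeated)
print(unique_in_order)
```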
Tokenization
Words
TextBlob makes it easy to extract words with the words attribute:
print(text.words)
['Terry', 'Pratchett', 'Wyrd', 'Sisters', 'Starring', 'Three', 'Witches', 'also', 'kings', 'daggers', 'crowns', 'storms', 'dwarfs', 'cats', 'ghosts', 'spectres', 'apes', 'bandits', 'demons', 'forests', 'heirs', 'jesters', 'tortures', 'trolls', 'turntables', 'general', 'rejoicing', 'and', 'drivers', 'alarums', 'The', 'wind', 'howled', 'Lightning', 'stabbed', 'at', 'the', 'earth', 'erratically', 'like', 'an', 'inefficient', 'assassin', 'Thunder', 'rolled', 'back', 'and', 'forth', 'across', 'the', 'dark', 'rain-lashed', 'hills', 'The', 'night', 'was', 'as', 'black', 'as', 'the', 'inside', 'of', 'a', 'cat', 'It', 'was', 'the', 'kind', 'of', 'night', 'you', 'could', 'believe', 'on', 'which', 'gods', 'moved', 'men', 'as', 'though', 'they', 'were', 'pawns', 'on', 'the', 'chessboard', 'of', 'fate', 'In', 'the', 'middle', 'of', 'this', 'elemental', 'storm', 'a', 'fire', 'gleamed', 'among', 'the', 'dripping', 'furze', 'bushes', 'like', 'the', 'madness', 'in', 'a', 'weasel', "'s", 'eye', 'It', 'illuminated', 'three', 'hunched', 'figures', 'As', 'the', 'cauldron', 'bubbled', 'an', 'eldritch', 'voice', 'shrieked', "'When", 'shall', 'we', 'three', 'meet', 'again', 'There', 'was', 'a', 'pause', 'Finally', 'another', 'voice', 'said', 'in', 'far', 'more', 'ordinary', 'tones', "'Well", 'I', 'can', 'do', 'next', 'Tuesday', 'Through', 'the', 'fathomless', 'deeps', 'of', 'space', 'swims', 'the', 'star', 'turtle', 'Great', "A'Tuin", 'bearing', 'on', 'its', 'back', 'the', 'four', 'giant', 'elephants', 'who', 'carry', 'on', 'their', 'shoulders', 'the', 'mass', 'of', 'the', 'Discworld', 'A', 'tiny', 'sun', 'and', 'moon', 'spin', 'around', 'them', 'on', 'a', 'complicated', 'orbit', 'to', 'induce', 'seasons', 'so', 'probably', 'nowhere', 'else', 'in', 'the', 'multiverse', 'is', 'it', 'sometimes', 'necessary', 'for', 'an', 'elephant', 'to', 'cock', 'a', 'leg', 'to', 'allow', 'the', 'sun', 'to', 'go', 'past', 'Exactly', 'why', 'this', 'should', 'be', 'may', 'never', 'be', 'known', 'Possibly', 
'the', 'Creator', 'of', 'the', 'universe', 'got', 'bored', 'with', 'all', 'the', 'usual', 'business', 'of', 'axial', 'inclination', 'albedos', 'and', 'rotational', 'velocities', 'and', 'decided', 'to', 'have', 'a', 'bit', 'of', 'fun', 'for', 'once', 'It', 'would', 'be', 'a', 'pretty', 'good', 'bet', 'that', 'the', 'gods', 'of', 'a', 'world', 'like', 'this', 'probably', 'do', 'not', 'play', 'chess', 'and', 'indeed', 'this', 'is', 'the', 'case', 'In', 'fact', 'no', 'gods', 'anywhere', 'play', 'chess', 'They', 'have', "n't", 'got', 'the', 'imagination', 'Gods', 'prefer', 'simple', 'vicious', 'games', 'where', 'you', 'Do', 'Not', 'Achieve', 'Transcendence', 'but', 'Go', 'Straight', 'To', 'Oblivion', 'a', 'key', 'to', 'the', 'understanding', 'of', 'all', 'religion', 'is', 'that', 'a', 'god', "'s", 'idea', 'of', 'amusement', 'is', 'Snakes', 'and', 'Ladders', 'with', 'greased', 'rungs', 'Magic', 'glues', 'the', 'Discworld', 'together', '–', 'magic', 'generated', 'by', 'the', 'turning', 'of', 'the', 'world', 'itself', 'magic', 'wound', 'like', 'silk', 'out', 'of', 'the', 'underlying', 'structure', 'of', 'existence', 'to', 'suture', 'the', 'wounds', 'of', 'reality', 'A', 'lot', 'of', 'it', 'ends', 'up', 'in', 'the', 'Ramtop', 'Mountains', 'which', 'stretch', 'from', 'the', 'frozen', 'lands', 'near', 'the', 'Hub', 'all', 'the', 'way', 'via', 'a', 'lengthy', 'archipelago', 'to', 'the', 'warm', 'seas', 'which', 'flow', 'endlessly', 'into', 'space', 'over', 'the', 'Rim', 'Raw', 'magic', 'crackles', 'invisibly', 'from', 'peak', 'to', 'peak', 'and', 'earths', 'itself', 'in', 'the', 'mountains', 'It', 'is', 'the', 'Ramtops', 'that', 'supply', 'the', 'world', 'with', 'most', 'of', 'its', 'witches', 'and', 'wizards', 'In', 'the', 'Ramtops', 'the', 'leaves', 'on', 'the', 'trees', 'move', 'even', 'when', 'there', 'is', 'no', 'breeze', 'Rocks', 'go', 'for', 'a', 'stroll', 'of', 'an', 'evening', 'Even', 'the', 'land', 'at', 'times', 'seems', 'alive', 'At', 'times', 'so', 'does', 'the', 
'sky', 'The', 'storm', 'was', 'really', 'giving', 'it', 'everything', 'it', 'had', 'This', 'was', 'its', 'big', 'chance', 'It', 'had', 'spent', 'years', 'hanging', 'around', 'the', 'provinces', 'putting', 'in', 'some', 'useful', 'work', 'as', 'a', 'squall', 'building', 'up', 'experience', 'making', 'contacts', 'occasionally', 'leaping', 'out', 'on', 'unsuspecting', 'shepherds', 'or', 'blasting', 'quite', 'small', 'oak', 'trees', 'Now', 'an', 'opening', 'in', 'the', 'weather', 'had', 'given', 'it', 'an', 'opportunity', 'to', 'strut', 'its', 'hour', 'and', 'it', 'was', 'building', 'up', 'its', 'role', 'in', 'the', 'hope', 'of', 'being', 'spotted', 'by', 'one', 'of', 'the', 'big', 'climates', 'It', 'was', 'a', 'good', 'storm', 'There', 'was', 'quite', 'effective', 'projection', 'and', 'passion', 'there', 'and', 'critics', 'agreed', 'that', 'if', 'it', 'would', 'only', 'learn', 'to', 'control', 'its', 'thunder', 'it', 'would', 'be', 'in', 'years', 'to', 'come', 'a', 'storm', 'to', 'watch', 'The', 'woods', 'roared', 'their', 'applause', 'and', 'were', 'full', 'of', 'mists', 'and', 'flying', 'leaves']
Your turn:
How many words are there in the first pdf page of Wyrd Sisters?
Sentences
Extracting sentences is just as easy with the sentences attribute.
Let’s extract the first 10 sentences:
text.sentences[:10]
[Sentence("
Terry Pratchett
Wyrd Sisters
(Starring Three Witches, also kings, daggers, crowns, storms, dwarfs, cats, ghosts, spectres,
apes, bandits, demons, forests, heirs, jesters, tortures, trolls, turntables, general rejoicing and
drivers alarums.)"),
Sentence("The wind howled."),
Sentence("Lightning stabbed at the earth erratically, like an inefficient assassin."),
Sentence("Thunder rolled back and forth across the dark, rain-lashed hills."),
Sentence("The night was as black as the inside of a cat."),
Sentence("It was the kind of night, you could believe, on
which gods moved men as though they were pawns on the chessboard of fate."),
Sentence("In the middle of this
elemental storm a fire gleamed among the dripping furze bushes like the madness in a weasel's eye."),
Sentence("It illuminated three hunched figures."),
Sentence("As the cauldron bubbled an eldritch voice shrieked: 'When
shall we three meet again?'"),
Sentence("There was a pause.")]
This output is hard to read. We can make it a lot more readable by printing the sentences one at a time, separated by blank lines:
for s in text.sentences[:10]:
    print(s)
    print("\n")
Terry Pratchett
Wyrd Sisters
(Starring Three Witches, also kings, daggers, crowns, storms, dwarfs, cats, ghosts, spectres,
apes, bandits, demons, forests, heirs, jesters, tortures, trolls, turntables, general rejoicing and
drivers alarums.)
The wind howled.
Lightning stabbed at the earth erratically, like an inefficient assassin.
Thunder rolled back and forth across the dark, rain-lashed hills.
The night was as black as the inside of a cat.
It was the kind of night, you could believe, on
which gods moved men as though they were pawns on the chessboard of fate.
In the middle of this
elemental storm a fire gleamed among the dripping furze bushes like the madness in a weasel's eye.
It illuminated three hunched figures.
As the cauldron bubbled an eldritch voice shrieked: 'When
shall we three meet again?'
There was a pause.
In Python strings (as in many other languages), "\n" represents a new line.
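One detail worth knowing: print already ends its output with a newline, so print("\n") writes two newline characters, while print() with no argument writes one. The sketch below captures the printed text to make the difference visible; the helper captured and the two small functions are hypothetical names introduced only for this illustration.

```python
import io
from contextlib import redirect_stdout


def captured(func):
    """Run func() and return everything it printed as a single string."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        func()
    return buffer.getvalue()


def with_newline_argument():
    print("The wind howled.")
    print("\n")  # writes "\n" plus print's own trailing newline


def with_empty_print():
    print("The wind howled.")
    print()  # writes only print's trailing newline


out_a = captured(with_newline_argument)
out_b = captured(with_empty_print)

print(repr(out_a))  # 'The wind howled.\n\n\n'
print(repr(out_b))  # 'The wind howled.\n\n'
```

Either form separates the sentences visually; print() is the one to use if you want exactly one blank line between them.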
Or you could add lines of hyphens between the sentences:
for s in text.sentences[:10]:
    print(s)
    print("-" * 100)
Terry Pratchett
Wyrd Sisters
(Starring Three Witches, also kings, daggers, crowns, storms, dwarfs, cats, ghosts, spectres,
apes, bandits, demons, forests, heirs, jesters, tortures, trolls, turntables, general rejoicing and
drivers alarums.)
----------------------------------------------------------------------------------------------------
The wind howled.
----------------------------------------------------------------------------------------------------
Lightning stabbed at the earth erratically, like an inefficient assassin.
----------------------------------------------------------------------------------------------------
Thunder rolled back and forth across the dark, rain-lashed hills.
----------------------------------------------------------------------------------------------------
The night was as black as the inside of a cat.
----------------------------------------------------------------------------------------------------
It was the kind of night, you could believe, on
which gods moved men as though they were pawns on the chessboard of fate.
----------------------------------------------------------------------------------------------------
In the middle of this
elemental storm a fire gleamed among the dripping furze bushes like the madness in a weasel's eye.
----------------------------------------------------------------------------------------------------
It illuminated three hunched figures.
----------------------------------------------------------------------------------------------------
As the cauldron bubbled an eldritch voice shrieked: 'When
shall we three meet again?'
----------------------------------------------------------------------------------------------------
There was a pause.
----------------------------------------------------------------------------------------------------
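Some sentences above still contain line breaks inherited from the PDF layout (for example after "on" and after "'When"). A minimal sketch of one way to tidy this up, using plain strings copied from the output above: collapse every run of whitespace into a single space, then join the sentences with the separator line. The function name normalize_whitespace is an assumption made for this example; with real Sentence objects you would probably convert each one with str() first.

```python
# Two sentences copied from the output above, including the line breaks
# inherited from the PDF layout.
sentences = [
    "It was the kind of night, you could believe, on\n"
    "which gods moved men as though they were pawns on the chessboard of fate.",
    "As the cauldron bubbled an eldritch voice shrieked: 'When\n"
    "shall we three meet again?'",
]


def normalize_whitespace(s):
    """Collapse any run of whitespace (including newlines) into one space."""
    return " ".join(s.split())


cleaned = [normalize_whitespace(s) for s in sentences]

# Build the separator once and place it between (not after) the sentences.
separator = "-" * 100
print(f"\n{separator}\n".join(cleaned))
```

Unlike the loop above, str.join puts the separator only between sentences, so there is no trailing line of hyphens after the last one.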
Your turn:
- What is the type of text.sentences?
- Could you print just the 5th sentence?
- Just the last sentence?
Word counts
We already saw that we can extract words with the words attribute. We can now call the count method on the result to get the frequency of a specific word:
text.words.count("gods")
4
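The word list printed earlier contains 'gods' three times in lower case and 'Gods' once, so a result of 4 suggests that the match ignores case (as far as we know, the count method of a WordList takes a case_sensitive argument that defaults to False). The sketch below reproduces both behaviours on a plain Python list holding a short excerpt of the words, and uses collections.Counter to get the frequency of every word at once. It does not need TextBlob, and the variable names are illustrative.

```python
from collections import Counter

# A short excerpt of the word list printed earlier (not the full page).
words = [
    "no", "gods", "anywhere", "play", "chess", "They", "have", "n't",
    "got", "the", "imagination", "Gods", "prefer", "simple", "vicious", "games",
]

# list.count is case-sensitive: "Gods" is not counted.
case_sensitive = words.count("gods")

# Lower-casing each word first makes the comparison case-insensitive.
case_insensitive = sum(w.lower() == "gods" for w in words)

# Frequencies of all words at once, ignoring case.
frequencies = Counter(w.lower() for w in words)

print(case_sensitive, case_insensitive)  # 1 2
print(frequencies["gods"])               # 2
```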