Text processing

Author

Marie-Hélène Burle

In this section, we will use the TextBlob package for part-of-speech tagging, noun phrase extraction, and basic tokenization.

Here is the necessary code from the previous session, stripped to the minimum:

# Load packages
import requests
import pymupdf

# Download the data
url = "https://funnyengwish.wordpress.com/wp-content/uploads/2017/05/pratchett_terry_wyrd_sisters_-_royallib_ru.pdf"
response = requests.get(url)

# Extract data from pdf
data = response.content
doc = pymupdf.Document(stream=data)

# Create text from first pdf page
page1 = doc[0].get_text()

TextBlob

TextBlob is the NLP package that we will use in this course for tagging, tokenization, normalization, and sentiment analysis.

We first need to load it in our session:

from textblob import TextBlob

Before we can use TextBlob on our text, we need to convert the page1 string into a TextBlob object:

text = TextBlob(page1)
type(text)
textblob.blob.TextBlob

Part of speech tagging

Part-of-speech tagging assigns a part-of-speech (POS) tag to each word of a text.

You can do this with the tags property of a TextBlob object: text.tags. Because there are a lot of words in the first PDF page, printing it in full would produce a very long output, so let's look at its structure first.

The result is a list:

type(text.tags)
list

And each element of the list is a tuple:

type(text.tags[0])
tuple

We don’t have to print the full list. Let’s only print the first 20 tuples:

text.tags[:20]
[('Terry', 'NNP'),
 ('Pratchett', 'NNP'),
 ('Wyrd', 'NNP'),
 ('Sisters', 'NNP'),
 ('Starring', 'VBG'),
 ('Three', 'NNP'),
 ('Witches', 'NNP'),
 ('also', 'RB'),
 ('kings', 'NNS'),
 ('daggers', 'NNS'),
 ('crowns', 'NNS'),
 ('storms', 'NNS'),
 ('dwarfs', 'NN'),
 ('cats', 'NNS'),
 ('ghosts', 'NNS'),
 ('spectres', 'NNS'),
 ('apes', 'NNS'),
 ('bandits', 'NNS'),
 ('demons', 'NNS'),
 ('forests', 'NNS')]

These tags come from the Penn Treebank tag set:

Tag   Description
CC    Coordinating conjunction
CD    Cardinal number
DT    Determiner
EX    Existential there
FW    Foreign word
IN    Preposition or subordinating conjunction
JJ    Adjective
JJR   Adjective, comparative
JJS   Adjective, superlative
LS    List item marker
MD    Modal
NN    Noun, singular or mass
NNS   Noun, plural
NNP   Proper noun, singular
NNPS  Proper noun, plural
PDT   Predeterminer
POS   Possessive ending
PRP   Personal pronoun
PRP$  Possessive pronoun
RB    Adverb
RBR   Adverb, comparative
RBS   Adverb, superlative
RP    Particle
SYM   Symbol
TO    to
UH    Interjection
VB    Verb, base form
VBD   Verb, past tense
VBG   Verb, gerund or present participle
VBN   Verb, past participle
VBP   Verb, non-3rd person singular present
VBZ   Verb, 3rd person singular present
WDT   Wh-determiner
WP    Wh-pronoun
WP$   Possessive wh-pronoun
WRB   Wh-adverb
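
Since text.tags is a plain list of (word, tag) tuples, standard Python tools are enough to summarise or filter it. The sketch below is one possible approach. To stay self-contained, it uses a hard-coded sample copied from the first tuples shown above (the name sample_tags is only illustrative), but the same lines should work on text.tags.

```python
from collections import Counter

# Illustrative sample: the first ten tuples of text.tags shown above
sample_tags = [
    ("Terry", "NNP"), ("Pratchett", "NNP"), ("Wyrd", "NNP"), ("Sisters", "NNP"),
    ("Starring", "VBG"), ("Three", "NNP"), ("Witches", "NNP"), ("also", "RB"),
    ("kings", "NNS"), ("daggers", "NNS"),
]

# Count how many times each tag occurs
tag_counts = Counter(tag for _, tag in sample_tags)
print(tag_counts)

# Keep only the words tagged as plural nouns
plural_nouns = [word for word, tag in sample_tags if tag == "NNS"]
print(plural_nouns)  # ['kings', 'daggers']
```

Unpacking each tuple as word, tag in the loop avoids indexing with [0] and [1] and keeps the filter readable.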

Noun phrase extraction

Noun phrases can be extracted with the noun_phrases property:

print(text.noun_phrases)
['terry pratchett wyrd sisters', 'starring', 'witches', 'drivers alarums', 'lightning', 'inefficient assassin', 'thunder', 'elemental storm', 'furze bushes', "weasel 's eye", 'eldritch voice', 'ordinary tones', 'fathomless deeps', 'space swims', 'star turtle', "a'tuin", 'giant elephants', 'discworld', 'tiny sun', 'moon spin', 'exactly', 'possibly', 'creator', 'usual business', 'axial inclination', 'rotational velocities', 'vicious games', 'achieve transcendence', 'straight', 'oblivion', "god 's idea", 'snakes', 'ladders', 'magic', 'discworld', '– magic', 'ramtop', 'frozen lands', 'hub', 'lengthy archipelago', 'warm seas', 'rim', 'raw', 'magic crackles', 'ramtops', 'ramtops', 'rocks', 'big chance', 'useful work', 'small oak trees', 'big climates', 'good storm', 'effective projection']

The output is a WordList object:

type(text.noun_phrases)
textblob.blob.WordList
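
WordList is a subclass of Python's list, so the usual list operations apply to it. As a rough illustration, the snippet below finds the noun phrases that occur more than once in a few phrases taken from the output above. The variable sample_phrases is a hypothetical stand-in for text.noun_phrases.

```python
from collections import Counter

# Illustrative sample: a few noun phrases taken from the output above
sample_phrases = [
    "magic", "discworld", "– magic", "ramtop", "frozen lands",
    "hub", "ramtops", "ramtops", "rocks",
]

# Count each phrase, then keep those that appear more than once
phrase_counts = Counter(sample_phrases)
repeated = [phrase for phrase, n in phrase_counts.items() if n > 1]
print(repeated)  # ['ramtops']
```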

Tokenization

Words

TextBlob makes it easy to extract words with the words property:

print(text.words)
['Terry', 'Pratchett', 'Wyrd', 'Sisters', 'Starring', 'Three', 'Witches', 'also', 'kings', 'daggers', 'crowns', 'storms', 'dwarfs', 'cats', 'ghosts', 'spectres', 'apes', 'bandits', 'demons', 'forests', 'heirs', 'jesters', 'tortures', 'trolls', 'turntables', 'general', 'rejoicing', 'and', 'drivers', 'alarums', 'The', 'wind', 'howled', 'Lightning', 'stabbed', 'at', 'the', 'earth', 'erratically', 'like', 'an', 'inefficient', 'assassin', 'Thunder', 'rolled', 'back', 'and', 'forth', 'across', 'the', 'dark', 'rain-lashed', 'hills', 'The', 'night', 'was', 'as', 'black', 'as', 'the', 'inside', 'of', 'a', 'cat', 'It', 'was', 'the', 'kind', 'of', 'night', 'you', 'could', 'believe', 'on', 'which', 'gods', 'moved', 'men', 'as', 'though', 'they', 'were', 'pawns', 'on', 'the', 'chessboard', 'of', 'fate', 'In', 'the', 'middle', 'of', 'this', 'elemental', 'storm', 'a', 'fire', 'gleamed', 'among', 'the', 'dripping', 'furze', 'bushes', 'like', 'the', 'madness', 'in', 'a', 'weasel', "'s", 'eye', 'It', 'illuminated', 'three', 'hunched', 'figures', 'As', 'the', 'cauldron', 'bubbled', 'an', 'eldritch', 'voice', 'shrieked', "'When", 'shall', 'we', 'three', 'meet', 'again', 'There', 'was', 'a', 'pause', 'Finally', 'another', 'voice', 'said', 'in', 'far', 'more', 'ordinary', 'tones', "'Well", 'I', 'can', 'do', 'next', 'Tuesday', 'Through', 'the', 'fathomless', 'deeps', 'of', 'space', 'swims', 'the', 'star', 'turtle', 'Great', "A'Tuin", 'bearing', 'on', 'its', 'back', 'the', 'four', 'giant', 'elephants', 'who', 'carry', 'on', 'their', 'shoulders', 'the', 'mass', 'of', 'the', 'Discworld', 'A', 'tiny', 'sun', 'and', 'moon', 'spin', 'around', 'them', 'on', 'a', 'complicated', 'orbit', 'to', 'induce', 'seasons', 'so', 'probably', 'nowhere', 'else', 'in', 'the', 'multiverse', 'is', 'it', 'sometimes', 'necessary', 'for', 'an', 'elephant', 'to', 'cock', 'a', 'leg', 'to', 'allow', 'the', 'sun', 'to', 'go', 'past', 'Exactly', 'why', 'this', 'should', 'be', 'may', 'never', 'be', 'known', 'Possibly', 
'the', 'Creator', 'of', 'the', 'universe', 'got', 'bored', 'with', 'all', 'the', 'usual', 'business', 'of', 'axial', 'inclination', 'albedos', 'and', 'rotational', 'velocities', 'and', 'decided', 'to', 'have', 'a', 'bit', 'of', 'fun', 'for', 'once', 'It', 'would', 'be', 'a', 'pretty', 'good', 'bet', 'that', 'the', 'gods', 'of', 'a', 'world', 'like', 'this', 'probably', 'do', 'not', 'play', 'chess', 'and', 'indeed', 'this', 'is', 'the', 'case', 'In', 'fact', 'no', 'gods', 'anywhere', 'play', 'chess', 'They', 'have', "n't", 'got', 'the', 'imagination', 'Gods', 'prefer', 'simple', 'vicious', 'games', 'where', 'you', 'Do', 'Not', 'Achieve', 'Transcendence', 'but', 'Go', 'Straight', 'To', 'Oblivion', 'a', 'key', 'to', 'the', 'understanding', 'of', 'all', 'religion', 'is', 'that', 'a', 'god', "'s", 'idea', 'of', 'amusement', 'is', 'Snakes', 'and', 'Ladders', 'with', 'greased', 'rungs', 'Magic', 'glues', 'the', 'Discworld', 'together', '–', 'magic', 'generated', 'by', 'the', 'turning', 'of', 'the', 'world', 'itself', 'magic', 'wound', 'like', 'silk', 'out', 'of', 'the', 'underlying', 'structure', 'of', 'existence', 'to', 'suture', 'the', 'wounds', 'of', 'reality', 'A', 'lot', 'of', 'it', 'ends', 'up', 'in', 'the', 'Ramtop', 'Mountains', 'which', 'stretch', 'from', 'the', 'frozen', 'lands', 'near', 'the', 'Hub', 'all', 'the', 'way', 'via', 'a', 'lengthy', 'archipelago', 'to', 'the', 'warm', 'seas', 'which', 'flow', 'endlessly', 'into', 'space', 'over', 'the', 'Rim', 'Raw', 'magic', 'crackles', 'invisibly', 'from', 'peak', 'to', 'peak', 'and', 'earths', 'itself', 'in', 'the', 'mountains', 'It', 'is', 'the', 'Ramtops', 'that', 'supply', 'the', 'world', 'with', 'most', 'of', 'its', 'witches', 'and', 'wizards', 'In', 'the', 'Ramtops', 'the', 'leaves', 'on', 'the', 'trees', 'move', 'even', 'when', 'there', 'is', 'no', 'breeze', 'Rocks', 'go', 'for', 'a', 'stroll', 'of', 'an', 'evening', 'Even', 'the', 'land', 'at', 'times', 'seems', 'alive', 'At', 'times', 'so', 'does', 'the', 
'sky', 'The', 'storm', 'was', 'really', 'giving', 'it', 'everything', 'it', 'had', 'This', 'was', 'its', 'big', 'chance', 'It', 'had', 'spent', 'years', 'hanging', 'around', 'the', 'provinces', 'putting', 'in', 'some', 'useful', 'work', 'as', 'a', 'squall', 'building', 'up', 'experience', 'making', 'contacts', 'occasionally', 'leaping', 'out', 'on', 'unsuspecting', 'shepherds', 'or', 'blasting', 'quite', 'small', 'oak', 'trees', 'Now', 'an', 'opening', 'in', 'the', 'weather', 'had', 'given', 'it', 'an', 'opportunity', 'to', 'strut', 'its', 'hour', 'and', 'it', 'was', 'building', 'up', 'its', 'role', 'in', 'the', 'hope', 'of', 'being', 'spotted', 'by', 'one', 'of', 'the', 'big', 'climates', 'It', 'was', 'a', 'good', 'storm', 'There', 'was', 'quite', 'effective', 'projection', 'and', 'passion', 'there', 'and', 'critics', 'agreed', 'that', 'if', 'it', 'would', 'only', 'learn', 'to', 'control', 'its', 'thunder', 'it', 'would', 'be', 'in', 'years', 'to', 'come', 'a', 'storm', 'to', 'watch', 'The', 'woods', 'roared', 'their', 'applause', 'and', 'were', 'full', 'of', 'mists', 'and', 'flying', 'leaves']
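
To see what the tokenizer adds, it can help to compare it with a naive whitespace split. The sketch below uses only plain Python on one sentence from the page. It is not what TextBlob does internally: with str.split, punctuation stays glued to the words, whereas text.words drops it and also splits contractions (see 'have' and "n't" in the output above).

```python
import string

sentence = "Lightning stabbed at the earth erratically, like an inefficient assassin."

# Naive approach: split on whitespace only
naive_tokens = sentence.split()
print(naive_tokens[-1])  # assassin.

# Slightly better: strip leading and trailing punctuation from each token
stripped_tokens = [token.strip(string.punctuation) for token in naive_tokens]
print(stripped_tokens[-1])  # assassin
```

Even the second version would leave contractions and possessives untouched, which is one reason to rely on a dedicated tokenizer.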

Your turn:

How many words are there in the first pdf page of Wyrd Sisters?

Sentences

Extracting sentences is just as easy with the sentences attribute.

Let’s extract the first 10 sentences:

text.sentences[:10]
[Sentence(" 
 Terry Pratchett 
  
 Wyrd Sisters 
  
 (Starring Three Witches, also kings, daggers, crowns, storms, dwarfs, cats, ghosts, spectres, 
 apes, bandits, demons, forests, heirs, jesters, tortures, trolls, turntables, general rejoicing and 
 drivers alarums.)"),
 Sentence("The wind howled."),
 Sentence("Lightning stabbed at the earth erratically, like an inefficient assassin."),
 Sentence("Thunder rolled back and forth across the dark, rain-lashed hills."),
 Sentence("The night was as black as the inside of a cat."),
 Sentence("It was the kind of night, you could believe, on 
 which gods moved men as though they were pawns on the chessboard of fate."),
 Sentence("In the middle of this 
 elemental storm a fire gleamed among the dripping furze bushes like the madness in a weasel's eye."),
 Sentence("It illuminated three hunched figures."),
 Sentence("As the cauldron bubbled an eldritch voice shrieked: 'When 
 shall we three meet again?'"),
 Sentence("There was a pause.")]

This output is hard to read. We can make it much more readable by printing each sentence followed by blank lines:

for s in text.sentences[:10]:
    print(s)
    print("\n")
 
Terry Pratchett 
 
Wyrd Sisters 
 
(Starring Three Witches, also kings, daggers, crowns, storms, dwarfs, cats, ghosts, spectres, 
apes, bandits, demons, forests, heirs, jesters, tortures, trolls, turntables, general rejoicing and 
drivers alarums.)


The wind howled.


Lightning stabbed at the earth erratically, like an inefficient assassin.


Thunder rolled back and forth across the dark, rain-lashed hills.


The night was as black as the inside of a cat.


It was the kind of night, you could believe, on 
which gods moved men as though they were pawns on the chessboard of fate.


In the middle of this 
elemental storm a fire gleamed among the dripping furze bushes like the madness in a weasel's eye.


It illuminated three hunched figures.


As the cauldron bubbled an eldritch voice shrieked: 'When 
shall we three meet again?'


There was a pause.

In Python strings (as in many other languages), "\n" represents a new line. Because print already ends its output with a new line, print("\n") produces two blank lines.
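
If a single blank line between sentences is preferred, the end parameter of print is one option. The sketch below uses two sentences from the page as plain strings (sample_sentences is an illustrative name); with TextBlob, the loop would run over text.sentences[:10] instead.

```python
# Illustrative sample: two sentences from the page, as plain strings
sample_sentences = ["The wind howled.", "There was a pause."]

# end="\n\n" replaces print's default single newline,
# leaving exactly one blank line after each sentence
for s in sample_sentences:
    print(s, end="\n\n")

# Alternative: build a single string with blank lines between the sentences
joined = "\n\n".join(sample_sentences)
print(joined)
```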

Or you could add lines of hyphens between the sentences:

for s in text.sentences[:10]:
    print(s)
    print("-" * 100)
 
Terry Pratchett 
 
Wyrd Sisters 
 
(Starring Three Witches, also kings, daggers, crowns, storms, dwarfs, cats, ghosts, spectres, 
apes, bandits, demons, forests, heirs, jesters, tortures, trolls, turntables, general rejoicing and 
drivers alarums.)
----------------------------------------------------------------------------------------------------
The wind howled.
----------------------------------------------------------------------------------------------------
Lightning stabbed at the earth erratically, like an inefficient assassin.
----------------------------------------------------------------------------------------------------
Thunder rolled back and forth across the dark, rain-lashed hills.
----------------------------------------------------------------------------------------------------
The night was as black as the inside of a cat.
----------------------------------------------------------------------------------------------------
It was the kind of night, you could believe, on 
which gods moved men as though they were pawns on the chessboard of fate.
----------------------------------------------------------------------------------------------------
In the middle of this 
elemental storm a fire gleamed among the dripping furze bushes like the madness in a weasel's eye.
----------------------------------------------------------------------------------------------------
It illuminated three hunched figures.
----------------------------------------------------------------------------------------------------
As the cauldron bubbled an eldritch voice shrieked: 'When 
shall we three meet again?'
----------------------------------------------------------------------------------------------------
There was a pause.
----------------------------------------------------------------------------------------------------
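
Some sentences are still broken across two lines: these line breaks come from the layout of the PDF, not from the text itself. One possible way to remove them, sketched below on a plain string copied from the output above, is to split on any whitespace and rejoin with single spaces. For a Sentence object, this assumes converting it to a string first with str(s).

```python
# Illustrative sample: one sentence with the line break inherited from the PDF
raw = (
    "It was the kind of night, you could believe, on \n"
    "which gods moved men as though they were pawns on the chessboard of fate."
)

# split() with no argument splits on any run of whitespace (spaces, newlines, tabs);
# joining with a single space collapses them all
cleaned = " ".join(raw.split())
print(cleaned)
```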

Your turn:

  • What is the type of text.sentences?
  • Could you print just the 5th sentence?
  • Just the last sentence?

Word counts

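We already saw that we can extract words with the words property. We can now call the count method on the result to get the frequency of a specific word. By default, count ignores case, so "Gods" and "gods" are counted together; pass case_sensitive=True to count exact matches only.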

text.words.count("gods")
4
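
To get the frequency of every word at once rather than one word at a time, collections.Counter from the standard library is a possible tool. The sketch below runs on a short list of words copied from the output of text.words above (sample_words is an illustrative name) and lowercases them so that 'The' and 'the' are counted together. TextBlob also has a word_counts property that serves a similar purpose; see its documentation for details.

```python
from collections import Counter

# Illustrative sample: the words of one sentence, as returned by text.words
sample_words = ["The", "night", "was", "as", "black", "as",
                "the", "inside", "of", "a", "cat"]

# A plain list count is case-sensitive
print(sample_words.count("the"))  # 1

# Lowercase every word before counting to merge 'The' and 'the'
word_freq = Counter(word.lower() for word in sample_words)
print(word_freq["the"])  # 2
print(word_freq.most_common(2))
```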