# Load packages
import requests
import pymupdf
from textblob import TextBlob
# Download the data
url = "https://funnyengwish.wordpress.com/wp-content/uploads/2017/05/pratchett_terry_wyrd_sisters_-_royallib_ru.pdf"
response = requests.get(url)
# Extract data from pdf
data = response.content
doc = pymupdf.Document(stream=data)
# Create text from first pdf page
page1 = doc[0].get_text()
# Turn text into TextBlob
text = TextBlob(page1)
Text normalization
TextBlob lets you transform text, which is very useful when preparing it for analysis.
Case
There are methods to change the case of TextBlob objects.
For example, title case, which capitalizes the first letter of every word (let’s only print the first 1000 characters):
print(text.title()[:1000])
Terry Pratchett
Wyrd Sisters
(Starring Three Witches, Also Kings, Daggers, Crowns, Storms, Dwarfs, Cats, Ghosts, Spectres,
Apes, Bandits, Demons, Forests, Heirs, Jesters, Tortures, Trolls, Turntables, General Rejoicing And
Drivers Alarums.)
The Wind Howled. Lightning Stabbed At The Earth Erratically, Like An Inefficient Assassin.
Thunder Rolled Back And Forth Across The Dark, Rain-Lashed Hills.
The Night Was As Black As The Inside Of A Cat. It Was The Kind Of Night, You Could Believe, On
Which Gods Moved Men As Though They Were Pawns On The Chessboard Of Fate. In The Middle Of This
Elemental Storm A Fire Gleamed Among The Dripping Furze Bushes Like The Madness In A Weasel'S Eye.
It Illuminated Three Hunched Figures. As The Cauldron Bubbled An Eldritch Voice Shrieked: 'When
Shall We Three Meet Again?'
There Was A Pause.
Finally Another Voice Said, In Far More Ordinary Tones: 'Well, I Can Do Next Tuesday.'
Through The Fathomless Deeps Of Space Swims The Star Turt
Or transformation to upper case:
print(text.upper()[:1000])
TERRY PRATCHETT
WYRD SISTERS
(STARRING THREE WITCHES, ALSO KINGS, DAGGERS, CROWNS, STORMS, DWARFS, CATS, GHOSTS, SPECTRES,
APES, BANDITS, DEMONS, FORESTS, HEIRS, JESTERS, TORTURES, TROLLS, TURNTABLES, GENERAL REJOICING AND
DRIVERS ALARUMS.)
THE WIND HOWLED. LIGHTNING STABBED AT THE EARTH ERRATICALLY, LIKE AN INEFFICIENT ASSASSIN.
THUNDER ROLLED BACK AND FORTH ACROSS THE DARK, RAIN-LASHED HILLS.
THE NIGHT WAS AS BLACK AS THE INSIDE OF A CAT. IT WAS THE KIND OF NIGHT, YOU COULD BELIEVE, ON
WHICH GODS MOVED MEN AS THOUGH THEY WERE PAWNS ON THE CHESSBOARD OF FATE. IN THE MIDDLE OF THIS
ELEMENTAL STORM A FIRE GLEAMED AMONG THE DRIPPING FURZE BUSHES LIKE THE MADNESS IN A WEASEL'S EYE.
IT ILLUMINATED THREE HUNCHED FIGURES. AS THE CAULDRON BUBBLED AN ELDRITCH VOICE SHRIEKED: 'WHEN
SHALL WE THREE MEET AGAIN?'
THERE WAS A PAUSE.
FINALLY ANOTHER VOICE SAID, IN FAR MORE ORDINARY TONES: 'WELL, I CAN DO NEXT TUESDAY.'
THROUGH THE FATHOMLESS DEEPS OF SPACE SWIMS THE STAR TURT
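Note the "Weasel'S" in the title-case output above: the apostrophe is treated as a word boundary. Plain Python strings behave the same way, so the effect can be reproduced without TextBlob. The sketch below assumes that the title method of TextBlob follows str.title (the output above is consistent with that) and shows string.capwords from the standard library as one possible alternative. The sample sentence is taken from the text.

```python
import string

sample = "like the madness in a weasel's eye"

# str.title() starts a new word after any non-letter, including apostrophes.
titled = sample.title()

# string.capwords() splits on whitespace only, so the apostrophe is left alone.
capped = string.capwords(sample)

print(titled)  # Like The Madness In A Weasel'S Eye
print(capped)  # Like The Madness In A Weasel's Eye
```

Keep in mind that string.capwords, when called without a separator, collapses runs of whitespace into single spaces, so line breaks in the original text would be lost.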
Number
The number (singular/plural) of particular words can also be changed:
print(text.words[6])
print(text.words[6].singularize())
Witches
Witch
print(text.words[42])
print(text.words[42].pluralize())
assassin
assassins
Lemmatization
Lemmatization reduces all words to their lemma (dictionary or canonical form) so that inflected words such as “dog” and “dogs” aren’t counted in separate categories in analyses.
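To see why this matters for counting, here is a minimal sketch that uses only the standard library. The token list and the LEMMAS lookup table are hypothetical stand-ins written by hand for illustration; a real lemmatizer derives lemmas from a dictionary instead.

```python
from collections import Counter

# Hypothetical tokens and hand-written lemma table (illustration only).
tokens = ["dog", "dogs", "cat", "dog", "cats"]
LEMMAS = {"dogs": "dog", "cats": "cat"}

# Without lemmatization, inflected forms land in separate categories.
raw_counts = Counter(tokens)

# With lemmatization, each token is mapped to its lemma before counting.
lemma_counts = Counter(LEMMAS.get(token, token) for token in tokens)

print(raw_counts["dog"], raw_counts["dogs"])  # 2 1
print(lemma_counts["dog"], lemma_counts["cat"])  # 3 2
```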
Nouns
The lemmatize method uses "n" (for noun) as its default argument:
print(TextBlob("heirs").words[0].lemmatize())
print(TextBlob("daggers").words[0].lemmatize())
heir
dagger
Be careful: you can’t always trust TextBlob to work properly. It is a very easy library to use, but it has its limitations.
For instance, I am not sure why this one doesn’t work:
print(TextBlob("men").words[0].lemmatize())
men
While this one works fine:
print(TextBlob("policemen").words[0].lemmatize())
policeman
Using the more complex and more powerful NLTK Python library, you can implement the solution suggested here.
Verbs
To lemmatize verbs, you need to pass "v" (for verbs) to the lemmatize method:
print(TextBlob("seen").words[0].lemmatize("v"))
print(TextBlob("seeing").words[0].lemmatize("v"))
print(TextBlob("sees").words[0].lemmatize("v"))
see
see
see
Your turn:
Why is this one not working?
print(TextBlob("saw").words[0].lemmatize("v"))
saw
Examples from the text:
print(TextBlob("starring").words[0].lemmatize("v"))
print(TextBlob("stabbed").words[0].lemmatize("v"))
print(TextBlob("howled").words[0].lemmatize("v"))
print(TextBlob("rejoicing").words[0].lemmatize("v"))
star
stab
howl
rejoice
Adjectives
To lemmatize adjectives, you need to pass "a" (for adjectives) to the lemmatize method:
print(TextBlob("youngest").words[0].lemmatize("a"))
young
Correction
The correct method attempts to correct spelling mistakes:
print(TextBlob("Somethingg with speling mystakes").correct())
Something with spelling mistakes
There are limitations, however, since the method is based on a lexicon and isn’t aware of the relationships between words (and thus cannot correct grammatical errors):
print(TextBlob("Some thingg with speling mystake").correct())
Some things with spelling mistake
An even more obvious example:
print(TextBlob("He drink").correct())
He drink
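The TextBlob documentation describes correct as based on Peter Norvig's "How to Write a Spelling Corrector". The sketch below is a heavily simplified version of that idea, not TextBlob's actual implementation: each word is checked on its own against a lexicon, and unknown words are replaced by the most frequent known word one edit away. The LEXICON and its frequencies are made up for illustration, and the function names are hypothetical.

```python
from collections import Counter

# Toy lexicon with made-up frequencies (a real corrector uses a large corpus).
LEXICON = Counter({"he": 50, "with": 40, "drink": 10, "drinks": 8,
                   "something": 7, "spelling": 5, "mistakes": 4})
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """Return all strings one delete, swap, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + swaps + replaces + inserts)

def correct_word(word):
    """Keep known words; otherwise pick the most frequent known neighbour."""
    if word in LEXICON:
        return word
    candidates = [w for w in edits1(word) if w in LEXICON]
    return max(candidates, key=LEXICON.__getitem__) if candidates else word

def correct_text(sentence):
    """Correct each whitespace-separated word independently of its neighbours."""
    return " ".join(correct_word(w) for w in sentence.lower().split())

print(correct_text("somethingg with speling mystakes"))
print(correct_text("he drink"))
```

Because "he" and "drink" are both in the lexicon, each word passes the check and the sentence comes back unchanged. A word-by-word approach of this kind has no way of noticing that the two words don't agree grammatically.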