Text normalization

Author

Marie-Hélène Burle

TextBlob lets you transform text, which is very useful when preparing it for analysis.

Here is the necessary code from previous sessions, stripped to the minimum:

# Load packages
import requests
import pymupdf
from textblob import TextBlob

# Download the data
url = "https://funnyengwish.wordpress.com/wp-content/uploads/2017/05/pratchett_terry_wyrd_sisters_-_royallib_ru.pdf"
response = requests.get(url)

# Extract data from pdf
data = response.content
doc = pymupdf.Document(stream=data)

# Create text from first pdf page
page1 = doc[0].get_text()

# Turn text into TextBlob
text = TextBlob(page1)

Case

There are methods to change the case of TextBlob objects.

For example, title case, which capitalizes the first letter of each word (let’s only print the first 1000 characters):

print(text.title()[:1000])
 
Terry Pratchett 
 
Wyrd Sisters 
 
(Starring Three Witches, Also Kings, Daggers, Crowns, Storms, Dwarfs, Cats, Ghosts, Spectres, 
Apes, Bandits, Demons, Forests, Heirs, Jesters, Tortures, Trolls, Turntables, General Rejoicing And 
Drivers Alarums.)  
 
The Wind Howled. Lightning Stabbed At The Earth Erratically, Like An Inefficient Assassin. 
Thunder Rolled Back And Forth Across The Dark, Rain-Lashed Hills. 
The Night Was As Black As The Inside Of A Cat. It Was The Kind Of Night, You Could Believe, On 
Which Gods Moved Men As Though They Were Pawns On The Chessboard Of Fate. In The Middle Of This 
Elemental Storm A Fire Gleamed Among The Dripping Furze Bushes Like The Madness In A Weasel'S Eye. 
It Illuminated Three Hunched Figures. As The Cauldron Bubbled An Eldritch Voice Shrieked: 'When 
Shall We Three Meet Again?' 
There Was A Pause. 
Finally Another Voice Said, In Far More Ordinary Tones: 'Well, I Can Do Next Tuesday.' 
 
Through The Fathomless Deeps Of Space Swims The Star Turt

Or transformation to upper case:

print(text.upper()[:1000])
 
TERRY PRATCHETT 
 
WYRD SISTERS 
 
(STARRING THREE WITCHES, ALSO KINGS, DAGGERS, CROWNS, STORMS, DWARFS, CATS, GHOSTS, SPECTRES, 
APES, BANDITS, DEMONS, FORESTS, HEIRS, JESTERS, TORTURES, TROLLS, TURNTABLES, GENERAL REJOICING AND 
DRIVERS ALARUMS.)  
 
THE WIND HOWLED. LIGHTNING STABBED AT THE EARTH ERRATICALLY, LIKE AN INEFFICIENT ASSASSIN. 
THUNDER ROLLED BACK AND FORTH ACROSS THE DARK, RAIN-LASHED HILLS. 
THE NIGHT WAS AS BLACK AS THE INSIDE OF A CAT. IT WAS THE KIND OF NIGHT, YOU COULD BELIEVE, ON 
WHICH GODS MOVED MEN AS THOUGH THEY WERE PAWNS ON THE CHESSBOARD OF FATE. IN THE MIDDLE OF THIS 
ELEMENTAL STORM A FIRE GLEAMED AMONG THE DRIPPING FURZE BUSHES LIKE THE MADNESS IN A WEASEL'S EYE. 
IT ILLUMINATED THREE HUNCHED FIGURES. AS THE CAULDRON BUBBLED AN ELDRITCH VOICE SHRIEKED: 'WHEN 
SHALL WE THREE MEET AGAIN?' 
THERE WAS A PAUSE. 
FINALLY ANOTHER VOICE SAID, IN FAR MORE ORDINARY TONES: 'WELL, I CAN DO NEXT TUESDAY.' 
 
THROUGH THE FATHOMLESS DEEPS OF SPACE SWIMS THE STAR TURT
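
Note the “Weasel'S Eye” in the title-case output above. TextBlob’s title method appears to defer to Python’s built-in str.title, which treats an apostrophe as a word boundary. The sketch below uses plain Python strings (no TextBlob) to show this behaviour and one possible workaround, string.capwords, which only splits on whitespace. The sample phrase comes from the text; the variable names are illustrative.

```python
import string

sample = "like the madness in a weasel's eye"

# str.title() capitalizes the first letter after any non-letter,
# so the "s" following the apostrophe is capitalized too
titled = sample.title()
print(titled)

# string.capwords() splits on whitespace only, so the apostrophe is left alone
capped = string.capwords(sample)
print(capped)
```

Keep in mind that string.capwords also collapses runs of whitespace into single spaces, so the line breaks of the page would be lost.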

Number

The number (singular/plural) of particular words can also be changed:

print(text.words[6])
print(text.words[6].singularize())
Witches
Witch

print(text.words[42])
print(text.words[42].pluralize())
assassin
assassins

Lemmatization

Lemmatization reduces all words to their lemma (dictionary or canonical form) so that inflected words such as “dog” and “dogs” aren’t counted in separate categories in analyses.
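
To see why this matters for counting, here is a small sketch in plain Python. The lemma table is hypothetical and written by hand for the illustration; a real lemmatizer looks words up in a lexical database instead.

```python
from collections import Counter

# Hypothetical hand-written lemma table, for illustration only
lemmas = {"dogs": "dog", "witches": "witch"}

tokens = ["dog", "dogs", "witch", "witches", "witches"]

# Without lemmatization, inflected forms land in separate categories
raw_counts = Counter(tokens)

# With lemmatization, each token is replaced by its lemma before counting
lemma_counts = Counter(lemmas.get(token, token) for token in tokens)

print(raw_counts)
print(lemma_counts)
```

In the second count, “dog” and “dogs” fall into a single category, as do “witch” and “witches”.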

Nouns

By default, the lemmatize method treats words as nouns (the equivalent of passing "n"):

print(TextBlob("heirs").words[0].lemmatize())
print(TextBlob("daggers").words[0].lemmatize())
heir
dagger

Be careful: you can’t always trust TextBlob to work properly. It is a very easy library to use, but it has its limitations.

For instance, I am not sure why this one doesn’t work:

print(TextBlob("men").words[0].lemmatize())
men

While this totally works:

print(TextBlob("policemen").words[0].lemmatize())
policeman

Using the more complex and more powerful NLTK Python library, you can implement the solution suggested here.

Verbs

To lemmatize verbs, you need to pass "v" (for verbs) to the lemmatize method:

print(TextBlob("seen").words[0].lemmatize("v"))
print(TextBlob("seeing").words[0].lemmatize("v"))
print(TextBlob("sees").words[0].lemmatize("v"))
see
see
see

Your turn:

Why is this one not working?

print(TextBlob("saw").words[0].lemmatize("v"))
saw

Examples from the text:

print(TextBlob("starring").words[0].lemmatize("v"))
print(TextBlob("stabbed").words[0].lemmatize("v"))
print(TextBlob("howled").words[0].lemmatize("v"))
print(TextBlob("rejoicing").words[0].lemmatize("v"))
star
stab
howl
rejoice

Adjectives

To lemmatize adjectives, you need to pass "a" (for adjectives) to the lemmatize method:

print(TextBlob("youngest").words[0].lemmatize("a"))
young

Correction

The correct method attempts to correct spelling mistakes:

print(TextBlob("Somethingg with speling mystakes").correct())
Something with spelling mistakes

There are limitations, however. The method is based on a lexicon and isn’t aware of the relationships between words, so it cannot correct grammatical errors:

print(TextBlob("Some thingg with speling mystake").correct())
Some things with spelling mistake

An even more obvious example:

print(TextBlob("He drink").correct())
He drink
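
This limitation can be illustrated with a toy word-by-word corrector. The sketch below is not TextBlob’s algorithm: it swaps in difflib from the Python standard library and a tiny hypothetical lexicon, and it lowercases the input for simplicity. It only shows why a corrector that checks each word in isolation leaves “He drink” untouched.

```python
import difflib

# Tiny hypothetical lexicon, for illustration only
lexicon = ["he", "she", "drink", "drinks",
           "something", "with", "spelling", "mistakes"]

def correct_word(word):
    """Return the word itself if known, else its closest lexicon entry."""
    if word in lexicon:
        return word
    matches = difflib.get_close_matches(word, lexicon, n=1)
    return matches[0] if matches else word

def correct_text(text):
    """Correct each word independently, ignoring its neighbours."""
    return " ".join(correct_word(word) for word in text.lower().split())

fixed = correct_text("Somethingg with speling mystakes")
unchanged = correct_text("He drink")

print(fixed)      # something with spelling mistakes
print(unchanged)  # he drink
```

Both “he” and “drink” are valid entries in the lexicon, so nothing flags the phrase: catching the error would require looking at the two words together.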