Text normalization

Author

Marie-Hélène Burle

TextBlob can transform text, which is very useful when preparing text for analysis.

Here is the necessary code from previous sessions, stripped to the minimum:

# Load packages
import requests
import pymupdf
from textblob import TextBlob

# Download the data
url = "https://funnyengwish.wordpress.com/wp-content/uploads/2017/05/pratchett_terry_wyrd_sisters_-_royallib_ru.pdf"
response = requests.get(url)

# Extract data from pdf
data = response.content
doc = pymupdf.Document(stream=data)

# Create text from first pdf page
page1 = doc[0].get_text()

# Turn text into TextBlob
text = TextBlob(page1)

Case

There are methods to change the case of TextBlob objects.

For example, title case (let’s print only the first 1000 characters):

print(text.title()[:1000])

Or transformation to upper case:

print(text.upper()[:1000])
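
One reason to normalize case before an analysis is so that the same word written in different cases is counted once. Here is a minimal sketch of that idea using plain Python strings rather than TextBlob; the sample sentence is made up for illustration and is not taken from the PDF:

```python
from collections import Counter

# Made-up sample sentence (an assumption, not text from the PDF)
sample = "The witch saw the Witch. THE WITCH smiled."

# Strip basic punctuation and split on whitespace
tokens = sample.replace(".", "").split()

# Without case normalization, each spelling is a separate key
raw_counts = Counter(tokens)

# After lower-casing, all case variants collapse into one key
normalized_counts = Counter(token.lower() for token in tokens)

print(raw_counts["witch"], normalized_counts["witch"])
```

The first count only sees the exact spelling "witch", while the second also includes "Witch" and "WITCH".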

Number

The number (singular/plural) of particular words can also be changed:

print(text.words[6])
print(text.words[6].singularize())
print(text.words[42])
print(text.words[42].pluralize())

Lemmatization

Lemmatization reduces words to their lemma (dictionary or canonical form) so that inflected forms such as “dog” and “dogs” aren’t counted as separate categories in analyses.
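
To make the effect on counts concrete, here is a small sketch in plain Python. The lemma table is hand-written and hypothetical, for illustration only; a real lemmatizer looks words up in a lexicon such as WordNet instead:

```python
from collections import Counter

# Hypothetical hand-written lemma table, for illustration only
toy_lemmas = {"dogs": "dog", "daggers": "dagger", "heirs": "heir"}

tokens = ["dog", "dogs", "dagger", "daggers", "heirs", "dogs"]

# Counting raw tokens splits each word across its inflected forms
raw_counts = Counter(tokens)

# Mapping each token to its lemma first merges those forms
lemma_counts = Counter(toy_lemmas.get(token, token) for token in tokens)

print(raw_counts)
print(lemma_counts)
```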

Nouns

The lemmatize method uses "n" (for noun) as its default argument:

print(TextBlob("heirs").words[0].lemmatize())
print(TextBlob("daggers").words[0].lemmatize())

Be careful: you can’t always trust that TextBlob will work properly. It is a very easy library to use, but it has its limitations.

For instance, I am not sure why this one doesn’t work:

print(TextBlob("men").words[0].lemmatize())

While this totally works:

print(TextBlob("policemen").words[0].lemmatize())

Using the more complex and more powerful NLTK Python library, you can implement the solution suggested here.

Verbs

To lemmatize verbs, you need to pass "v" (for verbs) to the lemmatize method:

print(TextBlob("seen").words[0].lemmatize("v"))
print(TextBlob("seeing").words[0].lemmatize("v"))
print(TextBlob("sees").words[0].lemmatize("v"))

Your turn:

Why is this one not working?

print(TextBlob("saw").words[0].lemmatize("v"))

Examples from the text:

print(TextBlob("starring").words[0].lemmatize("v"))
print(TextBlob("stabbed").words[0].lemmatize("v"))
print(TextBlob("howled").words[0].lemmatize("v"))
print(TextBlob("rejoicing").words[0].lemmatize("v"))

Adjectives

To lemmatize adjectives, you need to pass "a" (for adjectives) to the lemmatize method:

print(TextBlob("youngest").words[0].lemmatize("a"))

Correction

The correct method attempts to correct spelling mistakes:

print(TextBlob("Somethingg with speling mystakes").correct())

There are limitations, however: the method is based on a lexicon and isn’t aware of the relationships between words, so it cannot correct grammatical errors:

print(TextBlob("Some thingg with speling mystake").correct())

An even more obvious example:

print(TextBlob("He drink").correct())
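
To see why a lexicon-based approach leaves "He drink" untouched, here is a rough sketch of a word-by-word corrector in plain Python. It is not TextBlob's actual implementation: the tiny lexicon, its frequencies, and the function names edits1, correct_word and correct_text are all assumptions made for illustration.

```python
import string

# Tiny hypothetical lexicon with made-up frequencies; real correctors
# use counts gathered from a large corpus
lexicon = {"he": 50, "drink": 10, "drinks": 8, "spelling": 5, "with": 40}


def edits1(word):
    """Return all strings one edit (delete, swap, replace, insert) away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    swaps = [left + right[1] + right[0] + right[2:]
             for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right
                for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + swaps + replaces + inserts)


def correct_word(word):
    """Correct one word in isolation, ignoring its neighbours."""
    lower = word.lower()
    if lower in lexicon:
        return word  # known words are never changed
    candidates = [w for w in edits1(lower) if w in lexicon]
    if not candidates:
        return word  # nothing close enough in the lexicon
    # Pick the most frequent known word among the candidates
    return max(candidates, key=lexicon.get)


def correct_text(text):
    """Correct each whitespace-separated word independently."""
    return " ".join(correct_word(word) for word in text.split())


print(correct_text("speling"))
print(correct_text("He drink"))
```

Because "he" and "drink" are both valid entries in the lexicon, each word passes the check on its own, and nothing in this design looks at the pair together to notice that the verb doesn't agree with its subject.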