TextBlob lets you transform text, which is very useful when preparing for text analysis.
Necessary code from previous sessions
Here is the necessary code from previous sessions, stripped to the minimum:
# Load packages
import requests
import pymupdf
from textblob import TextBlob

# Download the data
url = "https://funnyengwish.wordpress.com/wp-content/uploads/2017/05/pratchett_terry_wyrd_sisters_-_royallib_ru.pdf"
response = requests.get(url)

# Extract data from pdf
data = response.content
doc = pymupdf.Document(stream=data)

# Create text from first pdf page
page1 = doc[0].get_text()

# Turn text into TextBlob
text = TextBlob(page1)
Case
There are methods to change the case of TextBlob objects.
For example, title-case capitalization (let’s only print the first 1000 characters):
print(text.title()[:1000])
Or transformation to upper case:
print(text.upper()[:1000])
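To illustrate why changing case matters when preparing text for analysis, here is a minimal sketch that counts words with and without case normalization. It uses plain Python strings rather than TextBlob objects so that it runs without any download; the `count_words` helper and the sample sentence are made up for this illustration.

```python
from collections import Counter


def count_words(text: str, normalize_case: bool = True) -> Counter:
    """Count whitespace-separated words, optionally lower-casing them first."""
    if normalize_case:
        text = text.lower()
    return Counter(text.split())


# Hypothetical sample sentence, not taken from the PDF
sample = "The witch met the other witch"

# Without normalization, "The" and "the" are counted separately
print(count_words(sample, normalize_case=False))

# With normalization, both are counted as "the"
print(count_words(sample))
```

The same idea should carry over to a TextBlob object by calling its `lower` method before counting.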
Number
The number (singular/plural) of particular words can also be changed:
print(text.words[6])
print(text.words[6].singularize())
print(text.words[42])
print(text.words[42].pluralize())
Lemmatization
Lemmatization reduces all words to their lemma (dictionary or canonical form) so that inflected words such as “dog” and “dogs” aren’t counted in separate categories in analyses.
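The effect on word counts can be sketched without TextBlob at all. In the example below, `LEMMAS` is a tiny hand-written lookup table made up for this illustration (a real lemmatizer relies on a full dictionary), and `count_lemmas` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical hand-written lemma table, for illustration only
LEMMAS = {"dogs": "dog", "heirs": "heir", "daggers": "dagger"}


def count_lemmas(words):
    """Count words after replacing each one by its lemma, when one is known."""
    return Counter(LEMMAS.get(word, word) for word in words)


words = ["dog", "dogs", "dagger", "daggers", "dogs"]

# Raw counts keep "dog" and "dogs" in separate categories
print(Counter(words))

# Lemma counts merge them into a single category
print(count_lemmas(words))
```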
Nouns
The lemmatize method uses "n" (for noun) as its default argument:
print(TextBlob("heirs").words[0].lemmatize())
print(TextBlob("daggers").words[0].lemmatize())
Be careful: you can’t always trust that TextBlob will work properly. It is a very easy library to use, but it has its limitations.
For instance, I am not sure why this one doesn’t work:
print(TextBlob("men").words[0].lemmatize())
While this totally works:
print(TextBlob("policemen").words[0].lemmatize())
Using the more complex and more powerful NLTK Python library, you can implement the solution suggested here.
Verbs
To lemmatize verbs, you need to pass "v" (for verbs) to the lemmatize method:
print(TextBlob("seen").words[0].lemmatize("v"))
print(TextBlob("seeing").words[0].lemmatize("v"))
print(TextBlob("sees").words[0].lemmatize("v"))
Your turn:
Why is this one not working?
print(TextBlob("saw").words[0].lemmatize("v"))
print(TextBlob("starring").words[0].lemmatize("v"))
print(TextBlob("stabbed").words[0].lemmatize("v"))
print(TextBlob("howled").words[0].lemmatize("v"))
Adjectives
To lemmatize adjectives, you need to pass "a" (for adjectives) to the lemmatize method:
print(TextBlob("youngest").words[0].lemmatize("a"))
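The examples above pass the part of speech by hand. For a whole text, one possible approach is to derive "n", "v" or "a" from a part-of-speech tag. The sketch below assumes Penn Treebank style tags (such as "NNS", "VBD" or "JJS") and assumes that each word object has a lemmatize method like the one used above; `penn_to_pos` and `lemmatize_tagged` are hypothetical helpers, not part of TextBlob.

```python
def penn_to_pos(tag: str) -> str:
    """Map a Penn Treebank tag to the one-letter code expected by lemmatize."""
    if tag.startswith("V"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("J"):
        return "a"  # adjectives: JJ, JJR, JJS
    return "n"      # nouns and everything else fall back to the default


def lemmatize_tagged(tagged_words):
    """Lemmatize (word, tag) pairs, choosing the part of speech from the tag.

    Assumption: each word has a lemmatize method accepting "n", "v" or "a".
    """
    return [word.lemmatize(penn_to_pos(tag)) for word, tag in tagged_words]


print(penn_to_pos("VBD"))  # v
print(penn_to_pos("JJS"))  # a
print(penn_to_pos("NNS"))  # n
```

With TextBlob, the (word, tag) pairs could come from the tags property of a TextBlob object, for instance `lemmatize_tagged(text.tags)`. The quality of the result then depends on the tagger as well as on the lemmatizer.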
Correction
The correct method attempts to correct spelling mistakes:
print(TextBlob("Somethingg with speling mystakes").correct())
There are limitations, however, since the method is based on a lexicon and isn’t aware of the relationship between words (and thus cannot correct grammatical errors):
print(TextBlob("Some thingg with speling mystake").correct())
An even more obvious example:
print(TextBlob("He drink").correct())
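To see why a lexicon-based corrector leaves such sentences untouched, here is a minimal sketch of a word-by-word corrector. It is not TextBlob's implementation: the `LEXICON` and its frequencies are made up, and only candidates one edit away are considered. Each word is looked up on its own, so any word found in the lexicon is accepted as is, whatever its neighbours are.

```python
# Hypothetical toy lexicon with made-up frequencies
LEXICON = {
    "he": 50, "drink": 10, "drinks": 8, "some": 30, "thing": 20,
    "something": 15, "with": 40, "spelling": 5, "mistake": 4, "mistakes": 3,
}
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """Return all strings one delete, swap, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + swaps + replaces + inserts)


def correct_word(word):
    """Keep known words; otherwise pick the most frequent known neighbour."""
    if word in LEXICON:
        return word
    candidates = [w for w in edits1(word) if w in LEXICON]
    return max(candidates, key=LEXICON.get) if candidates else word


def correct_text(text):
    """Correct each word independently (lower-cases the text for simplicity)."""
    return " ".join(correct_word(w) for w in text.lower().split())


# Both words are in the lexicon, so the grammatical error goes unnoticed
print(correct_text("He drink"))  # he drink

# Each word is fixed in isolation, so "some thing" is never merged
print(correct_text("Some thingg with speling mystake"))  # some thing with spelling mistake
```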