In this section, we will use the TextBlob package for part-of-speech tagging and basic tokenization.
Necessary code from previous sessions
Here is the necessary code from the previous session, stripped to the minimum:
# Load packages
import requests
import pymupdf

# Download the data
url = "https://funnyengwish.wordpress.com/wp-content/uploads/2017/05/pratchett_terry_wyrd_sisters_-_royallib_ru.pdf"
response = requests.get(url)

# Extract data from pdf
data = response.content
doc = pymupdf.Document(stream=data)

# Create text from first pdf page
page1 = doc[0].get_text()
TextBlob
TextBlob is the NLP package that we will use in this course for tagging, tokenization, normalization, and sentiment analysis.
We first need to load it in our session:
from textblob import TextBlob
Before we can use TextBlob on our text, we need to convert the page1 string into a TextBlob object:
text = TextBlob(page1)
type(text)
To tag each word with its part of speech, use the tags property on a TextBlob object: text.tags. Because there are a lot of words on the first pdf page, printing it in full would create a very long output.
The result is a list:
type(text.tags)
And each element of the list is a tuple:
type(text.tags[0])
We don’t have to print the full list. Let’s only print the first 20 tuples:
text.tags[:20]
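Because each element of the tags list is a (word, tag) tuple, the usual list tools apply to it. Here is a minimal sketch that keeps only the nouns. It assumes a small hand-written list, sample_tags, in the same shape as text.tags (it is not real output from the book), and it assumes Penn Treebank-style tags, in which noun tags start with "NN":

```python
# Hand-written stand-in with the same shape as text.tags:
# a list of (word, tag) tuples. Illustrative only, not real output.
sample_tags = [
    ("The", "DT"),
    ("cat", "NN"),
    ("sat", "VBD"),
    ("on", "IN"),
    ("the", "DT"),
    ("mats", "NNS"),
]

# Unpack each tuple and keep the words whose tag starts with "NN"
nouns = [word for word, tag in sample_tags if tag.startswith("NN")]
print(nouns)
```

With the real data, replacing sample_tags with text.tags should give the nouns of the first pdf page.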
Noun phrases can be extracted with the noun_phrases property:
print(text.noun_phrases)
The output is a WordList object:
type(text.noun_phrases)
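A WordList can be iterated over like a regular Python list, so it can be passed to standard tools such as collections.Counter. The sketch below finds the most frequent noun phrase. It uses a hand-written list, sample_phrases, as a stand-in for text.noun_phrases (the values are made up for illustration):

```python
from collections import Counter

# Hand-written stand-in for text.noun_phrases (illustrative values only)
sample_phrases = ["old castle", "stormy night", "old castle"]

# Count how many times each phrase occurs
phrase_counts = Counter(sample_phrases)

# most_common(1) returns a list with one (phrase, count) tuple
top_phrase, top_count = phrase_counts.most_common(1)[0]
print(top_phrase, top_count)
```

Passing text.noun_phrases instead of sample_phrases should work the same way.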
Tokenization
Words
TextBlob makes it easy to extract words with the words attribute:
print(text.words)
Your turn:
How many words are there in the first pdf page of Wyrd Sisters?
Sentences
Extracting sentences is just as easy with the sentences attribute.
Let’s extract the first 10 sentences:
text.sentences[:10]
The output, however, is hard to read. We can make it a lot more readable by printing the sentences one at a time, separated by blank lines:
for s in text.sentences[:10]:
    print(s)
    print("\n")
In Python strings (as in many other languages), "\n" represents a new line.
Or you could add lines of hyphens between the sentences:
for s in text.sentences[:10]:
    print(s)
    print("-" * 100)
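The same kind of output can also be built in one step by joining the sentences with a separator string. This is a minimal sketch using a hand-written list of strings, sample_sentences, as a stand-in for the sentences of the book (the strings and the separator width of 20 are arbitrary choices for illustration):

```python
# Hand-written stand-ins for the sentences (illustrative only)
sample_sentences = [
    "This is the first sentence.",
    "This is the second sentence.",
    "This is the third sentence.",
]

# A line of hyphens with a newline before and after it
separator = "\n" + "-" * 20 + "\n"

# join() places the separator between the items, not after the last one
output = separator.join(sample_sentences)
print(output)
```

The elements of text.sentences are TextBlob objects rather than plain strings, so with the real data each one would presumably need converting first, for example with separator.join(str(s) for s in text.sentences[:10]).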
Your turn:
What is the type of text.sentences?
Could you print just the 5th sentence?
Just the last sentence?
Word counts
We already saw that we can extract words with the words attribute. Now, we can add the count method to get the frequency of specific words.
text.words.count("gods")
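The count method looks up one word at a time. To get the frequency of every word at once, one option is collections.Counter from the standard library. The sketch below uses a hand-written word list, sample_words, as a stand-in for text.words (illustrative values only). According to the TextBlob documentation, count ignores case by default, so the sketch lowercases each word to mimic that behaviour:

```python
from collections import Counter

# Hand-written stand-in for text.words (illustrative values only)
sample_words = ["The", "gods", "play", "games", "and", "the", "Gods", "watch"]

# Lowercase each word so that "Gods" and "gods" are counted together
word_counts = Counter(word.lower() for word in sample_words)

# Frequency of a single word
print(word_counts["gods"])

# The two most frequent words with their counts
print(word_counts.most_common(2))
```

With the real data, replacing sample_words with text.words should give the frequencies for the first pdf page.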