Text processing

Author

Marie-Hélène Burle

In this section, we will use the TextBlob package for part-of-speech tagging and basic tokenization.

Necessary code from previous sessions

Here is the necessary code from the previous session, stripped to the minimum:

# Load packages
import requests
import pymupdf

# Download the data
url = "https://funnyengwish.wordpress.com/wp-content/uploads/2017/05/pratchett_terry_wyrd_sisters_-_royallib_ru.pdf"
response = requests.get(url)

# Extract data from pdf
data = response.content
doc = pymupdf.Document(stream=data)

# Create text from first pdf page
page1 = doc[0].get_text()

TextBlob

TextBlob is the NLP package that we will use in this course for tagging, tokenization, normalization, and sentiment analysis.

We first need to load it in our session:

from textblob import TextBlob

Before we can use TextBlob on our text, we need to convert the page1 string into a TextBlob object:

text = TextBlob(page1)
type(text)

Part of speech tagging

Part-of-speech (POS) tagging assigns a part-of-speech tag to each word of a text.

You can do this with the tags property of a TextBlob object: text.tags. Because the first PDF page contains a lot of words, printing the full result would create a very long output.

The result is a list:

type(text.tags)

And each element of the list is a tuple:

type(text.tags[0])

We don’t have to print the full list. Let’s only print the first 20 tuples:

text.tags[:20]

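Since the tags come back as a list of (word, tag) tuples, ordinary Python tools work on them. The sketch below counts how often each tag occurs. It uses a small hand-written list (`sample_tags`, a hypothetical stand-in, not real TextBlob output) so that it runs on its own; with TextBlob loaded, you could pass `text.tags` instead.

```python
from collections import Counter

# Hand-written stand-in for text.tags (illustrative, not real TextBlob output)
sample_tags = [
    ("The", "DT"), ("old", "JJ"), ("witch", "NN"), ("stirred", "VBD"),
    ("the", "DT"), ("cauldron", "NN"), ("and", "CC"), ("the", "DT"),
    ("cats", "NNS"), ("watched", "VBD"),
]

# Count the tags, ignoring the words themselves
tag_counts = Counter(tag for _, tag in sample_tags)

# Show the three most frequent tags
print(tag_counts.most_common(3))
```

A frequency table like this gives a quick overview of a page without printing every tuple.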

Reference: Penn Treebank tagset (University of Pennsylvania)

Tag	Description
CC	Coordinating conjunction
CD	Cardinal number
DT	Determiner
EX	Existential there
FW	Foreign word
IN	Preposition or subordinating conjunction
JJ	Adjective
JJR	Adjective, comparative
JJS	Adjective, superlative
LS	List item marker
MD	Modal
NN	Noun, singular or mass
NNS	Noun, plural
NNP	Proper noun, singular
NNPS	Proper noun, plural
PDT	Predeterminer
POS	Possessive ending
PRP	Personal pronoun
PRP$	Possessive pronoun
RB	Adverb
RBR	Adverb, comparative
RBS	Adverb, superlative
RP	Particle
SYM	Symbol
TO	to
UH	Interjection
VB	Verb, base form
VBD	Verb, past tense
VBG	Verb, gerund or present participle
VBN	Verb, past participle
VBP	Verb, non-3rd person singular present
VBZ	Verb, 3rd person singular present
WDT	Wh-determiner
WP	Wh-pronoun
WP$	Possessive wh-pronoun
WRB	Wh-adverb
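The tagset above makes it possible to filter words by category. Because all noun tags start with NN and all verb tags start with VB, a prefix test is enough. This is a minimal sketch on a hand-written list (`sample_tags`, a hypothetical stand-in for `text.tags`, not real TextBlob output):

```python
# Hand-written stand-in for text.tags (illustrative, not real TextBlob output)
sample_tags = [
    ("The", "DT"), ("old", "JJ"), ("witch", "NN"), ("stirred", "VBD"),
    ("the", "DT"), ("cauldron", "NN"), ("and", "CC"), ("the", "DT"),
    ("cats", "NNS"), ("watched", "VBD"),
]

# NN, NNS, NNP, and NNPS all start with "NN"
nouns = [word for word, tag in sample_tags if tag.startswith("NN")]

# VB, VBD, VBG, VBN, VBP, and VBZ all start with "VB"
verbs = [word for word, tag in sample_tags if tag.startswith("VB")]

print(nouns)  # ['witch', 'cauldron', 'cats']
print(verbs)  # ['stirred', 'watched']
```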

Noun phrase extraction

Noun phrases can be extracted with the noun_phrases property:

print(text.noun_phrases)

The output is a WordList object:

type(text.noun_phrases)

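To our knowledge, WordList is a subclass of the built-in list, so the usual list operations should apply. As a sketch under that assumption, here is how repeated noun phrases could be reduced to a sorted list of unique ones. `sample_phrases` is a hypothetical hand-written list, not real TextBlob output; with TextBlob loaded, you could use `text.noun_phrases` in its place.

```python
# Hand-written stand-in for text.noun_phrases (illustrative only)
sample_phrases = ["old witch", "black cauldron", "old witch", "stormy night"]

# set() drops the duplicates, sorted() returns them in alphabetical order
unique_phrases = sorted(set(sample_phrases))

print(unique_phrases)  # ['black cauldron', 'old witch', 'stormy night']
```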

Tokenization

Words

TextBlob makes it easy to extract words with the words property:

print(text.words)

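To get a feel for what word tokenization involves, here is a rough sketch that uses a regular expression from the standard library. This is not TextBlob's tokenizer, which handles many more cases; the pattern and the sample string are assumptions made for illustration only.

```python
import re

# Made-up sample text for illustration
sample = "The wind howled. Lightning stabbed at the earth!"

# Keep runs of letters (with an optional apostrophe part such as "don't");
# punctuation and whitespace are dropped
words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", sample)

print(words)
# ['The', 'wind', 'howled', 'Lightning', 'stabbed', 'at', 'the', 'earth']
```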

Your turn:

How many words are there in the first pdf page of Wyrd Sisters?

Sentences

Extracting sentences is just as easy with the sentences attribute.

Let’s extract the first 10 sentences:

text.sentences[:10]

This output is hard to read. We can make it much more readable by printing each sentence followed by blank lines:

for s in text.sentences[:10]:
    print(s)
    print("\n")

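In Python strings (as in many other languages), "\n" represents a new line. Because print already ends its output with a newline, print("\n") leaves two blank lines; print() with no argument leaves a single one.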

Or you could add lines of hyphens between the sentences:

for s in text.sentences[:10]:
    print(s)
    print("-" * 100)

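An alternative to the loops above is to build the whole output in one step with str.join, which puts the separator between the sentences only, not after the last one. The sketch below uses a hand-written list of strings (`sample_sentences`, a hypothetical stand-in); with TextBlob loaded, you could iterate over `text.sentences[:10]` instead, and str() converts each sentence to a plain string.

```python
# Hand-written stand-in for text.sentences (illustrative only)
sample_sentences = [
    "The wind howled.",
    "Lightning stabbed at the earth.",
    "The night was dark.",
]

# Join the sentences with one blank line between each pair
output = "\n\n".join(str(s) for s in sample_sentences)

print(output)
```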

Your turn:

What is the type of text.sentences?
Could you print just the 5th sentence?
Just the last sentence?

Word counts

We already saw that we can extract words with the words property. To get the frequency of a specific word, we can call the count method on the result:

text.words.count("gods")

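Counting one word at a time is fine for a quick check. To get the frequencies of all words at once, one option is collections.Counter from the standard library. This is a minimal sketch on a hand-written list (`sample_words`, a hypothetical stand-in for `text.words`, not real TextBlob output); lowercasing the words first is our own choice, so that "Gods" and "gods" are counted together.

```python
from collections import Counter

# Hand-written stand-in for text.words (illustrative only)
sample_words = ["The", "gods", "watched", "the", "witches",
                "and", "the", "Gods", "laughed"]

# Lowercase each word so that capitalization does not split the counts
word_counts = Counter(word.lower() for word in sample_words)

print(word_counts["gods"])         # 2
print(word_counts.most_common(2))  # [('the', 3), ('gods', 2)]
```

Looking up a word that never occurs returns 0 rather than raising an error, which is convenient when checking many words.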