Text processing

Author

Marie-Hélène Burle

In this section, we will use the TextBlob package for part-of-speech tagging and basic tokenization.

Here is the necessary code from the previous session, stripped to the minimum:

# Load packages
import requests
import pymupdf

# Download the data
url = "https://funnyengwish.wordpress.com/wp-content/uploads/2017/05/pratchett_terry_wyrd_sisters_-_royallib_ru.pdf"
response = requests.get(url)

# Extract data from pdf
data = response.content
doc = pymupdf.Document(stream=data)

# Create text from first pdf page
page1 = doc[0].get_text()

TextBlob

TextBlob is the NLP package that we will use in this course for tagging, tokenization, normalization, and sentiment analysis.

We first need to load it in our session:

from textblob import TextBlob

Before we can use TextBlob on our text, we need to convert the page1 string into a TextBlob object:

text = TextBlob(page1)
type(text)

Part of speech tagging

Part-of-speech (POS) tagging assigns a part-of-speech tag to each word of a text.

You can do this simply by using the tags property on a TextBlob object: text.tags. Because the first PDF page contains many words, the output would be very long.

The result is a list:

type(text.tags)

And each element of the list is a tuple:

type(text.tags[0])

We don’t have to print the full list. Let’s only print the first 20 tuples:

text.tags[:20]

The tags are those of the Penn Treebank tag set:

Tag Description
CC Coordinating conjunction
CD Cardinal number
DT Determiner
EX Existential there
FW Foreign word
IN Preposition or subordinating conjunction
JJ Adjective
JJR Adjective, comparative
JJS Adjective, superlative
LS List item marker
MD Modal
NN Noun, singular or mass
NNS Noun, plural
NNP Proper noun, singular
NNPS Proper noun, plural
PDT Predeterminer
POS Possessive ending
PRP Personal pronoun
PRP$ Possessive pronoun
RB Adverb
RBR Adverb, comparative
RBS Adverb, superlative
RP Particle
SYM Symbol
TO to
UH Interjection
VB Verb, base form
VBD Verb, past tense
VBG Verb, gerund or present participle
VBN Verb, past participle
VBP Verb, non-3rd person singular present
VBZ Verb, 3rd person singular present
WDT Wh-determiner
WP Wh-pronoun
WP$ Possessive wh-pronoun
WRB Wh-adverb
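
Once the tags are available as a list of (word, tag) tuples, a natural next step is to count how often each tag occurs or to filter words by tag. The sketch below shows one way to do this with the standard library. It assumes a small hand-written list, sample_tags, in the same shape as text.tags; the words and tags in it are hypothetical and are not actual TextBlob output.

```python
from collections import Counter

# Hypothetical sample in the same shape as text.tags:
# a list of (word, tag) tuples. The tags were assigned by hand
# for illustration and are not real TextBlob output.
sample_tags = [
    ("The", "DT"),
    ("cat", "NN"),
    ("chased", "VBD"),
    ("two", "CD"),
    ("mice", "NNS"),
    ("and", "CC"),
    ("the", "DT"),
    ("dog", "NN"),
    ("barked", "VBD"),
]

# Count how many times each tag appears
tag_counts = Counter(tag for _, tag in sample_tags)

# Keep only the words tagged as nouns (all noun tags start with "NN")
nouns = [word for word, tag in sample_tags if tag.startswith("NN")]

print(tag_counts)
print(nouns)
```

With a real TextBlob object, the same two expressions should work by replacing sample_tags with text.tags.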

Noun phrase extraction

Noun phrases can be extracted with the noun_phrases property:

print(text.noun_phrases)

The output is a WordList object:

type(text.noun_phrases)

Tokenization

Words

TextBlob makes it easy to extract words with the words attribute:

print(text.words)

Your turn:

How many words are there on the first PDF page of Wyrd Sisters?

Sentences

Extracting sentences is just as easy with the sentences attribute.

Let’s extract the first 10 sentences:

text.sentences[:10]

The output, however, is hard to read. We can make it much more readable by printing each sentence followed by blank lines:

for s in text.sentences[:10]:
    print(s)
    print("\n")

In Python strings (as in many other languages), "\n" represents a new line.
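
One detail worth checking is how many line breaks print("\n") actually produces, since print already ends its output with a newline. The sketch below captures what each call writes so the difference becomes visible. The helper name captured is hypothetical and only used for this illustration.

```python
import io
from contextlib import redirect_stdout


def captured(func):
    """Run func and return everything it printed as a string."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        func()
    return buffer.getvalue()


# print() appends its own newline, so print("\n") emits two of them
two_newlines = captured(lambda: print("\n"))

# print() with no argument emits a single newline (one blank line)
one_newline = captured(lambda: print())

print(repr(two_newlines))  # '\n\n'
print(repr(one_newline))   # '\n'
```

If a single blank line between sentences is enough, calling print() with no argument is therefore sufficient.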

Or you could add lines of hyphens between the sentences:

for s in text.sentences[:10]:
    print(s)
    print("-" * 100)

Your turn:

  • What is the type of text.sentences?
  • Could you print just the 5th sentence?
  • Just the last sentence?

Word counts

We already saw that we can extract words with the words attribute. We can then call its count method to get the frequency of a specific word. By default, the match is case-insensitive:

text.words.count("gods")
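
The count method looks up one word at a time. To get the frequencies of all words at once, one option is collections.Counter from the standard library. The sketch below is only an approximation of the idea: it uses a hypothetical sample string and a simple regular expression as a stand-in for tokenization, which is not TextBlob's tokenizer and may split text differently.

```python
import re
from collections import Counter

# Hypothetical sample text, used only for illustration
sample = "The gods play games. The Gods do not explain the rules."

# Rough stand-in for tokenization: lowercase the text, then keep runs
# of letters and apostrophes. This is NOT TextBlob's tokenizer.
tokens = re.findall(r"[a-z']+", sample.lower())

# Build a frequency table for every word in one pass
word_counts = Counter(tokens)

print(word_counts["gods"])
print(word_counts.most_common(3))
```

Lowercasing before counting mirrors a case-insensitive lookup, so "gods" and "Gods" are counted together.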