Playing with text

Author

Marie-Hélène Burle

There are fancy tools to scrape the web and play with text. In preparation for those, in this section, we will download a text file from the internet and play with it using simple commands.

Downloading a text file from a URL

First, we need to load the urllib.request module from the Python standard library. It contains functions to deal with URLs:

import urllib.request

The snippet of text we will play with is in a text file containing the very beginning of the novel Going Postal by Terry Pratchett and located at the URL https://mint.westdri.ca/python/data/pratchett.txt. We can create a variable that we call url (we can call it whatever we want) and that contains the string of the URL:

url = "https://mint.westdri.ca/python/data/pratchett.txt"
print(url)
https://mint.westdri.ca/python/data/pratchett.txt
type(url)
str

To download a text file from a URL, we use the urllib.request.urlopen function:

urllib.request.urlopen(url)
<http.client.HTTPResponse at 0x72ffe7f45c90>

This returns an HTTPResponse object. It is not very useful in this form, but we can get the text out of it by applying the read method:

urllib.request.urlopen(url).read()
b'They say that the prospect of being hanged in the morning concentrates a man\'s mind wonderfully; unfortunately, what the mind inevitably concentrates on is that, in the morning, it will be in a body that is going to be hanged.\nThe man going to be hanged had been named Moist von Lipwig by doting if unwise parents, but he was not going to embarrass the name, insofar as that was still possible, by being hung under it. To the world in general, and particularly on that bit of it known as the death warrant, he was Alfred Spangler.\nAnd he took a more positive approach to the situation and had concentrated his mind on the prospect of not being hanged in the morning, and, most particularly, on the prospect of removing all the crumbling mortar from around a stone in his cell wall with a spoon. So far the work had taken him five weeks and reduced the spoon to something like a nail file. Fortunately, no one ever came to change the bedding here, or else they would have discovered the world\'s heaviest mattress.\nIt was a large and heavy stone that was currently the object of his attentions, and, at some point, a huge staple had been hammered into it as an anchor for manacles.\nMoist sat down facing the wall, gripped the iron ring in both hands, braced his legs against the stones on either side, and heaved.\nHis shoulders caught fire, and a red mist filled his vision, but the block slid out with a faint and inappropriate tinkling noise. Moist managed to ease it away from the hole and peered inside.\nAt the far end was another block, and the mortar around it looked suspiciously strong and fresh.\nJust in front of it was a new spoon. It was shiny.\nAs he studied it, he heard the clapping behind him. He turned his head, tendons twanging a little riff of agony, and saw several of the wardens watching him through the bars.\n"Well done, Mr. Spangler!" said one of them. "Ron here owes me five dollars! I told him you were a sticker!! 
\'He\'s a sticker,\' I said!"\n"You set this up, did you, Mr. Wilkinson?" said Moist weakly, watching the glint of light on the spoon.\n"Oh, not us, sir. Lord Vetinari\'s orders. He insists that all condemned prisoners should be offered the prospect of freedom."\n"Freedom? But there\'s a damn great stone through there!"\n"Yes, there is that, sir, yes, there is that," said the warden. "It\'s only the prospect, you see. Not actual free freedom as such. Hah, that\'d be a bit daft, eh?"\n"I suppose so, yes," said Moist. He didn\'t say "you bastards." The wardens had treated him quite civilly these past six weeks, and he made a point of getting on with people. He was very, very good at it. People skills were part of his stock-in-trade; they were nearly the whole of it.\nBesides, these people had big sticks. So, speaking carefully, he added: "Some people might consider this cruel, Mr. Wilkinson."\n"Yes, sir, we asked him about that, sir, but he said no, it wasn\'t. He said it provided"--his forehead wrinkled "--occ-you-pay-shun-all ther-rap-py, healthy exercise, prevented moping, and offered that greatest of all treasures, which is Hope, sir."\n"Hope," muttered Moist glumly.\n"Not upset, are you, sir?"\n"Upset? Why should I be upset, Mr. Wilkinson?"\n"Only the last bloke we had in this cell, he managed to get down that drain, sir. Very small man. Very agile."\n'

We can save our text in a new variable:

encoded_text = urllib.request.urlopen(url).read()

Now, encoded_text is not of a very convenient type:

type(encoded_text)
bytes

Before we can really start playing with it, we want to convert it to a string by decoding it:

text = encoded_text.decode("utf-8")
type(text)
str
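The download and decoding steps above can be wrapped in a small helper. This is only a minimal sketch, not part of the lesson itself: the function name fetch_text and the default values for encoding and timeout are assumptions chosen for illustration. It uses a with block so that the response is closed once its content has been read.

```python
import urllib.request


def fetch_text(url, encoding="utf-8", timeout=10):
    """Download the resource at `url` and return it as a string.

    `fetch_text`, `encoding`, and `timeout` are illustrative names and
    defaults; they are not defined anywhere else in this section.
    """
    # The with block closes the response once we are done reading
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()      # bytes
    return raw.decode(encoding)    # str


# Example use (requires internet access):
# text = fetch_text("https://mint.westdri.ca/python/data/pratchett.txt")
```

If the file were not encoded in UTF-8, the decode step could raise a UnicodeDecodeError, which is why the encoding is left as a parameter here.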

We now have a string, which is much easier to work with. Let’s print our text:

print(text)
They say that the prospect of being hanged in the morning concentrates a man's mind wonderfully; unfortunately, what the mind inevitably concentrates on is that, in the morning, it will be in a body that is going to be hanged.
The man going to be hanged had been named Moist von Lipwig by doting if unwise parents, but he was not going to embarrass the name, insofar as that was still possible, by being hung under it. To the world in general, and particularly on that bit of it known as the death warrant, he was Alfred Spangler.
And he took a more positive approach to the situation and had concentrated his mind on the prospect of not being hanged in the morning, and, most particularly, on the prospect of removing all the crumbling mortar from around a stone in his cell wall with a spoon. So far the work had taken him five weeks and reduced the spoon to something like a nail file. Fortunately, no one ever came to change the bedding here, or else they would have discovered the world's heaviest mattress.
It was a large and heavy stone that was currently the object of his attentions, and, at some point, a huge staple had been hammered into it as an anchor for manacles.
Moist sat down facing the wall, gripped the iron ring in both hands, braced his legs against the stones on either side, and heaved.
His shoulders caught fire, and a red mist filled his vision, but the block slid out with a faint and inappropriate tinkling noise. Moist managed to ease it away from the hole and peered inside.
At the far end was another block, and the mortar around it looked suspiciously strong and fresh.
Just in front of it was a new spoon. It was shiny.
As he studied it, he heard the clapping behind him. He turned his head, tendons twanging a little riff of agony, and saw several of the wardens watching him through the bars.
"Well done, Mr. Spangler!" said one of them. "Ron here owes me five dollars! I told him you were a sticker!! 'He's a sticker,' I said!"
"You set this up, did you, Mr. Wilkinson?" said Moist weakly, watching the glint of light on the spoon.
"Oh, not us, sir. Lord Vetinari's orders. He insists that all condemned prisoners should be offered the prospect of freedom."
"Freedom? But there's a damn great stone through there!"
"Yes, there is that, sir, yes, there is that," said the warden. "It's only the prospect, you see. Not actual free freedom as such. Hah, that'd be a bit daft, eh?"
"I suppose so, yes," said Moist. He didn't say "you bastards." The wardens had treated him quite civilly these past six weeks, and he made a point of getting on with people. He was very, very good at it. People skills were part of his stock-in-trade; they were nearly the whole of it.
Besides, these people had big sticks. So, speaking carefully, he added: "Some people might consider this cruel, Mr. Wilkinson."
"Yes, sir, we asked him about that, sir, but he said no, it wasn't. He said it provided"--his forehead wrinkled "--occ-you-pay-shun-all ther-rap-py, healthy exercise, prevented moping, and offered that greatest of all treasures, which is Hope, sir."
"Hope," muttered Moist glumly.
"Not upset, are you, sir?"
"Upset? Why should I be upset, Mr. Wilkinson?"
"Only the last bloke we had in this cell, he managed to get down that drain, sir. Very small man. Very agile."

And now we can start playing with the data 🙂

Counting things

One of the things we can do with our text is to count things.

Counting characters

For instance, we can count the number of characters thanks to the len function:

len(text)
3294

Counting words

Something else that we can count is the number of occurrences of the name of the main character (“Moist”—I know … what a crazy name):

text.count("Moist")
6

Or we could try to see how many words there are in this text.

Your turn:

How would you go about this?

Another method to count the number of words is to use the split method:

words = text.split()
print(words)
['They', 'say', 'that', 'the', 'prospect', 'of', 'being', 'hanged', 'in', 'the', 'morning', 'concentrates', 'a', "man's", 'mind', 'wonderfully;', 'unfortunately,', 'what', 'the', 'mind', 'inevitably', 'concentrates', 'on', 'is', 'that,', 'in', 'the', 'morning,', 'it', 'will', 'be', 'in', 'a', 'body', 'that', 'is', 'going', 'to', 'be', 'hanged.', 'The', 'man', 'going', 'to', 'be', 'hanged', 'had', 'been', 'named', 'Moist', 'von', 'Lipwig', 'by', 'doting', 'if', 'unwise', 'parents,', 'but', 'he', 'was', 'not', 'going', 'to', 'embarrass', 'the', 'name,', 'insofar', 'as', 'that', 'was', 'still', 'possible,', 'by', 'being', 'hung', 'under', 'it.', 'To', 'the', 'world', 'in', 'general,', 'and', 'particularly', 'on', 'that', 'bit', 'of', 'it', 'known', 'as', 'the', 'death', 'warrant,', 'he', 'was', 'Alfred', 'Spangler.', 'And', 'he', 'took', 'a', 'more', 'positive', 'approach', 'to', 'the', 'situation', 'and', 'had', 'concentrated', 'his', 'mind', 'on', 'the', 'prospect', 'of', 'not', 'being', 'hanged', 'in', 'the', 'morning,', 'and,', 'most', 'particularly,', 'on', 'the', 'prospect', 'of', 'removing', 'all', 'the', 'crumbling', 'mortar', 'from', 'around', 'a', 'stone', 'in', 'his', 'cell', 'wall', 'with', 'a', 'spoon.', 'So', 'far', 'the', 'work', 'had', 'taken', 'him', 'five', 'weeks', 'and', 'reduced', 'the', 'spoon', 'to', 'something', 'like', 'a', 'nail', 'file.', 'Fortunately,', 'no', 'one', 'ever', 'came', 'to', 'change', 'the', 'bedding', 'here,', 'or', 'else', 'they', 'would', 'have', 'discovered', 'the', "world's", 'heaviest', 'mattress.', 'It', 'was', 'a', 'large', 'and', 'heavy', 'stone', 'that', 'was', 'currently', 'the', 'object', 'of', 'his', 'attentions,', 'and,', 'at', 'some', 'point,', 'a', 'huge', 'staple', 'had', 'been', 'hammered', 'into', 'it', 'as', 'an', 'anchor', 'for', 'manacles.', 'Moist', 'sat', 'down', 'facing', 'the', 'wall,', 'gripped', 'the', 'iron', 'ring', 'in', 'both', 'hands,', 'braced', 'his', 'legs', 'against', 'the', 'stones', 'on', 
'either', 'side,', 'and', 'heaved.', 'His', 'shoulders', 'caught', 'fire,', 'and', 'a', 'red', 'mist', 'filled', 'his', 'vision,', 'but', 'the', 'block', 'slid', 'out', 'with', 'a', 'faint', 'and', 'inappropriate', 'tinkling', 'noise.', 'Moist', 'managed', 'to', 'ease', 'it', 'away', 'from', 'the', 'hole', 'and', 'peered', 'inside.', 'At', 'the', 'far', 'end', 'was', 'another', 'block,', 'and', 'the', 'mortar', 'around', 'it', 'looked', 'suspiciously', 'strong', 'and', 'fresh.', 'Just', 'in', 'front', 'of', 'it', 'was', 'a', 'new', 'spoon.', 'It', 'was', 'shiny.', 'As', 'he', 'studied', 'it,', 'he', 'heard', 'the', 'clapping', 'behind', 'him.', 'He', 'turned', 'his', 'head,', 'tendons', 'twanging', 'a', 'little', 'riff', 'of', 'agony,', 'and', 'saw', 'several', 'of', 'the', 'wardens', 'watching', 'him', 'through', 'the', 'bars.', '"Well', 'done,', 'Mr.', 'Spangler!"', 'said', 'one', 'of', 'them.', '"Ron', 'here', 'owes', 'me', 'five', 'dollars!', 'I', 'told', 'him', 'you', 'were', 'a', 'sticker!!', "'He's", 'a', "sticker,'", 'I', 'said!"', '"You', 'set', 'this', 'up,', 'did', 'you,', 'Mr.', 'Wilkinson?"', 'said', 'Moist', 'weakly,', 'watching', 'the', 'glint', 'of', 'light', 'on', 'the', 'spoon.', '"Oh,', 'not', 'us,', 'sir.', 'Lord', "Vetinari's", 'orders.', 'He', 'insists', 'that', 'all', 'condemned', 'prisoners', 'should', 'be', 'offered', 'the', 'prospect', 'of', 'freedom."', '"Freedom?', 'But', "there's", 'a', 'damn', 'great', 'stone', 'through', 'there!"', '"Yes,', 'there', 'is', 'that,', 'sir,', 'yes,', 'there', 'is', 'that,"', 'said', 'the', 'warden.', '"It\'s', 'only', 'the', 'prospect,', 'you', 'see.', 'Not', 'actual', 'free', 'freedom', 'as', 'such.', 'Hah,', "that'd", 'be', 'a', 'bit', 'daft,', 'eh?"', '"I', 'suppose', 'so,', 'yes,"', 'said', 'Moist.', 'He', "didn't", 'say', '"you', 'bastards."', 'The', 'wardens', 'had', 'treated', 'him', 'quite', 'civilly', 'these', 'past', 'six', 'weeks,', 'and', 'he', 'made', 'a', 'point', 'of', 'getting', 'on', 
'with', 'people.', 'He', 'was', 'very,', 'very', 'good', 'at', 'it.', 'People', 'skills', 'were', 'part', 'of', 'his', 'stock-in-trade;', 'they', 'were', 'nearly', 'the', 'whole', 'of', 'it.', 'Besides,', 'these', 'people', 'had', 'big', 'sticks.', 'So,', 'speaking', 'carefully,', 'he', 'added:', '"Some', 'people', 'might', 'consider', 'this', 'cruel,', 'Mr.', 'Wilkinson."', '"Yes,', 'sir,', 'we', 'asked', 'him', 'about', 'that,', 'sir,', 'but', 'he', 'said', 'no,', 'it', "wasn't.", 'He', 'said', 'it', 'provided"--his', 'forehead', 'wrinkled', '"--occ-you-pay-shun-all', 'ther-rap-py,', 'healthy', 'exercise,', 'prevented', 'moping,', 'and', 'offered', 'that', 'greatest', 'of', 'all', 'treasures,', 'which', 'is', 'Hope,', 'sir."', '"Hope,"', 'muttered', 'Moist', 'glumly.', '"Not', 'upset,', 'are', 'you,', 'sir?"', '"Upset?', 'Why', 'should', 'I', 'be', 'upset,', 'Mr.', 'Wilkinson?"', '"Only', 'the', 'last', 'bloke', 'we', 'had', 'in', 'this', 'cell,', 'he', 'managed', 'to', 'get', 'down', 'that', 'drain,', 'sir.', 'Very', 'small', 'man.', 'Very', 'agile."']

Your turn:

What is the type of the variable words?

To get its length, we can use the len function:

len(words)
590
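To see what split does in more detail, here is a small sketch on a made-up sample string (the variable names are illustrative and not used elsewhere in this section). Called without an argument, split cuts on any run of whitespace, including line breaks. Called with an explicit separator, it cuts on that separator only.

```python
sample = "It was shiny.\nAs he studied it,  he heard the clapping."

# With no argument, split cuts on any run of whitespace
# (spaces, tabs, line breaks) and never returns empty strings
words_any = sample.split()
print(words_any)

# With an explicit separator, the line break is left in place and
# two consecutive spaces produce an empty string
words_space = sample.split(" ")
print(words_space)
```

This is why the plain text.split() used above is usually the more convenient choice for counting words.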

Now, let’s try to count how many times the word “the” appears in the text.

Your turn:

We could use:

text.count("the") + text.count("The")
49

but it won’t answer our question. Why?

Instead, we should use the list of words that we called words and count how many of them are equal to “the”. We do this with a loop:

# We set our counter (the number of occurrences) to zero:
occurrences = 0

# And now we can use a loop to test the words one by one and add 1 to our counter each time the equality returns true
for word in words:
    if word == "the" or word == "The":
        occurrences += 1

print(occurrences)        
36

An alternative syntax that looks a lot more elegant is the following:

sum(word == "the" or word == "The" for word in words)
36
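Lists also have a count method. Unlike the count method of strings, which counts substrings, it counts the elements that are exactly equal to its argument. The following sketch compares the two on a short made-up sentence (the sample text and variable names are assumptions for illustration):

```python
sample = "The man said the other thing, then the wardens left."
sample_words = sample.split()

# str.count counts substrings, so "the" is also found inside
# "other" and "then"
substring_hits = sample.count("the") + sample.count("The")

# list.count only counts elements that are exactly equal to its argument
word_hits = sample_words.count("the") + sample_words.count("The")

print(substring_hits, word_hits)
```

Note that list.count is still affected by punctuation: an element such as "the," would not be counted as "the".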

However, elegance and short syntax don’t mean fast code.

We can benchmark Python code very easily when we use Jupyter or IPython by using the magic %%timeit at the top of a code cell.

Let’s try it:

%%timeit

# We set our counter (the number of occurrences) to zero:
occurrences = 0

# And now we can use a loop to test the words one by one and add 1 to our counter each time the equality returns true
for word in words:
    if word == "the" or word == "The":
        occurrences += 1
9.52 μs ± 510 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

I removed the print function so that we don’t end up printing the result a bunch of times: timeit does a lot of tests and takes the average. At each run, we would have a printed result!

And for the other method:

%%timeit

occurrences = sum(word == "the" or word == "The" for word in words)
24.2 μs ± 243 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

To make a fair comparison with the previous expression, I am not printing the result here either, but assigning it to a variable.

As you can see, the short neat-looking expression takes more than twice the time of the not so nice-looking one. Without benchmarking, it is very hard to predict what code is efficient.
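Outside of Jupyter or IPython, a similar comparison can be made with the timeit module from the Python standard library. The sketch below is only an illustration under a few assumptions: it uses a small stand-in list rather than the words list from this section, and the function names are made up. Wrapping the code in functions also changes the timings somewhat, so the numbers will not match the ones above and will vary from one machine to another.

```python
import timeit

# A small stand-in list; with the data of this section, use `words` instead
sample_words = "the cat and The dog saw the bird near the tree".split() * 50


def count_with_loop(word_list):
    """Count "the" and "The" with an explicit loop."""
    occurrences = 0
    for word in word_list:
        if word == "the" or word == "The":
            occurrences += 1
    return occurrences


def count_with_sum(word_list):
    """Count "the" and "The" with sum and a generator expression."""
    return sum(word == "the" or word == "The" for word in word_list)


# timeit.timeit runs the callable `number` times and
# returns the total time in seconds
runs = 1000
loop_time = timeit.timeit(lambda: count_with_loop(sample_words), number=runs)
sum_time = timeit.timeit(lambda: count_with_sum(sample_words), number=runs)

print(f"loop: {loop_time / runs * 1e6:.2f} μs per call")
print(f"sum:  {sum_time / runs * 1e6:.2f} μs per call")
```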

Removing punctuation

If you were trying to count other words, things would get even harder: the word “sir” for instance appears in sir., sir,, sir.", sir?". To do a cleaner job and get our method to work for any word, we need to remove the punctuation.

Step one, we remove the punctuation from our text string:

import string

clean_text = text.translate(str.maketrans('', '', string.punctuation))
print(clean_text)
They say that the prospect of being hanged in the morning concentrates a mans mind wonderfully unfortunately what the mind inevitably concentrates on is that in the morning it will be in a body that is going to be hanged
The man going to be hanged had been named Moist von Lipwig by doting if unwise parents but he was not going to embarrass the name insofar as that was still possible by being hung under it To the world in general and particularly on that bit of it known as the death warrant he was Alfred Spangler
And he took a more positive approach to the situation and had concentrated his mind on the prospect of not being hanged in the morning and most particularly on the prospect of removing all the crumbling mortar from around a stone in his cell wall with a spoon So far the work had taken him five weeks and reduced the spoon to something like a nail file Fortunately no one ever came to change the bedding here or else they would have discovered the worlds heaviest mattress
It was a large and heavy stone that was currently the object of his attentions and at some point a huge staple had been hammered into it as an anchor for manacles
Moist sat down facing the wall gripped the iron ring in both hands braced his legs against the stones on either side and heaved
His shoulders caught fire and a red mist filled his vision but the block slid out with a faint and inappropriate tinkling noise Moist managed to ease it away from the hole and peered inside
At the far end was another block and the mortar around it looked suspiciously strong and fresh
Just in front of it was a new spoon It was shiny
As he studied it he heard the clapping behind him He turned his head tendons twanging a little riff of agony and saw several of the wardens watching him through the bars
Well done Mr Spangler said one of them Ron here owes me five dollars I told him you were a sticker Hes a sticker I said
You set this up did you Mr Wilkinson said Moist weakly watching the glint of light on the spoon
Oh not us sir Lord Vetinaris orders He insists that all condemned prisoners should be offered the prospect of freedom
Freedom But theres a damn great stone through there
Yes there is that sir yes there is that said the warden Its only the prospect you see Not actual free freedom as such Hah thatd be a bit daft eh
I suppose so yes said Moist He didnt say you bastards The wardens had treated him quite civilly these past six weeks and he made a point of getting on with people He was very very good at it People skills were part of his stockintrade they were nearly the whole of it
Besides these people had big sticks So speaking carefully he added Some people might consider this cruel Mr Wilkinson
Yes sir we asked him about that sir but he said no it wasnt He said it providedhis forehead wrinkled occyoupayshunall therrappy healthy exercise prevented moping and offered that greatest of all treasures which is Hope sir
Hope muttered Moist glumly
Not upset are you sir
Upset Why should I be upset Mr Wilkinson
Only the last bloke we had in this cell he managed to get down that drain sir Very small man Very agile
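The translate and maketrans combination used above can look cryptic. Here is a small sketch on a made-up sample string (the variable names are illustrative) showing what each piece does. Note that string.punctuation only contains ASCII punctuation, so typographic characters such as curly quotes would be left in place.

```python
import string

# The characters that will be removed
print(string.punctuation)

sample = '"Hope," muttered Moist glumly. It\'s a stock-in-trade!'

# The first two arguments of str.maketrans map characters to replacements
# (none here); every character in the third argument is deleted
table = str.maketrans("", "", string.punctuation)
clean_sample = sample.translate(table)
print(clean_sample)
```

As in the full text, apostrophes and hyphens are removed too, which merges words such as "It's" into "Its".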

And now we split it into words:

clean_words = clean_text.split()
print(clean_words)
['They', 'say', 'that', 'the', 'prospect', 'of', 'being', 'hanged', 'in', 'the', 'morning', 'concentrates', 'a', 'mans', 'mind', 'wonderfully', 'unfortunately', 'what', 'the', 'mind', 'inevitably', 'concentrates', 'on', 'is', 'that', 'in', 'the', 'morning', 'it', 'will', 'be', 'in', 'a', 'body', 'that', 'is', 'going', 'to', 'be', 'hanged', 'The', 'man', 'going', 'to', 'be', 'hanged', 'had', 'been', 'named', 'Moist', 'von', 'Lipwig', 'by', 'doting', 'if', 'unwise', 'parents', 'but', 'he', 'was', 'not', 'going', 'to', 'embarrass', 'the', 'name', 'insofar', 'as', 'that', 'was', 'still', 'possible', 'by', 'being', 'hung', 'under', 'it', 'To', 'the', 'world', 'in', 'general', 'and', 'particularly', 'on', 'that', 'bit', 'of', 'it', 'known', 'as', 'the', 'death', 'warrant', 'he', 'was', 'Alfred', 'Spangler', 'And', 'he', 'took', 'a', 'more', 'positive', 'approach', 'to', 'the', 'situation', 'and', 'had', 'concentrated', 'his', 'mind', 'on', 'the', 'prospect', 'of', 'not', 'being', 'hanged', 'in', 'the', 'morning', 'and', 'most', 'particularly', 'on', 'the', 'prospect', 'of', 'removing', 'all', 'the', 'crumbling', 'mortar', 'from', 'around', 'a', 'stone', 'in', 'his', 'cell', 'wall', 'with', 'a', 'spoon', 'So', 'far', 'the', 'work', 'had', 'taken', 'him', 'five', 'weeks', 'and', 'reduced', 'the', 'spoon', 'to', 'something', 'like', 'a', 'nail', 'file', 'Fortunately', 'no', 'one', 'ever', 'came', 'to', 'change', 'the', 'bedding', 'here', 'or', 'else', 'they', 'would', 'have', 'discovered', 'the', 'worlds', 'heaviest', 'mattress', 'It', 'was', 'a', 'large', 'and', 'heavy', 'stone', 'that', 'was', 'currently', 'the', 'object', 'of', 'his', 'attentions', 'and', 'at', 'some', 'point', 'a', 'huge', 'staple', 'had', 'been', 'hammered', 'into', 'it', 'as', 'an', 'anchor', 'for', 'manacles', 'Moist', 'sat', 'down', 'facing', 'the', 'wall', 'gripped', 'the', 'iron', 'ring', 'in', 'both', 'hands', 'braced', 'his', 'legs', 'against', 'the', 'stones', 'on', 'either', 'side', 'and', 
'heaved', 'His', 'shoulders', 'caught', 'fire', 'and', 'a', 'red', 'mist', 'filled', 'his', 'vision', 'but', 'the', 'block', 'slid', 'out', 'with', 'a', 'faint', 'and', 'inappropriate', 'tinkling', 'noise', 'Moist', 'managed', 'to', 'ease', 'it', 'away', 'from', 'the', 'hole', 'and', 'peered', 'inside', 'At', 'the', 'far', 'end', 'was', 'another', 'block', 'and', 'the', 'mortar', 'around', 'it', 'looked', 'suspiciously', 'strong', 'and', 'fresh', 'Just', 'in', 'front', 'of', 'it', 'was', 'a', 'new', 'spoon', 'It', 'was', 'shiny', 'As', 'he', 'studied', 'it', 'he', 'heard', 'the', 'clapping', 'behind', 'him', 'He', 'turned', 'his', 'head', 'tendons', 'twanging', 'a', 'little', 'riff', 'of', 'agony', 'and', 'saw', 'several', 'of', 'the', 'wardens', 'watching', 'him', 'through', 'the', 'bars', 'Well', 'done', 'Mr', 'Spangler', 'said', 'one', 'of', 'them', 'Ron', 'here', 'owes', 'me', 'five', 'dollars', 'I', 'told', 'him', 'you', 'were', 'a', 'sticker', 'Hes', 'a', 'sticker', 'I', 'said', 'You', 'set', 'this', 'up', 'did', 'you', 'Mr', 'Wilkinson', 'said', 'Moist', 'weakly', 'watching', 'the', 'glint', 'of', 'light', 'on', 'the', 'spoon', 'Oh', 'not', 'us', 'sir', 'Lord', 'Vetinaris', 'orders', 'He', 'insists', 'that', 'all', 'condemned', 'prisoners', 'should', 'be', 'offered', 'the', 'prospect', 'of', 'freedom', 'Freedom', 'But', 'theres', 'a', 'damn', 'great', 'stone', 'through', 'there', 'Yes', 'there', 'is', 'that', 'sir', 'yes', 'there', 'is', 'that', 'said', 'the', 'warden', 'Its', 'only', 'the', 'prospect', 'you', 'see', 'Not', 'actual', 'free', 'freedom', 'as', 'such', 'Hah', 'thatd', 'be', 'a', 'bit', 'daft', 'eh', 'I', 'suppose', 'so', 'yes', 'said', 'Moist', 'He', 'didnt', 'say', 'you', 'bastards', 'The', 'wardens', 'had', 'treated', 'him', 'quite', 'civilly', 'these', 'past', 'six', 'weeks', 'and', 'he', 'made', 'a', 'point', 'of', 'getting', 'on', 'with', 'people', 'He', 'was', 'very', 'very', 'good', 'at', 'it', 'People', 'skills', 'were', 'part', 'of', 
'his', 'stockintrade', 'they', 'were', 'nearly', 'the', 'whole', 'of', 'it', 'Besides', 'these', 'people', 'had', 'big', 'sticks', 'So', 'speaking', 'carefully', 'he', 'added', 'Some', 'people', 'might', 'consider', 'this', 'cruel', 'Mr', 'Wilkinson', 'Yes', 'sir', 'we', 'asked', 'him', 'about', 'that', 'sir', 'but', 'he', 'said', 'no', 'it', 'wasnt', 'He', 'said', 'it', 'providedhis', 'forehead', 'wrinkled', 'occyoupayshunall', 'therrappy', 'healthy', 'exercise', 'prevented', 'moping', 'and', 'offered', 'that', 'greatest', 'of', 'all', 'treasures', 'which', 'is', 'Hope', 'sir', 'Hope', 'muttered', 'Moist', 'glumly', 'Not', 'upset', 'are', 'you', 'sir', 'Upset', 'Why', 'should', 'I', 'be', 'upset', 'Mr', 'Wilkinson', 'Only', 'the', 'last', 'bloke', 'we', 'had', 'in', 'this', 'cell', 'he', 'managed', 'to', 'get', 'down', 'that', 'drain', 'sir', 'Very', 'small', 'man', 'Very', 'agile']

This is a much better list to work from and this one will work for any word. For the word “sir” for instance, we would do:

occurrences = 0

for word in clean_words:
    if word == "sir" or word == "Sir":
        occurrences += 1

print(occurrences)
7
clean_text.lower()
'they say that the prospect of being hanged in the morning concentrates a mans mind wonderfully unfortunately what the mind inevitably concentrates on is that in the morning it will be in a body that is going to be hanged\nthe man going to be hanged had been named moist von lipwig by doting if unwise parents but he was not going to embarrass the name insofar as that was still possible by being hung under it to the world in general and particularly on that bit of it known as the death warrant he was alfred spangler\nand he took a more positive approach to the situation and had concentrated his mind on the prospect of not being hanged in the morning and most particularly on the prospect of removing all the crumbling mortar from around a stone in his cell wall with a spoon so far the work had taken him five weeks and reduced the spoon to something like a nail file fortunately no one ever came to change the bedding here or else they would have discovered the worlds heaviest mattress\nit was a large and heavy stone that was currently the object of his attentions and at some point a huge staple had been hammered into it as an anchor for manacles\nmoist sat down facing the wall gripped the iron ring in both hands braced his legs against the stones on either side and heaved\nhis shoulders caught fire and a red mist filled his vision but the block slid out with a faint and inappropriate tinkling noise moist managed to ease it away from the hole and peered inside\nat the far end was another block and the mortar around it looked suspiciously strong and fresh\njust in front of it was a new spoon it was shiny\nas he studied it he heard the clapping behind him he turned his head tendons twanging a little riff of agony and saw several of the wardens watching him through the bars\nwell done mr spangler said one of them ron here owes me five dollars i told him you were a sticker hes a sticker i said\nyou set this up did you mr wilkinson said moist weakly watching the glint of light 
on the spoon\noh not us sir lord vetinaris orders he insists that all condemned prisoners should be offered the prospect of freedom\nfreedom but theres a damn great stone through there\nyes there is that sir yes there is that said the warden its only the prospect you see not actual free freedom as such hah thatd be a bit daft eh\ni suppose so yes said moist he didnt say you bastards the wardens had treated him quite civilly these past six weeks and he made a point of getting on with people he was very very good at it people skills were part of his stockintrade they were nearly the whole of it\nbesides these people had big sticks so speaking carefully he added some people might consider this cruel mr wilkinson\nyes sir we asked him about that sir but he said no it wasnt he said it providedhis forehead wrinkled occyoupayshunall therrappy healthy exercise prevented moping and offered that greatest of all treasures which is hope sir\nhope muttered moist glumly\nnot upset are you sir\nupset why should i be upset mr wilkinson\nonly the last bloke we had in this cell he managed to get down that drain sir very small man very agile\n'

Removing case

Now, having to look for the word of interest with and without a capital letter as we have been doing so far is not the most robust method: what if the text had “SIR” in all caps? After all, Death in Pratchett novels speaks in all caps! Of course, we could add this as a third option (if word == "sir" or word == "Sir" or word == "SIR"), but that is becoming a little tedious.

A better solution is to turn the whole text into lower case before splitting it into words. That way we don’t have to worry about case.

Let’s remove all capital letters:

final_text = clean_text.lower()

Now we split it into words:

final_words = final_text.split()
print(final_words)
['they', 'say', 'that', 'the', 'prospect', 'of', 'being', 'hanged', 'in', 'the', 'morning', 'concentrates', 'a', 'mans', 'mind', 'wonderfully', 'unfortunately', 'what', 'the', 'mind', 'inevitably', 'concentrates', 'on', 'is', 'that', 'in', 'the', 'morning', 'it', 'will', 'be', 'in', 'a', 'body', 'that', 'is', 'going', 'to', 'be', 'hanged', 'the', 'man', 'going', 'to', 'be', 'hanged', 'had', 'been', 'named', 'moist', 'von', 'lipwig', 'by', 'doting', 'if', 'unwise', 'parents', 'but', 'he', 'was', 'not', 'going', 'to', 'embarrass', 'the', 'name', 'insofar', 'as', 'that', 'was', 'still', 'possible', 'by', 'being', 'hung', 'under', 'it', 'to', 'the', 'world', 'in', 'general', 'and', 'particularly', 'on', 'that', 'bit', 'of', 'it', 'known', 'as', 'the', 'death', 'warrant', 'he', 'was', 'alfred', 'spangler', 'and', 'he', 'took', 'a', 'more', 'positive', 'approach', 'to', 'the', 'situation', 'and', 'had', 'concentrated', 'his', 'mind', 'on', 'the', 'prospect', 'of', 'not', 'being', 'hanged', 'in', 'the', 'morning', 'and', 'most', 'particularly', 'on', 'the', 'prospect', 'of', 'removing', 'all', 'the', 'crumbling', 'mortar', 'from', 'around', 'a', 'stone', 'in', 'his', 'cell', 'wall', 'with', 'a', 'spoon', 'so', 'far', 'the', 'work', 'had', 'taken', 'him', 'five', 'weeks', 'and', 'reduced', 'the', 'spoon', 'to', 'something', 'like', 'a', 'nail', 'file', 'fortunately', 'no', 'one', 'ever', 'came', 'to', 'change', 'the', 'bedding', 'here', 'or', 'else', 'they', 'would', 'have', 'discovered', 'the', 'worlds', 'heaviest', 'mattress', 'it', 'was', 'a', 'large', 'and', 'heavy', 'stone', 'that', 'was', 'currently', 'the', 'object', 'of', 'his', 'attentions', 'and', 'at', 'some', 'point', 'a', 'huge', 'staple', 'had', 'been', 'hammered', 'into', 'it', 'as', 'an', 'anchor', 'for', 'manacles', 'moist', 'sat', 'down', 'facing', 'the', 'wall', 'gripped', 'the', 'iron', 'ring', 'in', 'both', 'hands', 'braced', 'his', 'legs', 'against', 'the', 'stones', 'on', 'either', 'side', 'and', 
'heaved', 'his', 'shoulders', 'caught', 'fire', 'and', 'a', 'red', 'mist', 'filled', 'his', 'vision', 'but', 'the', 'block', 'slid', 'out', 'with', 'a', 'faint', 'and', 'inappropriate', 'tinkling', 'noise', 'moist', 'managed', 'to', 'ease', 'it', 'away', 'from', 'the', 'hole', 'and', 'peered', 'inside', 'at', 'the', 'far', 'end', 'was', 'another', 'block', 'and', 'the', 'mortar', 'around', 'it', 'looked', 'suspiciously', 'strong', 'and', 'fresh', 'just', 'in', 'front', 'of', 'it', 'was', 'a', 'new', 'spoon', 'it', 'was', 'shiny', 'as', 'he', 'studied', 'it', 'he', 'heard', 'the', 'clapping', 'behind', 'him', 'he', 'turned', 'his', 'head', 'tendons', 'twanging', 'a', 'little', 'riff', 'of', 'agony', 'and', 'saw', 'several', 'of', 'the', 'wardens', 'watching', 'him', 'through', 'the', 'bars', 'well', 'done', 'mr', 'spangler', 'said', 'one', 'of', 'them', 'ron', 'here', 'owes', 'me', 'five', 'dollars', 'i', 'told', 'him', 'you', 'were', 'a', 'sticker', 'hes', 'a', 'sticker', 'i', 'said', 'you', 'set', 'this', 'up', 'did', 'you', 'mr', 'wilkinson', 'said', 'moist', 'weakly', 'watching', 'the', 'glint', 'of', 'light', 'on', 'the', 'spoon', 'oh', 'not', 'us', 'sir', 'lord', 'vetinaris', 'orders', 'he', 'insists', 'that', 'all', 'condemned', 'prisoners', 'should', 'be', 'offered', 'the', 'prospect', 'of', 'freedom', 'freedom', 'but', 'theres', 'a', 'damn', 'great', 'stone', 'through', 'there', 'yes', 'there', 'is', 'that', 'sir', 'yes', 'there', 'is', 'that', 'said', 'the', 'warden', 'its', 'only', 'the', 'prospect', 'you', 'see', 'not', 'actual', 'free', 'freedom', 'as', 'such', 'hah', 'thatd', 'be', 'a', 'bit', 'daft', 'eh', 'i', 'suppose', 'so', 'yes', 'said', 'moist', 'he', 'didnt', 'say', 'you', 'bastards', 'the', 'wardens', 'had', 'treated', 'him', 'quite', 'civilly', 'these', 'past', 'six', 'weeks', 'and', 'he', 'made', 'a', 'point', 'of', 'getting', 'on', 'with', 'people', 'he', 'was', 'very', 'very', 'good', 'at', 'it', 'people', 'skills', 'were', 'part', 'of', 
'his', 'stockintrade', 'they', 'were', 'nearly', 'the', 'whole', 'of', 'it', 'besides', 'these', 'people', 'had', 'big', 'sticks', 'so', 'speaking', 'carefully', 'he', 'added', 'some', 'people', 'might', 'consider', 'this', 'cruel', 'mr', 'wilkinson', 'yes', 'sir', 'we', 'asked', 'him', 'about', 'that', 'sir', 'but', 'he', 'said', 'no', 'it', 'wasnt', 'he', 'said', 'it', 'providedhis', 'forehead', 'wrinkled', 'occyoupayshunall', 'therrappy', 'healthy', 'exercise', 'prevented', 'moping', 'and', 'offered', 'that', 'greatest', 'of', 'all', 'treasures', 'which', 'is', 'hope', 'sir', 'hope', 'muttered', 'moist', 'glumly', 'not', 'upset', 'are', 'you', 'sir', 'upset', 'why', 'should', 'i', 'be', 'upset', 'mr', 'wilkinson', 'only', 'the', 'last', 'bloke', 'we', 'had', 'in', 'this', 'cell', 'he', 'managed', 'to', 'get', 'down', 'that', 'drain', 'sir', 'very', 'small', 'man', 'very', 'agile']

Your turn:

What would the code look like now to count the number of times the word “sir” appears?

Counting unique words

Yet something else we can count is the number of unique words in the text. The simplest way to do this is to turn our list of words into a set and see how many elements this set contains:

len(set(final_words))
292
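
To see why this works, here is a minimal sketch on a small, made-up list (sample_words is a hypothetical name, not an object from this lesson). A set keeps a single copy of each element, so its length is the number of distinct words:

```python
# Hypothetical sample list (not the lesson's final_words object)
sample_words = ["the", "spoon", "was", "the", "new", "spoon"]

# A set keeps a single copy of each element
unique_words = set(sample_words)

n_total = len(sample_words)    # all words, duplicates included
n_unique = len(unique_words)   # distinct words only

print(n_total)   # 6
print(n_unique)  # 4
```

Note that a set is unordered, so it tells you how many distinct words there are, but not in which order they appeared.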

Extracting characters from strings

Indexing

Let’s go back to our text. Remember that we have this object text, which is a string:

type(text)
str

You can extract characters from strings by indexing.

Indexing in Python is done with square brackets and starts at 0 (the first element has index 0). This means that we can extract the first character with:

print(text[0])
T

Your turn:

How would you index the 4th element? Try it out. It should return “y”.

You can extract the last element with a negative index (this time, the counting starts at -1 rather than 0):

print(text[-1])

We don’t see any visible output here because the last character is the special character \n, which encodes a line break. You can see it when you don’t use the print function (print renders such characters as what they represent instead of showing their escaped form):

text[-1]
'\n'
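
As a small illustration of positive and negative indexing, here is a sketch on a short, made-up string (sample is a hypothetical name, chosen to end with a line break like our text). The built-in repr function returns the same escaped form that the interpreter shows when print is not used:

```python
# Hypothetical sample string ending with a line break, like our text
sample = "Very agile.\n"

first = sample[0]         # positive indices count from the start, from 0
last = sample[-1]         # negative indices count from the end, from -1
before_last = sample[-2]  # the character just before the line break

print(first)        # V
print(repr(last))   # '\n'
print(before_last)  # .

# A negative index -k points at the same character as len(sample) - k
print(sample[-1] == sample[len(sample) - 1])  # True
```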

Your turn:

Question 1:
How would you get the last letter of the text?

Question 2:
How would you index the 13th element from the end (remember that the final line break counts as an element)? Give it a try. You should get “V”.

Slicing

You can also extract multiple contiguous elements with a slice. A slice is also defined with square brackets, but this time you add a colon in it. Left of the colon is the start of the slice and right of the colon is the end of the slice.

In Python, the left element of a slice is included, but the right element is excluded.

First, let’s omit both indices on either side of the colon:

print(text[:])
They say that the prospect of being hanged in the morning concentrates a man's mind wonderfully; unfortunately, what the mind inevitably concentrates on is that, in the morning, it will be in a body that is going to be hanged.
The man going to be hanged had been named Moist von Lipwig by doting if unwise parents, but he was not going to embarrass the name, insofar as that was still possible, by being hung under it. To the world in general, and particularly on that bit of it known as the death warrant, he was Alfred Spangler.
And he took a more positive approach to the situation and had concentrated his mind on the prospect of not being hanged in the morning, and, most particularly, on the prospect of removing all the crumbling mortar from around a stone in his cell wall with a spoon. So far the work had taken him five weeks and reduced the spoon to something like a nail file. Fortunately, no one ever came to change the bedding here, or else they would have discovered the world's heaviest mattress.
It was a large and heavy stone that was currently the object of his attentions, and, at some point, a huge staple had been hammered into it as an anchor for manacles.
Moist sat down facing the wall, gripped the iron ring in both hands, braced his legs against the stones on either side, and heaved.
His shoulders caught fire, and a red mist filled his vision, but the block slid out with a faint and inappropriate tinkling noise. Moist managed to ease it away from the hole and peered inside.
At the far end was another block, and the mortar around it looked suspiciously strong and fresh.
Just in front of it was a new spoon. It was shiny.
As he studied it, he heard the clapping behind him. He turned his head, tendons twanging a little riff of agony, and saw several of the wardens watching him through the bars.
"Well done, Mr. Spangler!" said one of them. "Ron here owes me five dollars! I told him you were a sticker!! 'He's a sticker,' I said!"
"You set this up, did you, Mr. Wilkinson?" said Moist weakly, watching the glint of light on the spoon.
"Oh, not us, sir. Lord Vetinari's orders. He insists that all condemned prisoners should be offered the prospect of freedom."
"Freedom? But there's a damn great stone through there!"
"Yes, there is that, sir, yes, there is that," said the warden. "It's only the prospect, you see. Not actual free freedom as such. Hah, that'd be a bit daft, eh?"
"I suppose so, yes," said Moist. He didn't say "you bastards." The wardens had treated him quite civilly these past six weeks, and he made a point of getting on with people. He was very, very good at it. People skills were part of his stock-in-trade; they were nearly the whole of it.
Besides, these people had big sticks. So, speaking carefully, he added: "Some people might consider this cruel, Mr. Wilkinson."
"Yes, sir, we asked him about that, sir, but he said no, it wasn't. He said it provided"--his forehead wrinkled "--occ-you-pay-shun-all ther-rap-py, healthy exercise, prevented moping, and offered that greatest of all treasures, which is Hope, sir."
"Hope," muttered Moist glumly.
"Not upset, are you, sir?"
"Upset? Why should I be upset, Mr. Wilkinson?"
"Only the last bloke we had in this cell, he managed to get down that drain, sir. Very small man. Very agile."

This returns the full text. This is because an omitted slice boundary takes a default value: an omitted left boundary defaults to the very beginning of the object you are slicing, and an omitted right boundary defaults to its very end.

We can test that we indeed get the full text by comparing it to the non-sliced version of text:

text[:] == text
True

Now, let’s slice the first 10 elements of text:

print(text[:10])
They say t

Let’s explain this code a bit:

We want our slice to start at the beginning of the text, so we are omitting that boundary (we could also use 0 left of the colon).

Because indexing starts at 0, the element at index 10 is actually not “t”, but the following “h”. The slice stops at “t” rather than “h” because the right boundary of a slice is excluded. A handy consequence is that text[:10] contains exactly 10 characters (indices 0 to 9).
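
The boundary rules can be sketched on a shorter, made-up string (sample is a hypothetical name, not an object from this lesson). The left boundary is included, the right boundary is excluded, and an omitted boundary extends to that end of the string:

```python
# Hypothetical sample string (the first three words of our text)
sample = "They say that"

start = sample[:4]    # indices 0 to 3 (index 4 is excluded)
middle = sample[5:8]  # indices 5 to 7 (index 8 is excluded)
end = sample[9:]      # index 9 to the end of the string

print(start)   # They
print(middle)  # say
print(end)     # that

# Because the right boundary is excluded, sample[:n] has n characters
print(len(sample[:10]))  # 10

# Splitting at the same index on both sides loses nothing
print(sample[:4] + sample[4:] == sample)  # True
```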

Your turn:

Question 1:
Try to write some code that will return “prospect”.

Question 2:
Now, remember how we created the words object earlier? Try to use it to get the same result.

Striding

A last way to extract characters out of a string is to use strides. A stride is defined with square brackets and 3 values separated by colons. The first value is the left boundary (included), the second value is the right boundary (excluded), and the third value is the step. By default (if omitted), the step is 1.
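
Here is a minimal sketch of the three values on a made-up string (letters is a hypothetical name, not an object from this lesson), using a step of 2:

```python
# Hypothetical sample string: indices 0 to 9
letters = "abcdefghij"

every_other = letters[0:10:2]  # indices 0, 2, 4, 6, 8
subset = letters[1:8:2]        # indices 1, 3, 5, 7 (index 8 is excluded)
default_step = letters[2:6:1]  # a step of 1 is the same as a plain slice

print(every_other)   # acegi
print(subset)        # bdfh
print(default_step)  # cdef
print(default_step == letters[2:6])  # True
```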

Your turn:

Question 1:
What do you think that text[::] would return?

Question 2:
How would you test it?

Question 3:
How would you get every 3rd character of the whole text?

Now, a fun one: the step can also take a negative value. With -1, we get the text backward! This is because the minus sign means that we step from the end toward the beginning, and 1 means that we take every character:

print(text[::-1])

".eliga yreV .nam llams yreV .ris ,niard taht nwod teg ot deganam eh ,llec siht ni dah ew ekolb tsal eht ylnO"
"?nosnikliW .rM ,tespu eb I dluohs yhW ?tespU"
"?ris ,uoy era ,tespu toN"
.ylmulg tsioM derettum ",epoH"
".ris ,epoH si hcihw ,serusaert lla fo tsetaerg taht dereffo dna ,gnipom detneverp ,esicrexe yhtlaeh ,yp-par-reht lla-nuhs-yap-uoy-cco--" delknirw daeherof sih--"dedivorp ti dias eH .t'nsaw ti ,on dias eh tub ,ris ,taht tuoba mih deksa ew ,ris ,seY"
".nosnikliW .rM ,leurc siht redisnoc thgim elpoep emoS" :dedda eh ,ylluferac gnikaeps ,oS .skcits gib dah elpoep eseht ,sediseB
.ti fo elohw eht ylraen erew yeht ;edart-ni-kcots sih fo trap erew slliks elpoeP .ti ta doog yrev ,yrev saw eH .elpoep htiw no gnitteg fo tniop a edam eh dna ,skeew xis tsap eseht yllivic etiuq mih detaert dah snedraw ehT ".sdratsab uoy" yas t'ndid eH .tsioM dias ",sey ,os esoppus I"
"?he ,tfad tib a eb d'taht ,haH .hcus sa modeerf eerf lautca toN .ees uoy ,tcepsorp eht ylno s'tI" .nedraw eht dias ",taht si ereht ,sey ,ris ,taht si ereht ,seY"
"!ereht hguorht enots taerg nmad a s'ereht tuB ?modeerF"
".modeerf fo tcepsorp eht dereffo eb dluohs srenosirp denmednoc lla taht stsisni eH .sredro s'iraniteV droL .ris ,su ton ,hO"
.noops eht no thgil fo tnilg eht gnihctaw ,ylkaew tsioM dias "?nosnikliW .rM ,uoy did ,pu siht tes uoY"
"!dias I ',rekcits a s'eH' !!rekcits a erew uoy mih dlot I !srallod evif em sewo ereh noR" .meht fo eno dias "!relgnapS .rM ,enod lleW"
.srab eht hguorht mih gnihctaw snedraw eht fo lareves was dna ,ynoga fo ffir elttil a gnignawt snodnet ,daeh sih denrut eH .mih dniheb gnippalc eht draeh eh ,ti deiduts eh sA
.ynihs saw tI .noops wen a saw ti fo tnorf ni tsuJ
.hserf dna gnorts ylsuoicipsus dekool ti dnuora ratrom eht dna ,kcolb rehtona saw dne raf eht tA
.edisni dereep dna eloh eht morf yawa ti esae ot deganam tsioM .esion gnilknit etairporppani dna tniaf a htiw tuo dils kcolb eht tub ,noisiv sih dellif tsim der a dna ,erif thguac sredluohs siH
.devaeh dna ,edis rehtie no senots eht tsniaga sgel sih decarb ,sdnah htob ni gnir nori eht deppirg ,llaw eht gnicaf nwod tas tsioM
.selcanam rof rohcna na sa ti otni deremmah neeb dah elpats eguh a ,tniop emos ta ,dna ,snoitnetta sih fo tcejbo eht yltnerruc saw taht enots yvaeh dna egral a saw tI
.sserttam tseivaeh s'dlrow eht derevocsid evah dluow yeht esle ro ,ereh gniddeb eht egnahc ot emac reve eno on ,yletanutroF .elif lian a ekil gnihtemos ot noops eht decuder dna skeew evif mih nekat dah krow eht raf oS .noops a htiw llaw llec sih ni enots a dnuora morf ratrom gnilbmurc eht lla gnivomer fo tcepsorp eht no ,ylralucitrap tsom ,dna ,gninrom eht ni degnah gnieb ton fo tcepsorp eht no dnim sih detartnecnoc dah dna noitautis eht ot hcaorppa evitisop erom a koot eh dnA
.relgnapS derflA saw eh ,tnarraw htaed eht sa nwonk ti fo tib taht no ylralucitrap dna ,lareneg ni dlrow eht oT .ti rednu gnuh gnieb yb ,elbissop llits saw taht sa rafosni ,eman eht ssarrabme ot gniog ton saw eh tub ,stnerap esiwnu fi gnitod yb giwpiL nov tsioM deman neeb dah degnah eb ot gniog nam ehT
.degnah eb ot gniog si taht ydob a ni eb lliw ti ,gninrom eht ni ,taht si no setartnecnoc ylbativeni dnim eht tahw ,yletanutrofnu ;yllufrednow dnim s'nam a setartnecnoc gninrom eht ni degnah gnieb fo tcepsorp eht taht yas yehT
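
The same idea can be sketched on a single made-up word (word is a hypothetical name, not an object from this lesson). With a negative step, the slice walks from right to left, so the omitted boundaries default to the end and the beginning respectively:

```python
# Hypothetical sample word: indices 0 to 4
word = "spoon"

backward = word[::-1]              # indices 4, 3, 2, 1, 0
every_other_backward = word[::-2]  # indices 4, 2, 0

print(backward)              # noops
print(every_other_backward)  # nos

# Reversing twice gives back the original string
print(backward[::-1] == word)  # True
```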

If you want to go much beyond this (e.g. sentence tokenization, natural language processing (NLP), etc.), you will probably want to install a dedicated library such as NLTK or spaCy.