Searching & manipulating text

Author: Marie-Hélène Burle

Searching inside files with grep

cd ~/Desktop/data-shell/writing
more haiku.txt

First, let’s search for text in files:

grep not haiku.txt     # let's find all lines that contain the word 'not'
grep day haiku.txt     # now search for word 'day'
grep -w day haiku.txt     # search for 'day' as a whole word (not 'today', etc.)
grep -w today haiku.txt   # search for 'today'
grep -w Today haiku.txt   # search for 'Today'
grep -i -w today haiku.txt       # both upper and lower case 'today'
grep -n -i -w today haiku.txt    # -n prints the line numbers of matching lines
grep -n -i -w -v the haiku.txt   # -v inverts the match: print lines that do not contain 'the'
man grep

More than two arguments to grep:

grep pattern file1 file2 file3   # all arguments after the first one are assumed to be filenames
grep pattern *.txt   # the last argument will expand to the list of *.txt files
Here is the full text of haiku.txt:

The Tao that is seen
Is not the true Tao, until
You bring fresh toner.

With searching comes loss
and the presence of absence:
"My Thesis" not found.

Yesterday it worked.
Today it is not working.
Software is like that.

From the above text, contained in the file haiku.txt, which command would result in the following output:

and the presence of absence:
  1. grep of haiku.txt
  2. grep -E of haiku.txt
  3. grep -w of haiku.txt

Here is a video on this topic.

Text manipulation

(This example was kindly provided by John Simpson.)

In this section we’ll use two tools for text manipulation: sed and tr. Our goal is to calculate the frequency of all dictionary words in the novel “The Invisible Man” by H. G. Wells (public domain). First, let’s apply our knowledge of grep to this text:

$ cd ~/Desktop/data-shell
$ ls   # shows wellsInvisibleMan.txt
$ wc wellsInvisibleMan.txt                          # number of lines, words, characters
$ grep invisible wellsInvisibleMan.txt              # see the invisible man
$ grep invisible wellsInvisibleMan.txt | wc -l      # returns 60; adding -w to grep gives the same count
$ grep -i invisible wellsInvisibleMan.txt | wc -l   # returns 176 (includes: invisible Invisible INVISIBLE)

Let’s sidetrack for a second and see how we can use the “stream editor” sed:

$ sed 's/[iI]nvisible/supervisible/g' wellsInvisibleMan.txt > visibleMan.txt   # make him visible
$ cat wellsInvisibleMan.txt | sed 's/[iI]nvisible/supervisible/g' > visibleMan.txt   # this also works (standard input)
$ grep supervisible visibleMan.txt   # see what happened to the now visible man
$ grep -i invisible visibleMan.txt   # see what was not converted
$ man sed

Now let’s remove punctuation from the original file using the tr (translate) command:

$ cat wellsInvisibleMan.txt | tr -d "[:punct:]" > invisibleNoPunct.txt    # tr only takes standard input
$ tail wellsInvisibleMan.txt
$ tail invisibleNoPunct.txt

Next, convert all upper case to lower case:

$ cat invisibleNoPunct.txt | tr '[:upper:]' '[:lower:]' > invisibleClean.txt
$ tail invisibleClean.txt

Next, replace spaces with newlines:

$ cat invisibleClean.txt | sed 's/ /\'$'\n/g' > invisibleList.txt   # $'\n' is bash's notation for a newline (explained below)
$ more invisibleList.txt
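Here $'\n' is bash's ANSI-C quoting for a newline character; the odd-looking quoting splices it into the sed expression, because some versions of sed (e.g. on macOS) do not understand \n in the replacement text. If you find that construct cryptic, a simpler alternative is tr, which can translate spaces into newlines directly:

$ cat invisibleClean.txt | tr ' ' '\n' > invisibleList.txt   # same result with tr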

Next, remove empty lines:

$ sed '/^$/d' invisibleList.txt  > invisibleCompact.txt

Next, sort the list alphabetically and count each word’s occurrences (uniq -c also collapses the duplicates):

$ cat invisibleCompact.txt | sort | uniq -c > invisibleWords.txt
$ more invisibleWords.txt

Next, sort the list by frequency, most frequent first:

$ cat invisibleWords.txt | sort -gr > invisibleFrequencyList.txt   # -g sorts by numerical value, -r reverses the order (see 'man sort')
$ more invisibleFrequencyList.txt

You can watch a video for this topic after the workshop.

Quick reference:

sed 's/pattern1/pattern2/' filename    # replace the first occurrence of pattern1 with pattern2 on each line
sed 's/pattern1/pattern2/g' filename   # same, but replace all occurrences on each line
sed 's|pattern1|pattern2|g' filename   # same; any character can serve as the delimiter instead of /

cat wellsInvisibleMan.txt | tr -d "[:punct:]" > invisibleNoPunct.txt       # remove punctuation; tr only reads standard input
cat invisibleNoPunct.txt | tr '[:upper:]' '[:lower:]' > invisibleClean.txt # convert all upper case to lower case
cat invisibleClean.txt | sed 's/ /\'$'\n/g' > invisibleList.txt            # replace spaces with newlines;
                                                                           # $'\n' is bash's notation for a newline
sed '/^$/d' invisibleList.txt > invisibleCompact.txt    # remove empty lines
cat invisibleCompact.txt | sort | uniq -c > invisibleWords.txt   # sort alphabetically and count each word's occurrences
cat invisibleWords.txt | sort -gr > invisibleFrequencyList.txt   # sort by frequency, most frequent first
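For reference, the whole workflow can also be chained into a single pipeline (using the tr alternative for the space-to-newline step):

cat wellsInvisibleMan.txt | tr -d "[:punct:]" | tr '[:upper:]' '[:lower:]' | tr ' ' '\n' | \
    sed '/^$/d' | sort | uniq -c | sort -gr > invisibleFrequencyList.txt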

Write a script that takes an English-language file and prints the list of its 100 most common words, along with their counts. Hint: use the workflow from the text manipulation video. Finally, convert this script into a bash function. (No need to type any answer.)
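One possible solution sketch, reusing the pipeline above (the script and function names are our own choices; head -100 keeps the top 100 lines):

#!/bin/bash
# usage: ./wordFrequency.sh someFile.txt
# print the 100 most common words in the file, each preceded by its count
cat "$1" | tr -d "[:punct:]" | tr '[:upper:]' '[:lower:]' | tr ' ' '\n' | \
    sed '/^$/d' | sort | uniq -c | sort -gr | head -100

The same workflow as a bash function (add it to ~/.bashrc to make it permanent):

function wordFrequency() {
    cat "$1" | tr -d "[:punct:]" | tr '[:upper:]' '[:lower:]' | tr ' ' '\n' | \
        sed '/^$/d' | sort | uniq -c | sort -gr | head -100
}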

Column-based text processing with the awk scripting language

cd ~/Desktop/data-shell/writing
cat haiku.txt   # 11 lines

You can define inline awk scripts with braces enclosed in single quotation marks:

awk '{print $1}' haiku.txt       # $1 is the first field (word) in each line => processing columns
awk '{print $0}' haiku.txt       # $0 is the whole line
awk '{print}' haiku.txt          # the whole line is the default action
awk -Fa '{print $1}' haiku.txt   # can specify another separator with -F ("a" in this case)

You can use multiple commands inside your awk script:

echo Hello Tom > hello.txt
echo Hello John >> hello.txt
awk '{$2="Adam"; print $0}' hello.txt   # we replaced the second word in each line with "Adam"

The most common use of awk is to post-process the output of other commands:

/bin/ps aux    # display all running processes as multi-column output
/bin/ps aux | awk '{print $2 " " $11}'   # print only the process number and the command
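awk can also aggregate values across lines: a block prefixed with the keyword END runs once after the last line has been read. As a small sketch, here is one way to total the resident memory (RSS, column 6 of ps aux, in kB) used by all processes:

/bin/ps aux | awk '{sum += $6} END {print "total RSS: " sum " kB"}'   # the non-numeric header line adds 0, so it is harmless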

Awk also takes patterns in addition to scripts:

awk '/Yesterday|Today/' haiku.txt   # print the lines that contain the words Yesterday or Today

You can then act on these patterns: if the pattern matches a line, the script runs on that line:

awk '/Yesterday|Today/{print $3}' haiku.txt
awk '/Yesterday|Today/' haiku.txt | awk '{print $3}'   # same as previous line

Awk has a number of built-in variables; the most commonly used is NR, the number of the current record (line):

awk 'NR>1' haiku.txt             # if the line number NR is greater than 1, print the line (the default action), i.e. skip the first line
awk 'NR>1{print $0}' haiku.txt   # the previous command written out in full
awk 'NR>1 && NR < 5' haiku.txt   # print lines 2-4
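NR is not the only built-in variable; NF, for instance, holds the number of fields in the current line:

awk '{print NF, $0}' haiku.txt   # prefix each line with its word count (blank lines give 0)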

Exercise: write an awk script to process cities.csv, printing only the town/city names and their populations, and store the output in a separate file populations.csv. Try to do everything in a single one-line command.
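A possible sketch, assuming (check the actual file!) that cities.csv is comma-separated with a header line, the name in the first column, and the population in the second; adjust -F and the field numbers to the real layout:

awk -F, 'NR>1 {print $1 "," $2}' cities.csv > populations.csv   # assumed layout: name,population,...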

Quick reference:

ls -l | awk 'NR>3 {print $5 "  " $9}'   # print 5th and 9th columns starting with line 4
awk 'NR>1 && NR < 5' haiku.txt          # print lines 2-4
awk '/Yesterday|Today/' haiku.txt       # print lines that contain Yesterday or Today

Write a one-line command that finds the 5 largest files in the current directory and prints only their names and sizes in human-readable format (bytes, kB, MB, GB, …) in decreasing order of file size. Hint: use find, xargs, and awk.
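One possible solution sketch, assuming GNU-style sort and filenames without spaces (sort -k5,5gr orders by the size column, and the awk script converts bytes to human-readable units):

find . -maxdepth 1 -type f | xargs ls -l | sort -k5,5gr | head -5 | \
    awk '{size = $5; unit = "bytes"
          if (size >= 1024^3) {size /= 1024^3; unit = "GB"} \
          else if (size >= 1024^2) {size /= 1024^2; unit = "MB"} \
          else if (size >= 1024) {size /= 1024; unit = "kB"}
          printf "%s\t%.1f %s\n", $9, size, unit}'   # column 5 is the size, column 9 the name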

Let’s study these commands together:

$ source ~/projects/def-sponsor00/shared/fzf/.fzf.bash   # load the fzf fuzzy finder into the current shell
$ kill -9 `/bin/ps aux | fzf | awk '{print $2}'`         # the backticks substitute the pipeline's output (a process
                                                         # ID picked interactively with fzf) into the kill command

Here is a video on this topic.