RegEx and the Command Line

class: center, middle

# "O, this way madness lies":

## Regular expressions on the command line

### https://bit.ly/2K6pvyF

---

# grep

- `grep` stands for global regular expressions. This is a powerful command to search through multiple files. It literally processes a text line by line and prints the results that match the regular expression.

- Syntax:

`$ grep [OPTIONS] PATTERN [FILE...]`

---
      ## Example 1: a simple word match in a file

- Download the corpus of text files [here](https://www.dropbox.com/sh/m2i9j6x34znhmxa/AAC05iy6z-3b2iCHmNU75xEya?dl=0).

- On the command line, use `cd` to navigate to "corpus" (i.e., the folder you just downloaded).

- Let's say we're interested in finding all instances of the word "worried" in Melville's *Moby-Dick*:
      --

`$ grep "worried" moby-dick.txt`

- Recall the syntax:

`$ grep [OPTIONS] PATTERN [FILE...]`

- The pattern is "worried" (all regular expressions are placed within quote marks), and the file is "moby-dick.txt". We can see the result is a hapax legomenon (a word that only occurs once).

- If we would like to find the exact line number in the txt file, we can add an -n option to the argument:

`$ grep -n "worried" moby-dick.txt`
      ---

## Example 2: another simple word match in a file

- To see how many times "Ahab" appears, start with: `grep "ahab" moby-dick.txt`

- Obviously "Ahab" is always capitalised in the book, since he is a main character. So we will enter an `-i` option in our argument to make our search case insensitive: `grep -i "ahab" moby-dick.txt`

- Use the `--color` option to make the results more obvious: `$ grep -i --color "ahab" moby-dick.txt`

- To count the results, just use a count option: `$ grep -i --count "ahab" moby-dick.txt`

- Now if I also want to match exactly the possessive uses of "Ahab", then I need another regex; in bash shell we invoke a powerful command: `egrep` (extended regex), which allows for more complicated pattern matches.

`$ egrep -i --count "\bahab('?\w?)" moby-dick.txt`

---

## Example 3.1: a (simple?) word match on multiple files

- Say I am interested in a very Melvillean word, such as "mad", or even a topic of "madness". The regex is apparently pretty straightforward: `$ grep --color "mad" *.txt`

- Notice how the file path begins with the regex `*` (the wildcard). This tells the machine to search for zero or more text files that have a .txt extension. It quickly accesses multiple files txt files in a corpus. But there's a problem with the results. Any guesses?

--
      - Say I do want to include results for words containing "mad" as well, like, say, "madness" and "maddening".
      --

`$ grep -i --color "\bmad\w*" *.txt`

- Notice here that we have included a word boundary regex (`\b`) at the beginning that makes sure the word begins with "mad", and at the end we match zero or more words (`\w*`).

- Interesting, but if you look closely, this expression finds any words that start with the characters "mad", which in our corpus includes "made" and "madame"---not relevant to us. So we can use regex to limit our search to just the word "mad":
      --

`$ egrep -i --color "\bmad\b" *.txt`

- Now use `--count` to generate a list of the instances.
      ---
      ## Example 3.2: a (simple?) word match on multiple files

- Fine; but we still might want other madness words with the root "mad", but we do not want the extra words like "madame". How would that look?

`$ egrep --color "\bmad\b|\bmadness|\bmaddening|\bmadman" *.txt`

- Wait?? There's still a problem?

- Yes: several of these files have a tricky word, "mad'st". We can introduce a strict option for egrep, `-w`, which tells the grep to only match the words in the pattern. In this case, this saves us a little trouble because we now know what words we want.

`$ egrep -w --color "mad\b|madness|maddening|madman" *.txt`

---

## Who is the maddest of them all?
      --

battle-pieces.txt:5

beale-natural-history-sperm-whale.txt:0

chapman-iliad.txt:25

chapman-odyssey.txt:21

confidence-man.txt:11

hawthorne-and-his-mosses.txt:2

israel-potter.txt:10

milton-complete.txt:13

moby-dick.txt:51

piazza-tales.txt:8

pierre.txt:27

**shakespeare-complete.txt:334**
      ---

## Example 4.1: Piping your results: simply

- Sometimes you want to put your results into a new file. The piping function on the command line is `>`. The resulting word results (with line numbers) could be rendered as such:

`$ egrep -w -n "mad\b|madness|maddening|madman" *.txt > mad-results.txt`

---
      ## Example 4.2: Piping your search results with `sed`

- The next major use of regex on the command line is `sed` (literally "stream editor"), a simple search and replace function that can be used for multiple files. Recall our problem with the word "mad'st". With `sed` we can regularise the word over multiple files using this syntax:

`$ sed -e "s/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g" file.txt`

- Translated into our example:

`$ sed -e "s/mad'st/madest/g" chapman*.txt > chapman_clean.txt`

- I *would not* recommend doing that, but it's just an example. Another challenge, which I might recommend: suppose we wanted to regularise all of the past tense verbs with 'd? How would you search and replace all "'d "s with "ed"s?

`$ sed -e "s/\'d/ed/g" *.txt > past-tense.txt`

- Better yet, use `grep` to check that nothing went wrong:

`$ grep -i --color "\'d\b" past-tense.txt`

---

## Bonus Example 5! using `grep` within `xmllint` to search xml files

- `xmllint` is a command line tool that allows several options for searching xml files. It can also function as an XPath query engine for those without a good editor like oXygen.

- The syntax is pretty similar to `grep`: here we are using an XPath expression an option following the xmllint invocation that greps out the text of the xml node:

`$ xmllint --xpath "//line/text()" ozymandias.xml > ozymandias.txt`

- This just prints out all `<line>` elements in the document. You can now use `grep` to search just through the texts of only the `<line>` elements (which is important in an xml file because you will have searches that match strings of nodes other than lines): `$ grep "\b([A-z]{4})\b" ozymandias.txt` finds all four-letter words that are lines, for example.

---

## Last word about search and replace and xml

- Now suppose you want to move toward a more compliant TEI file for ozymandias.xml. For one file you would use a regex to search and replace each `<line>` element with a `<l>` (and then wrap that in an `<lg>`). But you might want to use the command line for multiple files that have the same issue. The syntax for that would be with `sed`: type in `$ sed -e "s/line/l/g" ozymandias.xml` and you'll see a nice rewriting of the line elements.

- **But** this should **not** be done, generally: most xml find-and-replace should be done with XPath and XSLT. There may be some cases when an element name (say, `<w>`) is not going to match with other characters in the file, so it might be worthwhile to use `sed` to your advantage. However, it's best to learn more about xml technologies XPath and XSLT! Any XSLT programmer uses a lot of regular expressions to navigate patterns in files.

---
      class: center, middle

### Thanks! Please do be in touch with any questions: christopher.ohge@sas.ac.uk