Working with digital texts: regular expressions on the command line

Working with my colleagues Marty Steer and Jonathan Blaney (Digital Humanities at the School of Advanced Study), I have been teaching sessions on “Working with Texts” for the LAHP Digital Humanities short course this month. We have provided a range of introductions to markdown, html, xml, TEI-XML, document ontologies, the command line, and regular expressions. The second session in particular inspired a slideshow tutorial on using regular expressions on the command line for literary texts. I also wrote almost the entire slideshow in markdown, my favorite language for authoring web content.

In the spirit of the Institute of English Studies’s recently launched Literary Appreciation Lab, I am beginning to share (and reflect on) recent adventures in teaching and researching digital approaches to literature. To that end:

Click here to access the tutorial slides for using regular expressions on the command line.

Also, for more information on the “Remark js” html template for creating markdown slideshows, check out Remark’s GitHub repo. It is a fairly straightforward, and powerful, way to author html slideshows. Why create html slideshows? I find it easier to have all of my presentation material in a web browser: for example, it is easier to open hyperlinks from the slideshow and navigate by simply switching browser tabs. Using html also gives the author more control of the document, since no proprietary software (such as PowerPoint or Keynote) is involved. Any decent web browser will open the file with ease.


Now, for those of you who read through the slideshow, I will offer some brief and humble reflections.

First, I could not have learned regular expressions (regex) without the help of regex101. It allows you not only to test your regex but also to evaluate its efficiency (i.e., the number of steps taken to find a pattern match). I am aware that my regex (as reflected in my slideshow) could possibly be more efficient, but sometimes you stick with what works. With regex101, I can still test out alternatives for efficiency and consider adjustments.

Secondly, as with many adventures in computer-assisted literary analyses, I was interested to find the sheer preponderance of “madness” words in Shakespeare’s plays. I expected the words to come up more often in Melville––and even in Melville it is interesting to note that his “maddest” book, Pierre, has only about 27 occurrences of “madness” words compared to 51 in Moby-Dick. But 334 instances in Shakespeare––that is indeed more than I had expected. This led me to a 2016 article by Will Tosh that was published on the British Library’s web site.

Several “conceptual” investigations into madness in Shakespeare already exist, but not statistical ones. Even if this activity does not lead to a critical breakthrough, you can see how quickly, with the simple aid of grep searches through multiple files on the command line, I was able to turn the computational results into questions about several literary texts. These are questions that will be resolved with careful reading. I find that these are useful teaching examples of how close and so-called “distant” reading can be complementary activities.

In a forthcoming series of posts, I will reflect on my recent work with a team of researchers at Melville’s Marginalia Online on digital text analyses of Melville’s reading of Homer, Shakespeare, and Milton. The essays presenting these analyses have just been published as a special issue on “Melville’s Hand” in Leviathan: A Journal of Melville Studies. In those essays we show how close and distant reading can be accomplished and communicated to a wide audience, and in each piece there are plenty of new discoveries about Melville’s sources and their relation to his own work. (More on that soon.)
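For readers who want to try the kind of multi-file grep count described above without opening the slides, here is a minimal sketch. It is not the regex from the slideshow: the pattern, the `count_matches` helper, and the two tiny sample files are all assumptions made for illustration, and the counts they produce have nothing to do with the figures quoted above. It relies on the `-o` and `-w` options, which are available in both GNU and BSD grep but are not required by POSIX.

```shell
#!/bin/sh
# Hypothetical sample files stand in for the full texts, which are not included here.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

printf '%s\n' \
  'Though this be madness, yet there is method in it.' \
  'It shall be so: Madness in great ones must not unwatched go.' \
  > "$workdir/hamlet.txt"
printf '%s\n' \
  'All my means are sane, my motive and my object mad.' \
  > "$workdir/moby-dick.txt"

# Assumed pattern: "mad" plus a few optional suffixes. Adjust to taste.
pattern='mad(ness|ly|man|men)?'

# Count occurrences (not matching lines) of a pattern in one file.
#   -o  print each match on its own line, so wc -l counts matches
#   -i  ignore case ("Madness" as well as "madness")
#   -w  whole words only, so "made" is not counted
#   -E  extended regular expressions
count_matches() {
  grep -o -i -w -E "$1" "$2" | wc -l | tr -d ' '
}

# Report one count per file.
for f in "$workdir"/*.txt; do
  printf '%s: %s\n' "$(basename "$f")" "$(count_matches "$pattern" "$f")"
done
```

Note that `grep -c` would count matching lines rather than individual matches, which is why the sketch pipes `grep -o` into `wc -l` instead.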

Finally, I give a bonus example on using xmllint to conduct XPath and grep-style searches. I’ll admit to being new to using the command line to deal with xml files (as I have always used the oXygen text editor’s outstanding XPath and XSLT features), but I can also see the value of using the command line for xml searches and transformations. For example, if I cannot access oXygen, I will still be able to query or modify files. Also, if I am not so well-practiced with XQuery or something like the eXist database, I can use xmllint on the command line to query multiple files. I have also recently become interested in XMLStarlet, which is a popular alternative for querying and modifying xml on the command line, but that will have to wait for another day.
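As a rough illustration of what an XPath query with xmllint can look like, here is a small sketch. It is not the example from the slides: the sample file, its contents, and the `count_mad_lines` helper are hypothetical, and it assumes that xmllint (part of libxml2) is installed. Because TEI files declare a namespace, a bare `//l` would match nothing, so the sketch uses `local-name()` as a simple workaround.

```shell
#!/bin/sh
# Hypothetical TEI-like sample; a real edition would be far larger.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

cat > "$workdir/sample.xml" <<'EOF'
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <sp><speaker>Polonius</speaker>
      <l>Though this be madness, yet there is method in it.</l>
    </sp>
    <sp><speaker>Hamlet</speaker>
      <l>I am but mad north-north-west.</l>
      <l>I know a hawk from a handsaw.</l>
    </sp>
  </body></text>
</TEI>
EOF

# Count the <l> elements whose text contains "mad".
# contains() is case-sensitive, so "Mad" would not be counted here.
count_mad_lines() {
  xmllint --xpath 'count(//*[local-name()="l"][contains(., "mad")])' "$1"
}

# Run the same query over every xml file in the directory.
for f in "$workdir"/*.xml; do
  printf '%s: %s\n' "$(basename "$f")" "$(count_mad_lines "$f")"
done
```

The same loop works unchanged over a directory of many files, which is the main attraction of querying from the command line.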
