Installing R and RStudio

Before the session, make sure to download the R software package from http://www.r-project.org/.

Then download the latest version of RStudio at https://www.rstudio.com.

Part I: A Brief Intro to Programming and R

What exactly is programming?

Every computer program is a series of instructions—a sequence of separate, small commands. The art of programming is to take a general idea and break it apart into separate steps. (This may be just as important as learning the rules and syntax of a particular language.)

Programming (or code) consists of either imperative or declarative style. R uses imperative style, meaning it strings together instructions to the computer. (Declarative style involves telling the computer what the end result should be, like HTML code.) There are many subdivisions of imperative style, but the primary concern for beginning programmers should be procedural style: that is, describing the steps for achieving a task.

Each step/instruction is a statement—words, numbers, or equations that express a thought.

Why are there so many languages?

The central processing unit (CPU) of the computer does not understand any of them! The CPU only takes in machine code, which runs directly on the computer’s hardware. Machine code is basically unreadable, though: it’s a series of tiny numerical operations.

Several popular computer programming languages are actually translations of machine code; they are literally interpreted—as opposed to a compiled—languages. They bridge the gap between machine code/computer hardware and the human programmer. What we call our source code is our set of statements in our preferred language that interacts with machine code.

Source code is simply written in plain text in a text editor. Do not use a word processor.

The computer understands source code by the file extension. For us, that means the “.R” extension (and the R notebook is “.Rmd”).

While you do not need a special program to write code, it is usually a good idea to use an IDE (integrated development environment) to help you. Many people (like me) use the oXygen IDE for editing XML documents and creating transformations with XSLT. Python users often use Spyder, Pycharm, or Anaconda Jupyter Notebooks.

For R, use RStudio (more on that in a moment).

Why are we using R?

Short answer: because I like R. I have learned some Python and JavaScript, too, but for some reason R worked better for me. This suggests an important takeaway from this session: there is no single language that is better than any other. What you chose to work with will depend on what materials you are working on, what level of comfort you have with a given language, and what kinds of outputs you would like from your code.

For example, if I am primarily interested in text-based edition projects, I would be wise to work mostly with XML technologies: TEI-XML, XPath, XSLT, XQuery, XProc, just to name a few. However, I have seen people use Python and JavaScript to transform XML. While I would advocate XSLT for such an operation, it is better for you to use your preferred language to get things done.

XML, R, Python, and JavaScript are all open-source languages. R and Python are quite similar and both can basically perform the same tasks.

That said, R does have some distinct advantages:

  • The visualisation libraries are excellent. With RStudio, reporting your results is almost instantaneous.

  • R Markdown makes it easy to integrate the code and the results in a web browser.

  • It is a functional language (meaning almost everything is accomplished through functions), which works well for some.

  • It was built by data scientists and linguists, so it is optimal for doing statistical analyses with structured text and data sets. (Python is probably better for more general purpose and natural language processing tasks.)

  • It tends to be used by academics and researchers, so it works well with research questions (Python, on the other hand, is used more in private development.)

  • Strong user community: In CRAN (R’s open-source repository), around 12000 packages are currently available.

Some notable disadvantages:

  • It is slow with large data sets (in which case you might want a server instantiation of RStudio).

  • Its programming paradigm is somewhat non-standard. It is based on the commercial S and S-plus languages).

The R Environment (for those who are new to R)

When you first launch R, you will see a console:

R image

R image

This interface allows you to run R commands just like you would run commands on a Bash scripting shell.

When you open this file in RStudio, the command line interface (labeled “Console”) is below the editing window. When you run a code block in the editing window, you will see the results appear in the Console below.

About R Markdown

This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code. The results can also be published as an HTML file.

A quick example: let’s do some math. Say I am making a travel budget, and I want to add the cost of hotel and flight prices for a trip to Seattle. The flight is £550 and the hotel price per night is £133. R can do the work for you.

Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Cmd+Shift+Enter. (On a Windows machine you would press [Windows button]+Shift+Enter.)

550 + 133
## [1] 683

R can make all kinds of calculations, so if you want to get the total cost of a five-day trip to Seattle, you can add an operator for multiplication.

550 + 133 * 5
## [1] 1215

R is an object-oriented programming language, so practically everything can be stored as an R object. The commands are either expressions or assignments. An expression (when written as a command) is evaluated and printed (unless specifically made invisible), but the value is lost. An assignment also evaluates an expression but the the value is stored in a variable (and the result is not automatically printed).

To make our calculations effective, we need to store these kinds of calculations in variables. Variables can be assigned with either <- or =. Let’s do that by comparing the price of a 5-day trip to Seattle to a 7-day trip to Paris.

sea.trip.v <- 550 + 133 * 5

paris.trip.v <- 110 + 90 * 7

sea.trip.v < paris.trip.v
## [1] FALSE

What is the most expensive trip?

Guess we should go to Paris. What if I just want to do both?

sea.trip.v + paris.trip.v
## [1] 1955

Suppose further that I wanted to add in an optional 3-day trip to New York City. I want to see which trip would be more expensive if I were to take two out of the three options.

nyc.trip.v <- 335 + 175 * 3

sea.and.nyc <- sea.trip.v + nyc.trip.v 

sea.and.paris <- sea.trip.v + paris.trip.v

paris.and.nyc <- paris.trip.v + nyc.trip.v

Above you can see how powerful even simple R programming can be: you can store mathemtical operations in named variables and use those variables to work with other variables (this becomes very important in word frequency calculations). You can also plot the results for quick assessment.

trips <- c(sea.and.nyc, sea.and.paris, paris.and.nyc)
barplot(trips, ylab = "Cost of each trip", names.arg = c("Seattle and NYC", "Seattle and Paris", "Paris and NYC"))

You see how this works, and how quickly one can store variables for even practical questions.

Reading Data in R

There are other important kinds of R data formats that you should know. The first is a vector, which is a single variable consisting of an ordered collection numbers and/or words. An easy way to create a vector is to use the c command, which basically means “combine.”

v1 <- c("i", "wait", "with", "bated", "breath")

# confirm the value of the variable by running v1
v1
## [1] "i"      "wait"   "with"   "bated"  "breath"
# identify a specific value by indicating it is brackets
v1[4]
## [1] "bated"

Get used to the functions that help you understand R: ? and example().

?c

example(c, echo = FALSE) # change the echo value to TRUE to get the results

The c function is widely used, but it is really only useful for creating small data sets. Many of you will probably want to load text files.

Jeff Rydberg-Cox provides some helpful tips for preparing data for R processing:

  • Download the text(s) from a source repository.

  • Remove extraneous material from the text(s).

  • Transform the text(s) to answer your research questions.

The best way to load text files is with the scan function. (The other important function is read.table, which handles csv files.) First, download a text file of Dickens’s Great Expectations onto your working directory.

dickens.v <- scan("great-expectations.txt", what="character", sep="\n", encoding = "UTF-8")

You have now loaded Great Expectations into a variable called dickens.v.

With the text loaded, you can now run quick statistical operations, such as the number of lines and word frequencies.

length(dickens.v) # this finds the number of lines in the book
## [1] 3913
dickens.lower.v <- tolower(dickens.v) # this makes the whole text lowercased, and each sentence is now in a list

dickens.words <- strsplit(dickens.lower.v, "\\W") # strsplit is very important: it takes each sentence in the lowercased words vector and puts each word in a list by finding non-words, i.e., word boundaries
# each list item (word) corresponds to an element of the book's sentences that has been split. In the simplest case, x is a single character string, and strsplit outputs a one-item list.

class(dickens.words) # the class function tells you the data structure of your variable
## [1] "list"
dickens.words.v <- unlist(dickens.words)

class(dickens.words.v)
## [1] "character"
dickens.words.v[1:20] # find the first 20 ten words in Great Expectations
##  [1] "chapter"   "i"         "my"        "father"    "s"        
##  [6] "family"    "name"      "being"     "pirrip"    ""         
## [11] "and"       "my"        "christian" "name"      "philip"   
## [16] ""          "my"        "infant"    "tongue"    "could"

Did you notice the “\W” in the strsplit argument? What is that again? Regex! Notice that in R you need to use another backslash to indicate a character escape.

Also, did you notice the blank result on the 10th word? This requires a little clean-up step.

not.blanks.v <- which(dickens.words.v!="")

dickens.words.v <- dickens.words.v[not.blanks.v]

Extra white spaces often cause problems for text analysis.

dickens.words.v[1:20]
##  [1] "chapter"   "i"         "my"        "father"    "s"        
##  [6] "family"    "name"      "being"     "pirrip"    "and"      
## [11] "my"        "christian" "name"      "philip"    "my"       
## [16] "infant"    "tongue"    "could"     "make"      "of"

Voila! We might want to examine how many times the third result “father” occurs (the fourth word result, and one that will probably be an important word in this book).

length(dickens.words.v[which(dickens.words.v=="father")])
## [1] 69

Or produce a list of all unique words.

unique(sort(dickens.words.v, decreasing = FALSE))[1:50]
##  [1] "0037m"      "0072m"      "0082m"      "0132m"      "0189m"     
##  [6] "0223m"      "0242m"      "0245m"      "0279m"      "0295m"     
## [11] "0335m"      "0348m"      "0393m"      "0399m"      "1"         
## [16] "2"          "a"          "aback"      "abandoned"  "abased"    
## [21] "abashed"    "abbey"      "abear"      "abel"       "aberdeen"  
## [26] "aberration" "abet"       "abeyance"   "abhorrence" "abhorrent" 
## [31] "abhorring"  "abide"      "abided"     "abilities"  "ability"   
## [36] "abject"     "able"       "ablutions"  "aboard"     "abode"     
## [41] "abominate"  "about"      "above"      "abraham"    "abreast"   
## [46] "abroad"     "abrupt"     "abruptness" "absence"    "absent"

Here we find another problem: we find in our unique word list some odd non-words such as “0037m.” We should strip those out.

Exercise 1

Create a regular expression to remove those non-words in dickens.words.v? Remember that you use two backslashes (//) for character escape. For more information on using regex in R, RStudio has a helpful cheat sheet.

# the pattern for finding those numerical bits is "\\d+[a-z]*", or (more precisely) "\\d{4}\\w*"
# but b/c I am not sure if all of those instances start with four numbers I will keep the expression flexible
# also the regex above takes out other random numbers -- do you understand why? what's the difference between "\\d+[a-z]*" and "\\d+[a-z]"
# one way to handle this is to use gsub, a common regex-related function in R 

dickens.words.clean.v <- gsub("\\d+[a-z]*", "", dickens.words.v) 
# gsub is a function for stripping out stuff by means of a global regex search and replace: 
# after entering gsub, within the parentheses you enter the regex in quotes, then its replacement (which in this case is a blank), then the vector to which it applies

Now let’s re-run that not.blanks vector to strip out the blank you just added.

not.blanks.v <- which(dickens.words.clean.v!="")

dickens.words.clean.v <- dickens.words.clean.v[not.blanks.v]

unique(sort(dickens.words.clean.v, decreasing = FALSE))[1:50]
##  [1] "a"           "aback"       "abandoned"   "abased"      "abashed"    
##  [6] "abbey"       "abear"       "abel"        "aberdeen"    "aberration" 
## [11] "abet"        "abeyance"    "abhorrence"  "abhorrent"   "abhorring"  
## [16] "abide"       "abided"      "abilities"   "ability"     "abject"     
## [21] "able"        "ablutions"   "aboard"      "abode"       "abominate"  
## [26] "about"       "above"       "abraham"     "abreast"     "abroad"     
## [31] "abrupt"      "abruptness"  "absence"     "absent"      "absolute"   
## [36] "absolutely"  "absolve"     "absorbed"    "abstinence"  "abstraction"
## [41] "abstracts"   "absurd"      "absurdest"   "absurdly"    "abundance"  
## [46] "abundantly"  "abyss"       "accept"      "acceptable"  "acceptance"

Returning to basic functions, now that we have done some more clean-up: how many unique words are in the book?

length(unique(dickens.words.clean.v))
## [1] 10744

Divide this by the amount of words in the whole book to calculate vocabulary density ratios.

unique.words <- length(unique(dickens.words.clean.v))

total.words <- length(dickens.words.clean.v)

unique.words/total.words 
## [1] 0.05688569
# you could do this quicker this way: 
# length(unique(dickens.words.v))/length(dickens.words.v) 
# BUT it's good to get into the practice of storing results in variables

That’s actually a fairly small density number, 5.7% (Moby-Dick by comparison is about 8%).

The other important data structures are tables and data frames. These are probably the most useful for sophisticated analyses, because it renders the data in a table that is very similar to a spreadsheet. It is important to input your data in an Excel or Google docs spreadsheet and then export that data into a comma separated value (.csv) or tab separated value (.tsv) file. Many of the tidytext operations work with data frames, as we’ll see later.

Flow control

Flow control involves stochastic simulation, or repetitive operations or pattern recognition—two of the more important reasons why we use programming languages. The most common form of stochastic simulation is the for() loop. This is a logical command with the following syntax

for (name in vector) {[enter commands]}

This sets a variable called name equal to each of the elements of the vector in sequence. Each of these iterates over the command as many times as is necessary.

A simple example is the Fibonacci sequence. A for() loop can automatically generate the first 20 Fibonacci numbers.

Fibonacci <- numeric(20) # creates a vector called Fibonacci that consists of 20 numeric vectors

Fibonacci[1] <- Fibonacci[2] <- 1 # defines the first and second elements as a value of 1. This is important b/c the first two Fibonacci numbers are 1, and the next (3rd) number is gained by adding the first two

for (i in 3:20) Fibonacci[i] <- Fibonacci[i - 2] + Fibonacci[i - 1] # says for each instance of the 3rd through 20th Fibonacci numbers, take the first element - 2 and add that to the next element - 1
Fibonacci
##  [1]    1    1    2    3    5    8   13   21   34   55   89  144  233  377
## [15]  610  987 1597 2584 4181 6765

There is another important component to flow control: the conditional. In programming this takes the form of if() statements.

Syntax

if (condition) {commands when TRUE}

if (condition) {commands when TRUE} else {commands when FALSE}

We will not have time to go into details regarding these operations, but it is important to recognize them when you are reading or modifying someone else’s code.

Now, using what we know about regular expressions and flow control, let’s have look at a for() loop that Matthew Jockers uses in Chapter 4 of his Text Analysis for Students of Literature. It’s a fairly complicated but useful way of breaking up a novel text into chapters for comparative analysis. Let’s return to Dickens.

text.v <- scan("great-expectations.txt", what="character", sep="\n", encoding = "UTF-8")
text2.v <- gsub("\\d+[a-z]*", "", text.v)
not.blanks.v <- which(text2.v!="")
clean.text.v <- text2.v[not.blanks.v]

start.v <- which(clean.text.v == "Chapter I")
end.v <- which(clean.text.v == "THE END")
novel.lines.v <- clean.text.v[start.v:end.v]
chap.positions.v <- grep("^Chapter \\w", novel.lines.v)

novel.lines.v[chap.positions.v]
##  [1] "Chapter I"       "Chapter II"      "Chapter III"    
##  [4] "Chapter IV"      "Chapter V"       "Chapter VI"     
##  [7] "Chapter VII"     "Chapter VIII"    "Chapter IX"     
## [10] "Chapter X"       "Chapter XI"      "Chapter XII"    
## [13] "Chapter XIII"    "Chapter XIV"     "Chapter XV"     
## [16] "Chapter XVI"     "Chapter XVII"    "Chapter XVIII"  
## [19] "Chapter XIX"     "Chapter XX"      "Chapter XXI"    
## [22] "Chapter XXII"    "Chapter XXIII"   "Chapter XXIV"   
## [25] "Chapter XXV"     "Chapter XXVI"    "Chapter XXVII"  
## [28] "Chapter XXVIII"  "Chapter XXIX"    "Chapter XXX"    
## [31] "Chapter XXXI"    "Chapter XXXII"   "Chapter XXXIII" 
## [34] "Chapter XXXIV"   "Chapter XXXV"    "Chapter XXXVI"  
## [37] "Chapter XXXVII"  "Chapter XXXVIII" "Chapter XXXIX"  
## [40] "Chapter XL"      "Chapter XLI"     "Chapter XLII"   
## [43] "Chapter XLIII"   "Chapter XLIV"    "Chapter XLV"    
## [46] "Chapter XLVI"    "Chapter XLVII"   "Chapter XLVIII" 
## [49] "Chapter XLIX"    "Chapter L"       "Chapter LI"     
## [52] "Chapter LII"     "Chapter LIII"    "Chapter LIV"    
## [55] "Chapter LV"      "Chapter LVI"     "Chapter LVII"   
## [58] "Chapter LVIII"   "Chapter LIX"
chapter.raws.l <- list()
chapter.freqs.l <- list()

for(i in 1:length(chap.positions.v)){
    if(i != length(chap.positions.v)){
chapter.title <- novel.lines.v[chap.positions.v[i]]
start <- chap.positions.v[i]+1
end <- chap.positions.v[i+1]-1
chapter.lines.v <- novel.lines.v[start:end]
chapter.words.v <- tolower(paste(chapter.lines.v, collapse=" ")) 
chapter.words.l <- strsplit(chapter.words.v, "\\W") 
chapter.word.v <- unlist(chapter.words.l)
chapter.word.v <- chapter.word.v[which(chapter.word.v!="")] 
chapter.freqs.t <- table(chapter.word.v) 
chapter.raws.l[[chapter.title]] <- chapter.freqs.t 
chapter.freqs.t.rel <- 100*(chapter.freqs.t/sum(chapter.freqs.t)) 
chapter.freqs.l[[chapter.title]] <- chapter.freqs.t.rel
    } 
}

chapter.freqs.l[1]
## $`Chapter I`
## chapter.word.v
##            a        about        above      abraham    aforesaid 
##   2.53505933   0.16181230   0.10787487   0.05393743   0.05393743 
##       afraid        after    afternoon        again          ain 
##   0.05393743   0.26968716   0.05393743   0.37756203   0.05393743 
##        alder    alexander          all        alone      alonger 
##   0.05393743   0.05393743   0.26968716   0.05393743   0.05393743 
##         also           am        among           an          and 
##   0.21574973   0.21574973   0.26968716   0.26968716   4.85436893 
##        angel        angry        ankle      another          any 
##   0.05393743   0.05393743   0.05393743   0.05393743   0.16181230 
##         arms     arranged           as           at          ate 
##   0.21574973   0.05393743   0.86299892   0.75512406   0.10787487 
##      attempt       attend    authority         back        backs 
##   0.05393743   0.05393743   0.05393743   0.16181230   0.05393743 
##  bartholomew      battery           be       beacon          bed 
##   0.05393743   0.10787487   0.37756203   0.05393743   0.05393743 
##         been       before    beginning        being       belief 
##   0.16181230   0.10787487   0.05393743   0.10787487   0.05393743 
##      believe       beside         best       beyond         bits 
##   0.05393743   0.05393743   0.05393743   0.10787487   0.05393743 
##        black   blacksmith        bleak         body         born 
##   0.26968716   0.16181230   0.05393743   0.05393743   0.05393743 
##         both        bound          boy     brambles        bread 
##   0.32362460   0.05393743   0.16181230   0.05393743   0.10787487 
##       briars        bring        broad       broken     brothers 
##   0.05393743   0.16181230   0.10787487   0.10787487   0.05393743 
##       bundle       buried          but           by       called 
##   0.05393743   0.10787487   0.21574973   0.48543689   0.10787487 
##         came         cask       cattle   cautiously      certain 
##   0.26968716   0.05393743   0.10787487   0.05393743   0.05393743 
##       chains    character    chattered       cheeks     childish 
##   0.05393743   0.05393743   0.05393743   0.05393743   0.05393743 
##     children         chin    christian       church   churchyard 
##   0.05393743   0.05393743   0.05393743   0.37756203   0.10787487 
##     clasping       closer      clothes        clung       coarse 
##   0.05393743   0.05393743   0.05393743   0.05393743   0.05393743 
##         cold         come  comfortable       coming   comparison 
##   0.05393743   0.16181230   0.05393743   0.05393743   0.05393743 
##   concerning   conclusion  considering        could       couldn 
##   0.05393743   0.05393743   0.05393743   0.32362460   0.05393743 
##      country        creep        cried          cry       crying 
##   0.05393743   0.10787487   0.05393743   0.05393743   0.05393743 
##        curly          cut            d       danger         dare 
##   0.05393743   0.16181230   0.05393743   0.05393743   0.10787487 
##         dark       darkly         darn         days         dead 
##   0.10787487   0.05393743   0.05393743   0.10787487   0.21574973 
##        dense      derived        devil   difficulty        dikes 
##   0.05393743   0.05393743   0.05393743   0.05393743   0.05393743 
##          dip      distant           do          dog          don 
##   0.05393743   0.05393743   0.16181230   0.05393743   0.16181230 
##         door         down         draw   dreadfully         drew 
##   0.05393743   0.32362460   0.05393743   0.05393743   0.05393743 
##      dropped         each        early    earnestly          eat 
##   0.05393743   0.10787487   0.16181230   0.05393743   0.05393743 
##         edge          eel           eh       either      eluding 
##   0.05393743   0.05393743   0.05393743   0.05393743   0.05393743 
##           em      emptied  entertained      evening  exceedingly 
##   0.10787487   0.05393743   0.05393743   0.05393743   0.05393743 
##    existence    explained     explicit    expressed         eyes 
##   0.05393743   0.05393743   0.05393743   0.05393743   0.10787487 
##         face         fail      faintly     faltered       family 
##   0.05393743   0.05393743   0.05393743   0.05393743   0.10787487 
##      fancies          far          fat       father      fearful 
##   0.05393743   0.05393743   0.10787487   0.26968716   0.10787487 
##      feeding         feet         file         find        first 
##   0.05393743   0.10787487   0.21574973   0.05393743   0.10787487 
##         five         flat       flints         food         foot 
##   0.10787487   0.16181230   0.05393743   0.05393743   0.05393743 
##          for        found     freckled   frightened         frog 
##   0.53937433   0.05393743   0.05393743   0.10787487   0.05393743 
##         from       gained      gargery        gates         gave 
##   0.48543689   0.05393743   0.16181230   0.05393743   0.21574973 
##         gaze    georgiana          get      getting       gibbet 
##   0.05393743   0.16181230   0.37756203   0.05393743   0.05393743 
##        giddy         give     glancing       glared           go 
##   0.05393743   0.16181230   0.05393743   0.05393743   0.16181230 
##        going          goo         good          got        grave 
##   0.10787487   0.05393743   0.05393743   0.10787487   0.05393743 
##       graves         gray        great      greater        green 
##   0.10787487   0.05393743   0.16181230   0.05393743   0.05393743 
##      growing      growled           ha          had         hair 
##   0.05393743   0.05393743   0.10787487   0.26968716   0.05393743 
##         half          han        hands      hanging         hard 
##   0.10787487   0.10787487   0.16181230   0.05393743   0.05393743 
##      harming          has          hat         have       having 
##   0.05393743   0.05393743   0.05393743   0.10787487   0.05393743 
##           he         head        heads        hears        heart 
##   1.61812298   0.26968716   0.05393743   0.05393743   0.16181230 
##        heavy        heels         held   helplessly helplessness 
##   0.05393743   0.05393743   0.16181230   0.05393743   0.05393743 
##         here          hid         hide         high          him 
##   0.16181230   0.05393743   0.05393743   0.05393743   0.80906149 
##      himself          his         hold         home         hook 
##   0.43149946   1.18662352   0.21574973   0.16181230   0.05393743 
##         hope   horizontal     horrible          how       hugged 
##   0.05393743   0.10787487   0.05393743   0.05393743   0.05393743 
##      hugging            i         idea     identity           if 
##   0.05393743   2.85868393   0.05393743   0.05393743   0.37756203 
##   impression           in     indebted       infant  inscription 
##   0.05393743   1.29449838   0.05393743   0.10787487   0.05393743 
##       inside   intermixed  intersected         into         iron 
##   0.05393743   0.05393743   0.05393743   0.16181230   0.05393743 
##           is           it          its       itself          joe 
##   0.26968716   0.75512406   0.05393743   0.10787487   0.16181230 
##       jumped         just         keep      keeping       kindly 
##   0.05393743   0.16181230   0.21574973   0.05393743   0.10787487 
##         know         lair        lamed         late       latter 
##   0.10787487   0.05393743   0.05393743   0.10787487   0.05393743 
##          lay       leaden          leg         legs          let 
##   0.05393743   0.05393743   0.16181230   0.10787487   0.21574973 
##      letters      licking         life      lifting         like 
##   0.05393743   0.05393743   0.05393743   0.05393743   0.16181230 
##     likeness       limped      limping         line        lines 
##   0.05393743   0.10787487   0.05393743   0.16181230   0.10787487 
##         lips       little         live        liver       living 
##   0.05393743   0.21574973   0.26968716   0.16181230   0.05393743 
##           ll         lock         long       longer         look 
##   0.10787487   0.05393743   0.21574973   0.05393743   0.10787487 
##       looked       lookee      looking         lord          lot 
##   0.37756203   0.10787487   0.10787487   0.05393743   0.05393743 
##          low     lozenges         made         make          man 
##   0.16181230   0.05393743   0.21574973   0.16181230   1.34843581 
##      married        marsh      marshes       matter          may 
##   0.05393743   0.05393743   0.16181230   0.05393743   0.32362460 
##           me    memorable       memory         mile        miles 
##   1.72599784   0.05393743   0.05393743   0.05393743   0.05393743 
##         mind         mine       moment         more      morning 
##   0.10787487   0.16181230   0.10787487   0.26968716   0.10787487 
##       morrow         most       mother       mounds        mouth 
##   0.05393743   0.21574973   0.26968716   0.10787487   0.05393743 
##          mrs         much          mud     muttered           my 
##   0.10787487   0.05393743   0.05393743   0.05393743   1.34843581 
##       myself         name        names         near       nearly 
##   0.16181230   0.21574973   0.05393743   0.05393743   0.05393743 
##         neat      nettles        never        night           no 
##   0.05393743   0.16181230   0.21574973   0.05393743   0.16181230 
##        noise          nor          not      nothing          now 
##   0.05393743   0.05393743   0.10787487   0.10787487   0.32362460 
##       numbed          odd           of          off           oh 
##   0.05393743   0.05393743   2.04962244   0.05393743   0.10787487 
##          old           on         once          one         only 
##   0.10787487   0.75512406   0.10787487   0.05393743   0.05393743 
##         open           or        other          our         ours 
##   0.05393743   0.53937433   0.05393743   0.05393743   0.05393743 
##          out         over    overgrown          own       parish 
##   0.37756203   0.48543689   0.05393743   0.05393743   0.10787487 
##   partickler       partly    pecooliar       people      perhaps 
##   0.05393743   0.10787487   0.05393743   0.05393743   0.10787487 
##       person       philip  photographs      picking        piece 
##   0.10787487   0.10787487   0.05393743   0.10787487   0.05393743 
##         pint          pip       pirate       pirrip        place 
##   0.05393743   0.37756203   0.10787487   0.16181230   0.10787487 
##       places      pleaded       please      pockets      pointed 
##   0.05393743   0.05393743   0.05393743   0.10787487   0.05393743 
##         pole     pollards        porch     position   powerfully 
##   0.05393743   0.05393743   0.05393743   0.05393743   0.05393743 
##         pray      present    presently     prospect         pull 
##   0.05393743   0.05393743   0.05393743   0.05393743   0.05393743 
##      pursued          put     question        quick          rag 
##   0.05393743   0.05393743   0.10787487   0.05393743   0.05393743 
##        rains          ran   ravenously          raw           re 
##   0.05393743   0.05393743   0.05393743   0.05393743   0.10787487 
##          red    regarding  religiously     remember        river 
##   0.05393743   0.05393743   0.05393743   0.10787487   0.32362460 
##      roasted        roger         roll        round          row 
##   0.05393743   0.05393743   0.05393743   0.16181230   0.10787487 
##          run      rushing            s       sacred         safe 
##   0.05393743   0.05393743   0.32362460   0.05393743   0.05393743 
##         said      sailors         same       savage          saw 
##   0.86299892   0.05393743   0.05393743   0.05393743   0.37756203 
##          say    scattered          sea       seated       secret 
##   0.21574973   0.05393743   0.10787487   0.05393743   0.05393743 
##          see       seemed        seems         seen       seized 
##   0.05393743   0.05393743   0.05393743   0.05393743   0.05393743 
##        sense          set      several        shake        shall 
##   0.05393743   0.05393743   0.05393743   0.05393743   0.10787487 
##        shape     shivered      shivers        shoes        shore 
##   0.05393743   0.05393743   0.05393743   0.05393743   0.05393743 
##        short     shoulder      shouldn         show   shuddering 
##   0.05393743   0.10787487   0.05393743   0.05393743   0.05393743 
##         sick       sickly         side         sign        signs 
##   0.05393743   0.05393743   0.05393743   0.05393743   0.05393743 
##          sir       sister          sky        small    smothered 
##   0.70118662   0.10787487   0.05393743   0.10787487   0.05393743 
##           so       soaked       softly         some         sore 
##   0.59331176   0.05393743   0.05393743   0.05393743   0.05393743 
##        speak       square     standing      staring      started 
##   0.05393743   0.05393743   0.05393743   0.05393743   0.10787487 
##        state      steeple      steered     stepping        stiff 
##   0.05393743   0.05393743   0.05393743   0.05393743   0.05393743 
##        still        stone       stones      stopped     stopping 
##   0.10787487   0.10787487   0.10787487   0.10787487   0.05393743 
##        stout   stretching       strike       strong     struggle 
##   0.05393743   0.05393743   0.05393743   0.10787487   0.05393743 
##        stung         such       sudden      sumever     supposin 
##   0.05393743   0.10787487   0.05393743   0.05393743   0.05393743 
##            t        taken         tear        teeth         tell 
##   0.53937433   0.05393743   0.05393743   0.05393743   0.05393743 
##        terms     terrible       terror         than         that 
##   0.05393743   0.10787487   0.05393743   0.05393743   1.72599784 
##          the        their         them         then        there 
##   4.80043150   0.43149946   0.21574973   0.21574973   0.26968716 
##        these         they        thing       things        think 
##   0.10787487   0.21574973   0.05393743   0.10787487   0.10787487 
##         this       though      thought  threatening       throat 
##   0.26968716   0.05393743   0.10787487   0.05393743   0.10787487 
##         tide         tied      tighter       tilted         time 
##   0.05393743   0.05393743   0.05393743   0.32362460   0.16181230 
##        times      timidly           to       tobias     together 
##   0.05393743   0.05393743   2.31930960   0.05393743   0.05393743 
##    tombstone   tombstones       tongue          too         took 
##   0.21574973   0.05393743   0.05393743   0.10787487   0.10787487 
##          top         tore         torn      towards        trees 
##   0.05393743   0.05393743   0.05393743   0.26968716   0.05393743 
##    trembling   tremendous     trousers       trying         tuck 
##   0.05393743   0.05393743   0.05393743   0.05393743   0.05393743 
##         turn       turned      turning       twenty        twist 
##   0.10787487   0.10787487   0.05393743   0.05393743   0.05393743 
##          two         ugly        under   undersized    undertook 
##   0.05393743   0.05393743   0.05393743   0.05393743   0.05393743 
##     unhooped    universal unreasonably           up         upon 
##   0.05393743   0.05393743   0.05393743   0.37756203   0.16181230 
##      upright       upside           us          use           ve 
##   0.16181230   0.05393743   0.10787487   0.05393743   0.05393743 
##      village        vivid        voice         wain         wall 
##   0.05393743   0.05393743   0.05393743   0.05393743   0.10787487 
##         warm          was        water          way  weathercock 
##   0.05393743   1.13268608   0.05393743   0.21574973   0.05393743 
##         went         were         wery          wet         what 
##   0.05393743   0.75512406   0.05393743   0.05393743   0.37756203 
##         when        where      whether        which        while 
##   0.37756203   0.16181230   0.10787487   0.37756203   0.05393743 
##          who        whose         wife   wilderness         will 
##   0.26968716   0.10787487   0.16181230   0.05393743   0.05393743 
##         wind         wish         with       within      without 
##   0.05393743   0.05393743   0.91693635   0.05393743   0.05393743 
##      wittles     wondered         word        words        would 
##   0.16181230   0.05393743   0.05393743   0.10787487   0.21574973 
##       wouldn        wound           ye        years          yes 
##   0.05393743   0.05393743   0.05393743   0.05393743   0.16181230 
##          yet       yonder          you        young         your 
##   0.05393743   0.05393743   1.56418554   0.64724919   0.59331176
length(chapter.freqs.l)[1]
## [1] 58

Suppose I wanted to get all relative frequencies of the word “father” in each chapter.

father.freqs <- lapply(chapter.freqs.l, '[', 'father')

father.freqs
## $`Chapter I`
##    father 
## 0.2696872 
## 
## $`Chapter II`
## <NA> 
##   NA 
## 
## $`Chapter III`
## <NA> 
##   NA 
## 
## $`Chapter IV`
## <NA> 
##   NA 
## 
## $`Chapter V`
## <NA> 
##   NA 
## 
## $`Chapter VI`
## <NA> 
##   NA 
## 
## $`Chapter VII`
##    father 
## 0.1470588 
## 
## $`Chapter VIII`
## <NA> 
##   NA 
## 
## $`Chapter IX`
##     father 
## 0.03696858 
## 
## $`Chapter X`
## <NA> 
##   NA 
## 
## $`Chapter XI`
## <NA> 
##   NA 
## 
## $`Chapter XII`
## <NA> 
##   NA 
## 
## $`Chapter XIII`
## <NA> 
##   NA 
## 
## $`Chapter XIV`
## <NA> 
##   NA 
## 
## $`Chapter XV`
## <NA> 
##   NA 
## 
## $`Chapter XVI`
## <NA> 
##   NA 
## 
## $`Chapter XVII`
## <NA> 
##   NA 
## 
## $`Chapter XVIII`
## <NA> 
##   NA 
## 
## $`Chapter XIX`
## <NA> 
##   NA 
## 
## $`Chapter XX`
##    father 
## 0.0621118 
## 
## $`Chapter XXI`
##    father 
## 0.1105583 
## 
## $`Chapter XXII`
##    father 
## 0.3774335 
## 
## $`Chapter XXIII`
##     father 
## 0.03092146 
## 
## $`Chapter XXIV`
## <NA> 
##   NA 
## 
## $`Chapter XXV`
## <NA> 
##   NA 
## 
## $`Chapter XXVI`
## <NA> 
##   NA 
## 
## $`Chapter XXVII`
##     father 
## 0.06512537 
## 
## $`Chapter XXVIII`
## <NA> 
##   NA 
## 
## $`Chapter XXIX`
##     father 
## 0.02010859 
## 
## $`Chapter XXX`
##    father 
## 0.2646281 
## 
## $`Chapter XXXI`
## <NA> 
##   NA 
## 
## $`Chapter XXXII`
## <NA> 
##   NA 
## 
## $`Chapter XXXIII`
## <NA> 
##   NA 
## 
## $`Chapter XXXIV`
##     father 
## 0.04230118 
## 
## $`Chapter XXXV`
## <NA> 
##   NA 
## 
## $`Chapter XXXVI`
## <NA> 
##   NA 
## 
## $`Chapter XXXVII`
##     father 
## 0.03483107 
## 
## $`Chapter XXXVIII`
## <NA> 
##   NA 
## 
## $`Chapter XXXIX`
##     father 
## 0.02002403 
## 
## $`Chapter XL`
## <NA> 
##   NA 
## 
## $`Chapter XLI`
## <NA> 
##   NA 
## 
## $`Chapter XLII`
## <NA> 
##   NA 
## 
## $`Chapter XLIII`
## <NA> 
##   NA 
## 
## $`Chapter XLIV`
## <NA> 
##   NA 
## 
## $`Chapter XLV`
## <NA> 
##   NA 
## 
## $`Chapter XLVI`
##    father 
## 0.1315789 
## 
## $`Chapter XLVII`
## <NA> 
##   NA 
## 
## $`Chapter XLVIII`
## <NA> 
##   NA 
## 
## $`Chapter XLIX`
## <NA> 
##   NA 
## 
## $`Chapter L`
##   father 
## 0.130719 
## 
## $`Chapter LI`
##    father 
## 0.2979146 
## 
## $`Chapter LII`
## <NA> 
##   NA 
## 
## $`Chapter LIII`
##     father 
## 0.01871958 
## 
## $`Chapter LIV`
## <NA> 
##   NA 
## 
## $`Chapter LV`
##     father 
## 0.03460208 
## 
## $`Chapter LVI`
## <NA> 
##   NA 
## 
## $`Chapter LVII`
## <NA> 
##   NA 
## 
## $`Chapter LVIII`
##     father 
## 0.03207184

You could also use variations of the which function to identify the chapters with the highest and lowest frequencies.

which.max(father.freqs)
## Chapter XXII 
##           22
which.min(father.freqs)
## Chapter LIII 
##           53

Exercise 2

Create a vector that confines your results to only the paragraphs with dialogue.

dialogue.v <- grep('("([^"]|"")*")', novel.lines.v) # grep is another regex function

novel.lines.v[dialogue.v][1:20] # check your work by finding all the dialogue lines in novel.lines.v
##  [1] "I give Pirrip as my father's family name, on the authority of his tombstone and my sister,—Mrs. Joe Gargery, who married the blacksmith. As I never saw my father or my mother, and never saw any likeness of either of them (for their days were long before the days of photographs), my first fancies regarding what they were like were unreasonably derived from their tombstones. The shape of the letters on my father's, gave me an odd idea that he was a square, stout, dark man, with curly black hair. From the character and turn of the inscription, \"Also Georgiana Wife of the Above,\" I drew a childish conclusion that my mother was freckled and sickly. To five little stone lozenges, each about a foot and a half long, which were arranged in a neat row beside their grave, and were sacred to the memory of five little brothers of mine,—who gave up trying to get a living, exceedingly early in that universal struggle,—I am indebted for a belief I religiously entertained that they had all been born on their backs with their hands in their trousers-pockets, and had never taken them out in this state of existence."
##  [2] "\"Hold your noise!\" cried a terrible voice, as a man started up from among the graves at the side of the church porch. \"Keep still, you little devil, or I'll cut your throat!\""                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
##  [3] "\"Oh! Don't cut my throat, sir,\" I pleaded in terror. \"Pray don't do it, sir.\""                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
##  [4] "\"Tell us your name!\" said the man. \"Quick!\""                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
##  [5] "\"Pip, sir.\""                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
##  [6] "\"Once more,\" said the man, staring at me. \"Give it mouth!\""                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
##  [7] "\"Pip. Pip, sir.\""                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
##  [8] "\"Show us where you live,\" said the man. \"Pint out the place!\""                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
##  [9] "\"You young dog,\" said the man, licking his lips, \"what fat cheeks you ha' got.\""                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
## [10] "\"Darn me if I couldn't eat em,\" said the man, with a threatening shake of his head, \"and if I han't half a mind to't!\""                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
## [11] "\"Now lookee here!\" said the man. \"Where's your mother?\""                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
## [12] "\"There, sir!\" said I."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
## [13] "\"There, sir!\" I timidly explained. \"Also Georgiana. That's my mother.\""                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
## [14] "\"Oh!\" said he, coming back. \"And is that your father alonger your mother?\""                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
## [15] "\"Yes, sir,\" said I; \"him too; late of this parish.\""                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
## [16] "\"Ha!\" he muttered then, considering. \"Who d'ye live with,—supposin' you're kindly let to live, which I han't made up my mind about?\""                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
## [17] "\"My sister, sir,—Mrs. Joe Gargery,—wife of Joe Gargery, the blacksmith, sir.\""                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
## [18] "\"Blacksmith, eh?\" said he. And looked down at his leg."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
## [19] "\"Now lookee here,\" he said, \"the question being whether you're to be let to live. You know what a file is?\""                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
## [20] "\"Yes, sir.\""

Bonus Exercise

Modify the for loop from Jockers so that it computes word frequencies only for lines that contain dialogue.

dialogue.chapter.raws.l <- list()
dialogue.chapter.freqs.l <- list()

for(i in 1:length(chap.positions.v)){
  if(i != length(chap.positions.v)){
    chapter.title <- novel.lines.v[chap.positions.v[i]]
    start <- chap.positions.v[i]+1
    end <- chap.positions.v[i+1]-1
    chapter.lines.v <- novel.lines.v[start:end]
    # here is the grep again, pruning the chapter.lines vector down to lines with dialogue
    dialogue.lines.v <- grep('"(.*?)"', chapter.lines.v, value = TRUE)
    # from here on, the steps are the same as in Jockers, applied only to the dialogue lines
    chapter.words.v <- tolower(paste(dialogue.lines.v, collapse=" "))
    chapter.words.l <- strsplit(chapter.words.v, "\\W")
    chapter.word.v <- unlist(chapter.words.l)
    chapter.word.v <- chapter.word.v[which(chapter.word.v!="")]
    chapter.freqs.t <- table(chapter.word.v)
    dialogue.chapter.raws.l[[chapter.title]] <- chapter.freqs.t
    chapter.freqs.t.rel <- 100*(chapter.freqs.t/sum(chapter.freqs.t))
    dialogue.chapter.freqs.l[[chapter.title]] <- chapter.freqs.t.rel
  }
}

dialogue.chapter.freqs.l[1]
## $`Chapter I`
## chapter.word.v
##            a        about        above        again          ain 
##    2.9342723    0.3521127    0.1173709    0.4694836    0.1173709 
##          all        alone      alonger         also           am 
##    0.1173709    0.1173709    0.1173709    0.2347418    0.4694836 
##        among           an          and        angel          any 
##    0.1173709    0.1173709    3.4037559    0.1173709    0.3521127 
##     arranged           as           at          ate      attempt 
##    0.1173709    0.5868545    0.9389671    0.1173709    0.1173709 
##       attend    authority         back        backs      battery 
##    0.1173709    0.1173709    0.1173709    0.1173709    0.1173709 
##           be          bed         been       before        being 
##    0.5868545    0.1173709    0.1173709    0.1173709    0.1173709 
##       belief       beside        black   blacksmith         born 
##    0.1173709    0.1173709    0.1173709    0.3521127    0.1173709 
##         both          boy        bring     brothers          but 
##    0.2347418    0.3521127    0.3521127    0.1173709    0.1173709 
##    character       cheeks     childish       church      clothes 
##    0.1173709    0.1173709    0.1173709    0.1173709    0.1173709 
##        clung         cold  comfortable       coming   comparison 
##    0.1173709    0.1173709    0.1173709    0.1173709    0.1173709 
##   concerning   conclusion  considering        could       couldn 
##    0.1173709    0.1173709    0.1173709    0.1173709    0.1173709 
##        creep        cried        curly          cut            d 
##    0.2347418    0.1173709    0.1173709    0.2347418    0.1173709 
##         dare         dark         darn         days         dead 
##    0.2347418    0.1173709    0.1173709    0.2347418    0.1173709 
##      derived        devil   difficulty           do          dog 
##    0.1173709    0.1173709    0.1173709    0.3521127    0.1173709 
##          don         door         down         draw   dreadfully 
##    0.3521127    0.1173709    0.1173709    0.1173709    0.1173709 
##         drew         each        early          eat          eel 
##    0.1173709    0.1173709    0.2347418    0.1173709    0.1173709 
##           eh       either           em  entertained  exceedingly 
##    0.1173709    0.1173709    0.2347418    0.1173709    0.1173709 
##    existence    explained         fail     faltered       family 
##    0.1173709    0.1173709    0.1173709    0.1173709    0.1173709 
##      fancies          fat       father         file         find 
##    0.1173709    0.1173709    0.4694836    0.3521127    0.1173709 
##        first         five         flat         foot          for 
##    0.1173709    0.2347418    0.1173709    0.1173709    0.3521127 
##     freckled   frightened         frog         from      gargery 
##    0.1173709    0.1173709    0.1173709    0.7042254    0.3521127 
##         gave    georgiana          get      getting        giddy 
##    0.2347418    0.2347418    0.4694836    0.1173709    0.1173709 
##         give     glancing           go          goo         good 
##    0.2347418    0.1173709    0.1173709    0.1173709    0.1173709 
##          got        grave       graves        great           ha 
##    0.1173709    0.1173709    0.1173709    0.1173709    0.2347418 
##          had         hair         half          han        hands 
##    0.2347418    0.1173709    0.2347418    0.2347418    0.2347418 
##         hard      harming          has         have       having 
##    0.1173709    0.1173709    0.1173709    0.1173709    0.1173709 
##           he         head        hears        heart         here 
##    1.2910798    0.2347418    0.1173709    0.3521127    0.2347418 
##          hid         hide          him      himself          his 
##    0.1173709    0.1173709    0.5868545    0.4694836    1.0563380 
##         hold         home          how            i         idea 
##    0.2347418    0.1173709    0.1173709    3.1690141    0.1173709 
##           if           in     indebted  inscription       inside 
##    0.4694836    1.0563380    0.1173709    0.1173709    0.1173709 
##           is           it          joe         keep      keeping 
##    0.5868545    0.7042254    0.3521127    0.2347418    0.1173709 
##       kindly         know         late          leg          let 
##    0.2347418    0.2347418    0.1173709    0.1173709    0.4694836 
##      letters      licking         like     likeness         lips 
##    0.1173709    0.1173709    0.1173709    0.1173709    0.1173709 
##       little         live        liver       living           ll 
##    0.3521127    0.5868545    0.3521127    0.1173709    0.2347418 
##         lock         long       looked       lookee         lord 
##    0.1173709    0.2347418    0.1173709    0.2347418    0.1173709 
##          lot     lozenges         made         make          man 
##    0.1173709    0.1173709    0.1173709    0.1173709    2.1126761 
##      married       matter          may           me       memory 
##    0.1173709    0.1173709    0.7042254    1.7605634    0.1173709 
##         mind         mine       moment         more      morning 
##    0.2347418    0.1173709    0.1173709    0.2347418    0.1173709 
##       morrow       mother        mouth          mrs         much 
##    0.1173709    0.5868545    0.1173709    0.2347418    0.1173709 
##     muttered           my         name         neat        never 
##    0.1173709    1.4084507    0.2347418    0.1173709    0.4694836 
##        night           no        noise          now          odd 
##    0.1173709    0.1173709    0.1173709    0.5868545    0.1173709 
##           of          off           oh          old           on 
##    2.1126761    0.1173709    0.2347418    0.1173709    0.3521127 
##         once         open           or          out         over 
##    0.1173709    0.1173709    0.8215962    0.4694836    0.3521127 
##       parish   partickler    pecooliar      perhaps       person 
##    0.1173709    0.1173709    0.1173709    0.2347418    0.2347418 
##  photographs         pint          pip       pirrip        place 
##    0.1173709    0.1173709    0.3521127    0.1173709    0.1173709 
##      pleaded       please      pockets        porch         pray 
##    0.1173709    0.1173709    0.1173709    0.1173709    0.1173709 
##      present      pursued     question        quick           re 
##    0.1173709    0.1173709    0.1173709    0.1173709    0.2347418 
##    regarding  religiously     remember      roasted          row 
##    0.1173709    0.1173709    0.2347418    0.1173709    0.1173709 
##            s       sacred         safe         said          saw 
##    0.5868545    0.1173709    0.1173709    1.6431925    0.2347418 
##          say       secret         seen        shake        shall 
##    0.3521127    0.1173709    0.1173709    0.1173709    0.2347418 
##        shape      shouldn         show         sick       sickly 
##    0.1173709    0.1173709    0.1173709    0.1173709    0.1173709 
##         side         sign          sir       sister        small 
##    0.1173709    0.1173709    1.5258216    0.2347418    0.1173709 
##           so       softly        speak       square      staring 
##    0.1173709    0.1173709    0.1173709    0.1173709    0.1173709 
##      started        state        still        stone        stout 
##    0.1173709    0.1173709    0.1173709    0.1173709    0.1173709 
##       strike     struggle         such      sumever     supposin 
##    0.1173709    0.1173709    0.1173709    0.1173709    0.1173709 
##            t        taken         tear         tell     terrible 
##    1.0563380    0.1173709    0.1173709    0.1173709    0.1173709 
##       terror         that          the        their         them 
##    0.1173709    1.9953052    3.1690141    0.7042254    0.3521127 
##         then        there         they        think         this 
##    0.1173709    0.3521127    0.2347418    0.2347418    0.2347418 
##  threatening       throat       tilted      timidly           to 
##    0.1173709    0.2347418    0.4694836    0.1173709    2.3474178 
##    tombstone   tombstones          too         tore     trousers 
##    0.1173709    0.1173709    0.1173709    0.1173709    0.1173709 
##       trying         tuck         turn    undertook    universal 
##    0.1173709    0.1173709    0.1173709    0.1173709    0.1173709 
## unreasonably           up      upright           us           ve 
##    0.1173709    0.4694836    0.1173709    0.2347418    0.1173709 
##        voice         wain         warm          was          way 
##    0.1173709    0.1173709    0.1173709    0.4694836    0.2347418 
##         were         wery          wet         what        where 
##    0.5868545    0.1173709    0.1173709    0.7042254    0.2347418 
##      whether        which          who         wife         will 
##    0.1173709    0.3521127    0.3521127    0.2347418    0.1173709 
##         wish         with      wittles         word        words 
##    0.1173709    0.9389671    0.3521127    0.1173709    0.2347418 
##        would           ye          yes       yonder          you 
##    0.1173709    0.1173709    0.3521127    0.1173709    3.2863850 
##        young         your 
##    1.1737089    1.2910798

Part II: Using TidyText to ‘read’ all of Livy

For these two lessons we will be modifying code from Julia Silge and David Robinson’s Text Mining with R: A Tidy Approach.

Before getting started, make sure you have set your working directory.

setwd("~/Desktop")

We did this to situate ourselves correctly within the file system: we set our working directory to a reasonable place, the Desktop.

Note that the tilde (~) refers to your home directory, and your Desktop should be one step (/) down from it. On Windows you would need to type out the full file path, something like C:\Users\[username]\Desktop.
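You can confirm where R is currently working with the getwd() function:

getwd()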

Next we load the necessary libraries for these lessons. Note: If you get error messages, you will need to install the libraries by navigating to the “Packages” tab on the right-side panel of RStudio. Then click “Install,” enter the name of the package, and install it.
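You can also install packages from the console; for example, to install everything used below in one go:

install.packages(c("tidytext", "dplyr", "stringr", "glue", "tidyverse", "tidyr", "ggplot2", "gutenbergr"))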

library(tidytext)
library(dplyr)
library(stringr)
library(glue)
library(tidyverse)
library(tidyr)
library(ggplot2)
library(gutenbergr)

Before going into more detail, I will briefly explain the ‘tidy’ approach to data that will be used in what follows. The tidy approach assumes three principles regarding data structure:1

  • Each variable forms a column.

  • Each observation forms a row.

  • Each type of observational unit forms a table.

For tidy text analysis, what results is a table with one-token-per-row. (Recall that a token is any meaningful unit of text: usually it is a word, but it can also be an n-gram, sentence, or even a root of a word.)

pound_poem <- c("The apparition of these faces in the crowd;", "Petals on a wet, black bough.")

pound_poem
## [1] "The apparition of these faces in the crowd;"
## [2] "Petals on a wet, black bough."

Here we have created a character vector like we did before: the vector consists of two strings of text. In order to get this into tidy format, we first need to turn it into a data frame (here a ‘tibble’, a modern type of data frame used throughout the tidyverse and convenient for text-based analysis).

pound_poem_df <- tibble(line = 1:2, text = pound_poem)

pound_poem_df
## # A tibble: 2 x 2
##    line                                        text
##   <int>                                       <chr>
## 1     1 The apparition of these faces in the crowd;
## 2     2               Petals on a wet, black bough.

While better, this format is still not useful for tidy text analysis because we still need each word to be individually accounted for. To accomplish this act of tokenization, use the unnest_tokens function.

pound_poem_df %>% unnest_tokens(word, text)
## # A tibble: 14 x 2
##     line       word
##    <int>      <chr>
##  1     1        the
##  2     1 apparition
##  3     1         of
##  4     1      these
##  5     1      faces
##  6     1         in
##  7     1        the
##  8     1      crowd
##  9     2     petals
## 10     2         on
## 11     2          a
## 12     2        wet
## 13     2      black
## 14     2      bough
# the unnest_tokens function requires two arguments: the output column name (word), and the input column that the text comes from (text)

Notice how each word is in its own row, but also that its original line number is still intact. That is the basic logic of tidy text analysis. Now let’s apply this to a larger data set.

Using the gutenbergr package with tidytext:

By printing the gutenberg_authors dataset (included with the gutenbergr package), you can see how the author names are formatted.

gutenberg_authors
## # A tibble: 16,236 x 7
##    gutenberg_author_id                                     author
##                  <int>                                      <chr>
##  1                   1                              United States
##  2                   3                           Lincoln, Abraham
##  3                   4                             Henry, Patrick
##  4                   5                                 Adam, Paul
##  5                   7                             Carroll, Lewis
##  6                   8 United States. Central Intelligence Agency
##  7                   9                           Melville, Herman
##  8                  10              Barrie, J. M. (James Matthew)
##  9                  12                         Smith, Joseph, Jr.
## 10                  14                             Madison, James
## # ... with 16,226 more rows, and 5 more variables: alias <chr>,
## #   birthdate <int>, deathdate <int>, wikipedia <chr>, aliases <chr>

Now let’s run our first query to find the texts we want to load.

# this searches the Gutenberg metadata for works whose author field matches the string given to str_detect, and returns their titles
gutenberg_works(str_detect(author, "Livy"))$title
## [1] "Roman History, Books I-III"                                                                                                
## [2] "The History of Rome, Books 09 to 26\r\nLiterally Translated, with Notes and Illustrations, by D. Spillan and Cyrus Edmonds"
## [3] "The History of Rome, Books 27 to 36"                                                                                       
## [4] "The History of Rome, Books 01 to 08"                                                                                       
## [5] "The History of Rome, Books 37 to the End\nwith the Epitomes and Fragments of the Lost Books"

Did you notice anything wrong with this? The first result duplicates some of the content of the fourth, so we should not use that first text id. Remember, the first rule of scholarship is TRUST NO ONE. In computing, never trust your data. So we’ll narrow the ingestion of the gutenberg ids to start with the second result.

# creates a variable that takes the gutenberg ids of the second through fifth Livy results
ids <- gutenberg_works(str_detect(author, "Livy"))$gutenberg_id[2:5]

livy <- gutenbergr::gutenberg_download(ids)
livy <- livy %>%
  group_by(gutenberg_id) %>%
  mutate(line = row_number()) %>%
  ungroup()

Here we created a new data frame called livy by passing those ids to the gutenberg_download function, and then added a running line number within each work. What does the gutenberg_download function do? Again, type a ? before the function name to receive a description from the R Documentation. Try the example() function, too.

Also, from the code above you might be wondering what the $ and %>% symbols mean. The $ extracts a named column (or element) from a data frame or list, so gutenberg_works(...)$gutenberg_id pulls out just the gutenberg_id column. The %>% is a connector (a pipe) that mimics nesting. The rule is that the object on the left-hand side is passed as the first argument to the function on the right-hand side, so, considering the last two lines, mutate(line = row_number()) %>% ungroup() is the same as ungroup(mutate(line = row_number())). It just makes the code (and particularly multi-step functions) more readable.2
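To see the pipe in action on something tiny (a toy example, not part of the Livy workflow), the following two lines produce exactly the same result:

sort(unique(c(3, 1, 2, 2)))
c(3, 1, 2, 2) %>% unique() %>% sort()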

?gutenberg_download

Now let’s see what we have downloaded. R has a summary function that shows an overview of the new data frame we just created, livy.

summary(livy)
##   gutenberg_id       text                line      
##  Min.   :10907   Length:91533       Min.   :    1  
##  1st Qu.:12582   Class :character   1st Qu.: 5721  
##  Median :19725   Mode  :character   Median :11442  
##  Mean   :24567                      Mean   :11930  
##  3rd Qu.:44318                      3rd Qu.:17163  
##  Max.   :44318                      Max.   :31016

Now we transform this into a tidy data set.

tidy_livy <- livy %>%
  unnest_tokens(output = word, input = text, token = "words")
  
tidy_livy %>% 
  count(word, sort = TRUE) %>%
  filter(n > 4000) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

Now we are mostly seeing function words in these results. But what is interesting about the function words? Notice the prominence of pronouns, for example.

Of course you will want to complement these results with substantive results (i.e., with stop words filtered out).

data(stop_words)

tidy_livy <- tidy_livy %>%
  anti_join(stop_words)
## Joining, by = "word"
livy_plot <- tidy_livy %>% 
  count(word, sort = TRUE) %>%
  filter(n > 600) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  ylab("Word frequencies in Livy's History of Rome") +
  coord_flip()

livy_plot

For the visual above, you might want to click the ‘Show in New Window’ button in the upper-right corner of the plot pane so that you can enlarge the results.

We might also want to read the word frequencies, or have them as a searchable table. The first code block below renders the results above as a table, and the second writes all of the results into a csv (spreadsheet) file.

tidy_livy %>%
  count(word, sort = TRUE)
## # A tibble: 19,587 x 2
##      word     n
##     <chr> <int>
##  1    war  2520
##  2   time  2241
##  3  enemy  2193
##  4 romans  2168
##  5 consul  2167
##  6  roman  2103
##  7   city  2039
##  8 senate  1987
##  9 people  1963
## 10   army  1662
## # ... with 19,577 more rows
livy_words <- tidy_livy %>%
  count(word, sort = TRUE)

write_csv(livy_words, "livy_words.csv")

# Note that if you want to retain the tidy data (that is, the title-line-word columns in multiple works, say),
# then you would just invoke the tidy_livy variable: write_csv(tidy_livy, "livy_words.csv")

Much of what we have done can also be done in Voyant Tools, to be sure. However, we have been able to load data faster in R, and we have also organized the data into tidytext tables that allow us to make judgments about the similarities and differences between the works in the corpus. It is also important to stress that you retain more control over organizing and manipulating your data with R, whereas in Voyant you are beholden to unstructured text files in a pre-built visualization interface.

To illustrate this flexibility, let’s investigate the data in ways that are unique to R (and programming in general).

We might want to make similar calculations by book, which is easier now due to the tidy data structure.

livy_word_freqs_by_book <- tidy_livy %>%
  group_by(gutenberg_id) %>%
  count(word, sort = TRUE) %>%
  ungroup()

livy_word_freqs_by_book %>%
    filter(n > 250) %>%
    ggplot(mapping = aes(x = word, y = n)) +
    geom_col() +
    coord_flip()

This shows, in alphabetical order, each word that is used more than 250 times in any one book. We can also break up the results into individual graphs for each book.

livy_word_freqs_by_book %>%
    filter(n > 250) %>%
    ggplot(mapping = aes(x = word, y = n)) +
    geom_col() +
    coord_flip() + facet_wrap(facets = ~ gutenberg_id)

This might appear to be an overwhelming picture, but it is an immediate display of similarities and differences between books. Granted, they are slightly out of order (id 10907 is The History of Rome, Books 09 to 26, and 12582 is Books 01 to 08), but you can immediately notice how the first half differs from the second in its content.
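If you would rather see titles than id numbers in the panel labels, one option (a sketch: gutenberg_metadata is a metadata table that ships with the gutenbergr package) is to join the titles in before plotting:

livy_word_freqs_by_book %>%
    left_join(select(gutenberg_metadata, gutenberg_id, title), by = "gutenberg_id") %>%
    filter(n > 250) %>%
    ggplot(mapping = aes(x = word, y = n)) +
    geom_col() +
    coord_flip() + facet_wrap(facets = ~ title)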

We could re-engineer the code in the previous examples to look more closely at these results. First we’ll narrow our data set to the more interesting id numbers mentioned already.

livy2 <- gutenberg_download(c(10907, 44318))

livy_tidy2 <- livy2 %>%
  group_by(gutenberg_id) %>%
  mutate(line = row_number()) %>%
  ungroup()

livy_tidy2 <- livy_tidy2 %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
## Joining, by = "word"
livy_word_freqs_by_book <- livy_tidy2 %>%
  group_by(gutenberg_id) %>%
  count(word, sort = TRUE) %>%
  ungroup()

livy_word_freqs_by_book %>%
    filter(n > 210) %>%
    ggplot(mapping = aes(x = word, y = n)) +
    geom_col() +
    coord_flip() + facet_wrap(facets = ~ gutenberg_id)

What is the most consistent word used throughout Livy’s History?
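One way to make that question concrete (a sketch: here ‘consistent’ means a word that appears in every downloaded volume with roughly the same relative frequency, and the 0.001 cut-off simply filters out very rare words) is:

tidy_livy %>%
  count(gutenberg_id, word) %>%
  group_by(gutenberg_id) %>%
  mutate(proportion = n / sum(n)) %>%
  ungroup() %>%
  group_by(word) %>%
  summarise(volumes = n(), mean_prop = mean(proportion), sd_prop = sd(proportion)) %>%
  filter(volumes == max(volumes), mean_prop > 0.001) %>%
  arrange(sd_prop)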

Let’s now compare these results to another important chronicler, from a different era: Herodotus.

herodotus <- gutenberg_download(c(2707, 2456))

This downloads the two-volume e-text of Herodotus’s Histories (the values passed to c() are the gutenberg ids of the two volumes; the ids can be found by searching for texts on gutenberg.org, clicking on the Bibrec tab, and copying the EBook-No.).
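You can also look the ids up from within R, just as we did for Livy (the exact results will depend on the current state of the Gutenberg metadata):

gutenberg_works(str_detect(author, "Herodotus"))$title
gutenberg_works(str_detect(author, "Herodotus"))$gutenberg_id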

tidy_herodotus <- herodotus %>%
  unnest_tokens(word, text)

tidy_herodotus %>%
  count(word, sort = TRUE)
## # A tibble: 12,669 x 2
##     word     n
##    <chr> <int>
##  1   the 23500
##  2   and 12463
##  3    of 11936
##  4    to 10289
##  5    in  5059
##  6  that  4601
##  7  they  4278
##  8    he  3996
##  9     a  3650
## 10   for  3335
## # ... with 12,659 more rows

What are the differences here with the Livy results?

Now let’s filter out the stop words again.

tidy_herodotus <- herodotus %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) 
## Joining, by = "word"
tidy_herodotus %>%
  count(word, sort = TRUE)
## # A tibble: 12,116 x 2
##         word     n
##        <chr> <int>
##  1      time   784
##  2      land   756
##  3       son   708
##  4      thou   669
##  5      king   639
##  6  persians   625
##  7  hellenes   579
##  8      army   447
##  9 athenians   446
## 10      city   445
## # ... with 12,106 more rows

We could also add into the mix yet another text. Let’s try Edward Gibbon.

gibbon <- gutenberg_works(author == "Gibbon, Edward") %>%
  gutenberg_download(meta_fields = "title")

tidy_gibbon <- gibbon %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
## Joining, by = "word"
tidy_gibbon %>%
  count(word, sort = TRUE)
## # A tibble: 54,823 x 2
##        word     n
##       <chr> <int>
##  1 footnote  8533
##  2       ii  3932
##  3       de  3726
##  4      tom  3281
##  5  emperor  2399
##  6    roman  2262
##  7   empire  1911
##  8     rome  1776
##  9      iii  1745
## 10     time  1732
## # ... with 54,813 more rows

Let’s visualize the differences.

frequency <- bind_rows(mutate(tidy_livy, author = "Livy"),
                       mutate(tidy_herodotus, author = "Herodotus"),
                       mutate(tidy_gibbon, author = "Edward Gibbon")) %>% 
  mutate(word = str_extract(word, "['a-z']+")) %>%
  count(author, word) %>%
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>% 
  select(-n) %>% 
  spread(author, proportion) %>% 
  gather(author, proportion, `Livy`:`Herodotus`)
library(scales)
ggplot(frequency, aes(x = proportion, y = `Edward Gibbon`, color = abs(`Edward Gibbon` - proportion))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
  facet_wrap(~author, ncol = 2) +
  theme(legend.position="none") +
  labs(y = "Edward Gibbon", x = NULL)
## Warning: Removed 103229 rows containing missing values (geom_point).
## Warning: Removed 103231 rows containing missing values (geom_text).

Words that fall close to the diagonal line in these plots have similar relative frequencies in both sets of texts; words near the upper end of the line are frequent in both.

cor.test(data = frequency[frequency$author == "Livy",],
         ~ proportion + `Edward Gibbon`)
## 
##  Pearson's product-moment correlation
## 
## data:  proportion and Edward Gibbon
## t = 135.64, df = 12690, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7620814 0.7762877
## sample estimates:
##       cor 
## 0.7692796
cor.test(data = frequency[frequency$author == "Herodotus",],
         ~ proportion + `Edward Gibbon`)
## 
##  Pearson's product-moment correlation
## 
## data:  proportion and Edward Gibbon
## t = 150.62, df = 6851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8707856 0.8817733
## sample estimates:
##       cor 
## 0.8763934

What this shows (statistically) is that Gibbon’s word frequencies are more strongly correlated with Herodotus than with Livy—which is fascinating, given that Gibbon was writing about the same subject as Livy!

What else can you infer from these comparisons?

Part III: Using TidyText to perform sentiment analysis

Let’s continue with two of our authors from the previous section: Livy and Herodotus. Now we will create a ‘words’ data frame for each, following the standard tidytext process: writing the text to a file, reading it back in, numbering the lines, and tokenizing the words.

livy <- livy %>% unnest_tokens(word,text)

write.table(livy, file = "livy.txt")

livy_words <- data_frame(file = paste0('livy.txt')) %>%
  mutate(text = map(file, read_lines)) %>%
  unnest() %>%
  group_by(file = str_sub(basename(file), 1, -5)) %>%
  mutate(line_number = row_number()) %>%
  ungroup() %>%
  unnest_tokens(word, text)

Next we invoke the inner_join function, which merges one data set with another and keeps only the rows that match in both. Here we are joining the word data from Livy with a sentiment lexicon (the ‘bing’ lexicon) that labels each word as positive or negative.
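If you want to see what that lexicon looks like before joining, you can print it; get_sentiments() comes with tidytext and returns a two-column table of words and their sentiment labels.

get_sentiments("bing")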

livy_words_sentiment <- inner_join(livy_words,
                              get_sentiments("bing")) %>%
  # the index expression below bins the lines into chunks of roughly 5% of the text,
  # so each bar in the plot that follows covers about one twentieth of the work
  count(file, index = round(line_number/ max(line_number) * 100 / 5) * 5, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(net_sentiment = positive - negative)
## Joining, by = "word"

Using the ggplot library, we can visualise the results.

livy_words_sentiment %>% ggplot(aes(x = index, y = net_sentiment, fill = file)) + 
  geom_bar(stat = "identity", show.legend = FALSE) + 
  facet_wrap(~ file) + 
  scale_x_continuous("Location in the volume") + 
  scale_y_continuous("Bing net Sentiment")

Let’s make this interesting: let’s compare these results to Herodotus.

herodotus <- herodotus %>% unnest_tokens(word,text)

write.table(herodotus, file = "herodotus.txt")

herodotus_words <- data_frame(file = paste0("herodotus.txt")) %>%
  mutate(text = map(file, read_lines)) %>%
  unnest() %>%
  group_by(file = str_sub(basename(file), 1, -5)) %>%
  mutate(line_number = row_number()) %>%
  ungroup() %>%
  unnest_tokens(word, text)

herodotus_words_sentiment <- inner_join(herodotus_words,
                                                 get_sentiments("bing")) %>%
  count(file, index = round(line_number/ max(line_number) * 100 / 5) * 5, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(net_sentiment = positive - negative)
## Joining, by = "word"
herodotus_words_sentiment %>% ggplot(aes(x = index, y = net_sentiment, fill = file)) + 
  geom_bar(stat = "identity", show.legend = FALSE) + 
  facet_wrap(~ file) + 
  scale_x_continuous("Location in the volume (by percentage)") + 
  scale_y_continuous("Bing net sentiment of Herodotus's Histories...")

That’s quite a difference. Clearly Livy’s Roman history leans more heavily on negative words. Let’s break down the Livy results into more understandable graphs.

bing_word_counts <- livy_words %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"
bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(20) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Word Frequency of Sentiment Words in Livy",
       x = NULL) +
  coord_flip()
## Selecting by n

summary(bing_word_counts)
##      word            sentiment               n          
##  Length:2636        Length:2636        Min.   :   1.00  
##  Class :character   Class :character   1st Qu.:   2.00  
##  Mode  :character   Mode  :character   Median :   5.00  
##                                        Mean   :  20.23  
##                                        3rd Qu.:  16.00  
##                                        Max.   :2193.00
bing_word_counts <- herodotus_words %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"
bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(20) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Word Frequency of Sentiment Words in Herodotus",
       x = NULL) +
  coord_flip()
## Selecting by n

summary(bing_word_counts)
##      word            sentiment               n          
##  Length:1340        Length:1340        Min.   :  1.000  
##  Class :character   Class :character   1st Qu.:  1.000  
##  Mode  :character   Mode  :character   Median :  3.000  
##                                        Mean   :  7.894  
##                                        3rd Qu.:  7.000  
##                                        Max.   :467.000

Another way to re-orient the sentiment results is to create a word cloud. Sometimes these can be useful for assessing the total weight of positivity or negativity in a corpus.

library(wordcloud)
library(reshape2)

# create a sentiment wordcloud of the Livy results

livy_words %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(max.words = 1000, scale = c(1,.25), 
                   random.order = FALSE,
                   colors = c("red", "blue"))

As you can see, the cloud displays the overall negativity that the bar chart above suggested. Let’s see how that compares to Herodotus.

library(wordcloud)
library(reshape2)

# create a sentiment wordcloud of the Herodotus results

herodotus_words %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(max.words = 1000, scale = c(1,.25), 
                   random.order = FALSE,
                   colors = c("red", "blue"))

Exercise 3

Load your own texts (either from your own corpus or from a digital repository like Perseus or Project Gutenberg).

Posit a new question–or questions–about what you would like to investigate further.

Modify one or more code blocks from Part I of the R Notebook to answer your question (a minimal starting point is also sketched below).
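If you would like a skeleton to start from, here is a minimal sketch using gutenbergr (the author string "Austen" is only an example; substitute whatever author or text you are actually interested in):

my_text <- gutenberg_works(str_detect(author, "Austen")) %>%
  gutenberg_download(meta_fields = "title")

tidy_my_text <- my_text %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)

tidy_my_text %>%
  count(word, sort = TRUE)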

Publish your results

When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Cmd+Shift+K to preview the HTML file).
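If you prefer to render from the console instead of the Preview button, the rmarkdown::render() function does the same job (the filename below is just a placeholder for your own notebook file):

rmarkdown::render("my_notebook.Rmd")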

If you are interested in learning more about R and corpus linguistics, in addition to Silge and Robinson and Jockers, you could also consult R. H. Baayen’s Analyzing Linguistic Data: A practical introduction to statistics (Cambridge UP, 2008) and Stefan Gries’s Quantitative Corpus Linguistics with R, 2nd ed. (Routledge, 2017).

Some good web resources include Jeff Rydberg-Cox’s Introduction to R and Julia Silge and David Robinson’s Text Mining with R. Also be sure to examine the CRAN R Documentation site.


  1. For more on this, see Hadley Wickham’s “Tidy Data,” Journal of Statistical Software 59 (2014): 1–23. https://doi.org/10.18637/jss.v059.i10.

  2. Granted, it is not part of R’s base code, but it was defined by the magrittr package and is now widely used in the dplyr and tidyr packages.