---
title: "Natural Language Processing"
---
Natural language processing (NLP) is a set of techniques for using computers to detect in human language the kinds of things that humans detect automatically. For example, when you read a text, you parse it into paragraphs and sentences. You do not explicitly label the parts of speech, but you certainly understand them. You notice names of people and places as they come up. And you can tell whether a sentence or paragraph is happy or angry or sad.
This kind of analysis is difficult in any programming language, not least because human language can be so rich and subtle that computer languages cannot capture anywhere near the total amount of information "encoded" in it. But when it comes to natural language processing, R programmers have reason to envy Python programmers. The [Natural Language Toolkit](http://www.nltk.org/) (NLTK) for Python is a robust library and set of corpora, and the accompanying book *[Natural Language Processing with Python](http://www.nltk.org/book/)* is an excellent guide to the practice of NLP.^[Steven Bird, Ewan Klein, and Edward Loper, *Natural Language Processing with Python* (Cambridge, MA: O'Reilly, 2009), <http://www.nltk.org/book/>.] As explained in the [introduction](introduction.html), R and Python are close competitors in many kinds of data analysis and digital history, but if you were going to do only NLP then the NLTK would be a clear winner.
Nevertheless, R does have good libraries for natural language processing. Because R is able to interface with other languages like C, C++, and Java, it is possible to use libraries written in those lower-level and hence faster languages, while writing your code in R and taking advantage of its functional programming style and its many other libraries for data analysis.^[Because the NLTK is written in Python it can be quite slow. R is not exactly a speed demon either, but because most of the NLP libraries that it uses are written in Java, they at least have the potential to match or better NLTK's performance.] Indeed, most of the techniques that are of greatest use to historians, such as word and sentence tokenization, n-gram creation, and named entity recognition, are easily performed in R.^[Natural language processing includes a large number of tasks: the [Wikipedia article on NLP](http://en.wikipedia.org/wiki/Natural_language_processing) lists more than thirty kinds of NLP, though these classifications are arbitrary. Many of these are not often essential to digital historians---for example, text to speech, part-of-speech tagging, and natural language question answering---or perhaps we should say that they are insufficiently developed to be useful to historians yet. This chapter focuses on a few concrete tasks with ready applications to history.]
After explaining how to install the necessary libraries, this chapter will use a sample paragraph to demonstrate a few key techniques. First, we will tokenize the paragraph into words, sentences, and n-grams. (The n-grams will be used in the chapter on [document similarity](document-similarity.html).) Next we will extract the names of people and places from the document. Finally we will use those techniques on the journals or autobiographies of three itinerant preachers from the nineteenth-century United States to see which people they mention in common and which places they visited. For other kinds of text-analysis problems, see the chapters on [document similarity](document-similarity.html) and on [topic modeling](topic-modeling.html).
## Installing the necessary libraries
There are several R packages which make natural language processing possible in R. The [NLP](http://cran.rstudio.org/web/packages/NLP/) package provides a set of classes and functions for NLP which are used widely by other packages in R. The [openNLP](http://cran.rstudio.org/web/packages/openNLP/) package provides an interface to the [Apache OpenNLP library](https://opennlp.apache.org/), which is written in Java. [RWeka](http://cran.rstudio.org/web/packages/RWeka/) provides an R interface to the [Weka data mining software](http://www.cs.waikato.ac.nz/ml/weka/), also written in Java. RWeka is especially useful for creating n-grams. You may wish to investigate the [qdap](http://cran.rstudio.org/web/packages/qdap/) package, which contains many functions for qualitative discourse analysis. You can see other possibly useful packages at the [CRAN task view on natural language processing](http://cran.rstudio.com/web/views/NaturalLanguageProcessing.html).^[The [NLP](http://cran.rstudio.org/web/packages/NLP/) package also supports the [Stanford CoreNLP](http://nlp.stanford.edu/software/corenlp.shtml) suite of tools written in Java. We will use openNLP instead in this chapter. But if you wish to install Stanford CoreNLP, you can install it from the DataCube repository: `install.packages("StanfordCoreNLP", repos = "http://datacube.wu.ac.at", type = "source")`.]
Both openNLP and RWeka depend on the [rJava](http://cran.rstudio.org/web/packages/rJava/) package, which provides the low-level connection to Java. To install these natural language processing packages, it is important to get a working installation of Java and rJava on your machine. If at all possible, you should use the system version of Java installed by your operating system. You can find out whether you have Java by running `which java` at your terminal. If Java is present, you can check its version by running `java -version`. Otherwise you will have to install the [Java Development Kit](http://www.oracle.com/technetwork/java/javase/downloads/index.html) (JDK) on your computer. If you have to install Java, try running `R CMD javareconf` in your terminal afterwards (you may need to prefix that command with `sudo`).
On Ubuntu, it is easiest to install rJava as an Ubuntu package using the following commands (you may need to prefix them with `sudo`):
```
apt-get update
apt-get install r-cran-rjava
```
Once you have Java on your machine, you can install [rJava](http://cran.rstudio.org/web/packages/rJava/) from CRAN with `install.packages("rJava")`. If the installation of that package is successful and if it can find your version of Java, then you should be able to load the package without any messages. If you receive an error message when you attempt to load rJava, you may need to set your `PATH` or other [environment variables](http://stackoverflow.com/questions/603785/environment-variables-in-mac-os-x).^[Running `R CMD javareconf --help` may provide some guidance.]
```{r}
library(rJava)
```
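If rJava loads but you want to confirm that it is actually talking to your Java installation, a quick optional check like the following (a sketch, not part of the original workflow) starts the JVM and asks it for its version:
```{r, eval=FALSE}
# Start the JVM and query the java.version system property through rJava
.jinit()
.jcall("java/lang/System", "S", "getProperty", "java.version")
```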
Once you can successfully load rJava, you can install the other packages from CRAN.
```{r, eval=FALSE}
install.packages(c("NLP", "openNLP", "RWeka", "qdap"))
```
Note that these NLP packages and the underlying Java libraries depend on large models of human languages. For some uses in English you should not need to download other data or models. But if you want to use a language other than English, or if you want to use entity extraction as below, you can download the openNLP models from a CRAN-like repository called [Datacube](http://datacube.wu.ac.at/). You can look in the [NLP](http://cran.rstudio.org/web/packages/NLP/) documentation or use the following command to install models, substituting a language code for `en` as appropriate.
```{r eval=TRUE}
if(!require("openNLPmodels.en")) {
install.packages("openNLPmodels.en",
repos = "http://datacube.wu.ac.at/",
type = "source")
}
```
Assuming you have successfully installed everything, you should be able to load the following libraries.
```{r}
library(NLP)
library(openNLP)
library(RWeka)
```
## Basic Tokenization
To learn some basics of text processing,^[You may also find it helpful to review the chapter on working with [strings](strings.html), or character vectors as they are properly called in R.] we will work with a paragraph from the biography of Jarena Lee, an African American woman born around 1783 who was an exhorter and itinerant in the African Methodist Episcopal Church.^[The paragraph is taken from Lee's biography in *American National Biography Online*, s.v. "Lee, Jarena," by Barbara McCaskill, <http://www.anb.org/articles/16/16-03109.html>.] Later we will work with the full text of Lee's *Religious Experience and Journal of Mrs. Jarena Lee*.
![Jarena Lee](screenshots/lee-jarena.jpg)
We will learn to read the text file into R, to break it into words and sentences, and to turn it into n-grams. These steps are all forms of tokenization, because we are breaking the text into units of meaning, called tokens.
### Reading a text file
There are a number of ways to read a text file into R. In particular the `scan()` function provides many options for reading a file and breaking it into components. But the simplest way to read a plain text file is with the `readLines()` function, which reads each line of the file as an element of a character vector.
```{r}
bio <- readLines("data/nlp/anb-jarena-lee.txt")
print(bio)
```
You can see that there are thirteen lines in the file, each an element of the character vector. We can combine all of these elements into a single character vector of length one using the `paste()` function, adding a space between each of them.
```{r}
bio <- paste(bio, collapse = " ")
print(bio)
```
### Sentence and Word Annotations
Now that we have the file loaded, we can begin to turn it into words and sentences. This is a prerequisite for any other kind of natural language processing, because those kinds of NLP actions will need to know where the words and sentences are. First we load the necessary libraries.
```{r}
library(NLP)
library(openNLP)
library(magrittr)
```
For many kinds of text processing it is sufficient, even preferable, to use base R classes. But for NLP we are obligated to use the `String` class from the NLP package, so we need to convert our `bio` variable to a `String`.
```{r}
bio <- as.String(bio)
```
Next we need to create annotators for words and sentences. Annotators are created by constructor functions which load the underlying Java libraries; the resulting annotators mark the places in the string where words and sentences start and end.
```{r}
word_ann <- Maxent_Word_Token_Annotator()
sent_ann <- Maxent_Sent_Token_Annotator()
```
These annotators form a "pipeline" for annotating the text in our `bio` variable. First we have to determine where the sentences are, then we can determine where the words are. We can apply these annotator functions to our data using the `annotate()` function.
```{r}
bio_annotations <- annotate(bio, list(sent_ann, word_ann))
```
The result is an annotation object. Looking at the first few items contained in the object, we can see the kind of information it holds.
```{r}
class(bio_annotations)
head(bio_annotations)
```
We see that the annotation object contains a list of sentences (and also words) identified by position. That is, the first sentence in the document begins at character 1 and ends at character 111. The sentence annotations also contain information about the positions of the words that make them up.
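Because each annotation carries a `type` field, we can also count how many sentences and words were found. This is just a quick sanity check (a sketch; the counts will depend on your text):
```{r}
# Count the annotations by type
sum(bio_annotations$type == "sentence")
sum(bio_annotations$type == "word")
```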
We can combine the biography and the annotations to create what the NLP package calls an `AnnotatedPlainTextDocument`. If we wished, we could also associate metadata with the object using the `meta =` argument.
```{r}
bio_doc <- AnnotatedPlainTextDocument(bio, bio_annotations)
```
Now we can extract information from our document using accessor functions like `sents()` to get the sentences and `words()` to get the words. We could get just the plain text with `as.character(bio_doc)`.
```{r}
sents(bio_doc) %>% head(2)
words(bio_doc) %>% head(10)
```
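The chapter introduction also mentioned n-grams. We will put them to work in the chapter on [document similarity](document-similarity.html), but as a quick sketch, RWeka's `NGramTokenizer()` can create them from the text directly; here we ask for bigrams by setting `min = 2, max = 2`.
```{r}
# Bigrams (two-word sequences) from the biography, created with RWeka
bio_bigrams <- NGramTokenizer(as.character(bio), Weka_control(min = 2, max = 2))
head(bio_bigrams, 10)
```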
This is already useful, since we could use the resulting lists of sentences and words to perform other kinds of calculations. But there are other kinds of annotations which are more immediately relevant to historians.
### Annotating people and places
Among the several kinds of annotators provided by the [openNLP](http://cran.rstudio.org/web/packages/openNLP/) package is an entity annotator. An entity is basically a proper noun, such as a person or place name. Using a technique called named entity recognition (NER), we can extract various kinds of names from a document. In English, OpenNLP can find dates, locations, money, organizations, percentages, people, and times. (Acceptable values are `"date", "location", "money", "organization", "percentage", "person", "misc"`.) We will use it to find people, places, and organizations since all three are mentioned in our sample paragraph.
These kinds of annotator functions are created with the same kinds of constructor functions that we used to create `word_ann` and `sent_ann`.
```{r}
person_ann <- Maxent_Entity_Annotator(kind = "person")
location_ann <- Maxent_Entity_Annotator(kind = "location")
organization_ann <- Maxent_Entity_Annotator(kind = "organization")
```
Recall that we earlier passed a list of annotator functions to the `annotate()` function to indicate which kinds of annotations we wanted to make. We will create a new pipeline list to hold our annotators in the order we want to apply them, then apply it to the `bio` variable. Then, as before, we can create an `AnnotatedPlainTextDocument`.
```{r}
pipeline <- list(sent_ann,
                 word_ann,
                 person_ann,
                 location_ann,
                 organization_ann)
bio_annotations <- annotate(bio, pipeline)
bio_doc <- AnnotatedPlainTextDocument(bio, bio_annotations)
```
As before we could extract words and sentences using the getter methods `words()` and `sents()`. Unfortunately there is no comparably easy way to extract named entities from documents, but the function below will do the trick.
```{r}
# Extract entities from an AnnotatedPlainTextDocument
entities <- function(doc, kind) {
  s <- doc$content
  a <- annotations(doc)[[1]]
  if (hasArg(kind)) {
    k <- sapply(a$features, `[[`, "kind")
    s[a[k == kind]]
  } else {
    s[a[a$type == "entity"]]
  }
}
```
Now we can extract all of the named entities using `entities(bio_doc)`, and specific kinds of entities using the `kind = ` argument. Let's get all the people, places, and organizations.
```{r}
entities(bio_doc, kind = "person")
entities(bio_doc, kind = "location")
entities(bio_doc, kind = "organization")
```
Applying our techniques to this paragraph shows both the power and the limitations of NLP. We managed to extract every person named in the text: Jarena Lee, Richard Allen, and Joseph Lee. But Jarena Lee's six children were not detected. Both New Jersey and Philadelphia were detected, but "Snow Hill, New Jersey" was not, perhaps because "snow" and "hill" fooled the algorithm into thinking they were common nouns. The Bethel African Methodist Episcopal Church was detected, but not the unnamed African American church in Snow Hill; arguably "Methodists" is also an institution.
Of course we would hardly rely on NLP for short texts such as this. NLP is potentially useful when applied to texts of greater length---especially to more texts than we could read ourselves. In the next section, we will extend NLP to a small corpus of larger texts.
## Named Entity Recognition in a small corpus
Now that we know how to extract people and places from a text, we can do the same thing with a small corpus of texts. The code would be identical, or nearly so, for a much larger corpus. For this exercise, we will use three books by itinerant preachers in the nineteenth-century United States:
- Peter Cartwright, *Autobiography of Peter Cartwright, the Backwoods Preacher*, edited by W. P. Strickland (New York, 1857).
- Jarena Lee, *Religious Experience and Journal of Mrs. Jarena Lee* (Philadelphia, 1849).
- Parley Parker Pratt, *The Autobiography of Parley Parker Pratt* (New York, 1874).
These three people were rough contemporaries. Peter Cartwright (1785--1872) was a Methodist; Jarena Lee (1783--?) was a member of the African Methodist Episcopal Church; Parley Parker Pratt (1807--1857) was a Mormon apostle. Their books have been downloaded from Google Books and OCRed. Because they are mostly autobiographical, the places and people they mention were by and large actually places they went and people they met. Named entity recognition is thus likely to produce a picture closer to actual experience than to the imagined worlds of these writers. These texts are by no means perfectly OCRed, but they do represent texts of the quality with which historians often have to work.
Throughout this exercise we will use R's functional programming facilities, especially `lapply()`, to perform the same actions on each text. We could copy each and every command for all three texts, changing the names of files and variables as necessary. But this would be a recipe for disaster, or more likely, subtle error. By keeping to the programming principle DRY---don't repeat yourself---we can avoid unnecessary complications. Just as important, our code will be extensible not just to 3 documents, but to 300 or 3,000.
Before we start we have to load the necessary libraries.
```{r}
library(NLP)
library(openNLP)
library(magrittr)
```
Let's begin by finding the paths to each of the files using R's `Sys.glob()` function, which looks for wildcards in file names.
```{r}
filenames <- Sys.glob("data/itinerants/*.txt")
filenames
```
Now we can use `lapply()` to apply the `readLines()` function to each filename. The result will be a list with one item per file. And while we are at it, we can use `paste0()` to collapse the lines of each file into a single character vector, and `as.String()` to convert them to the format that the NLP packages expect. Finally, we can assign the names of the files to the items in the list.
```{r}
texts <- filenames %>%
  lapply(readLines) %>%
  lapply(paste0, collapse = " ") %>%
  lapply(as.String)
names(texts) <- basename(filenames)
str(texts, max.level = 1)
```
Next we need to take the steps we used above to annotate a document and abstract them into a function. This will let us use the function with `lapply()`. It will also permit us to re-use the code in future projects. This function will return an object of class `AnnotatedPlainTextDocument`.
```{r}
annotate_entities <- function(doc, annotation_pipeline) {
  annotations <- annotate(doc, annotation_pipeline)
  AnnotatedPlainTextDocument(doc, annotations)
}
```
Now we can define the pipeline of annotation functions that we are interested in. We will use just the person and location annotators:
```{r}
itinerants_pipeline <- list(
  Maxent_Sent_Token_Annotator(),
  Maxent_Word_Token_Annotator(),
  Maxent_Entity_Annotator(kind = "person"),
  Maxent_Entity_Annotator(kind = "location")
)
```
Now we can call our `annotate_entities()` function on each item in our list. (This function will take a considerable amount of time: a little more than a half hour on my computer.)
```{r, echo=TRUE}
# We won't actually run this long-running process. Instead we will just load the
# cached results.
load("data/nlp-cache.rda")
```
```{r, cache=TRUE, eval=FALSE}
texts_annotated <- texts %>%
  lapply(annotate_entities, itinerants_pipeline)
```
It is now possible to use our `entities()` function defined above to extract the relevant information. We could keep these all in a single list object, but to keep it from being unwieldy, we will create a list of places and of people mentioned in each text.
```{r}
places <- texts_annotated %>%
  lapply(entities, kind = "location")

people <- texts_annotated %>%
  lapply(entities, kind = "person")
```
A few statistics will give us a sense of what we have managed to extract. We can count up the number of items, as well as the number of unique items for each text.
```{r}
# Total place mentions
places %>%
  sapply(length)

# Unique places
places %>%
  lapply(unique) %>%
  sapply(length)

# Total mentions of people
people %>%
  sapply(length)

# Unique people mentioned
people %>%
  lapply(unique) %>%
  sapply(length)
```
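As one more quick summary, we can glance at which places each itinerant mentions most often (a sketch using base R's `table()` and `sort()`):
```{r}
# The five most frequently mentioned places in each text
places %>%
  lapply(function(x) head(sort(table(x), decreasing = TRUE), 5))
```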
We could do a lot with this information. We could improve the lists by editing them with our knowledge as historians. In particular, we could geocode the locations and create a map of the world of each itinerant. (See the chapter on [mapping](mapping.html).) We have also created simple lists of the people and places without regard to where they appear in the documents. But we do have the exact position of each person and place in each document, and could use that information for further analysis.
```{r}
library(ggmap)
all_places <- union(places[["pratt-parley.txt"]], places[["cartwright-peter.txt"]]) %>%
  union(places[["lee-jarena.txt"]])
# all_places_geocoded <- geocode(all_places)
```
In this next section we'll look at how to use some of R's set functions to figure out the overlap between these three itinerants.