<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  cache = TRUE,
  fig.path = "tools/README-",
  cache.path = "README-cache/",
  message = FALSE,
  warning = FALSE
)
```
fuzzyjoin: Join data frames on inexact matching
------------------
[![CRAN_Status_Badge](https://www.r-pkg.org/badges/version/fuzzyjoin)](https://cran.r-project.org/package=fuzzyjoin)
[![Travis-CI Build Status](https://travis-ci.org/dgrtwo/fuzzyjoin.svg?branch=master)](https://travis-ci.org/dgrtwo/fuzzyjoin)
[![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/dgrtwo/fuzzyjoin?branch=master&svg=true)](https://ci.appveyor.com/project/dgrtwo/fuzzyjoin)
[![Coverage Status](https://img.shields.io/codecov/c/github/dgrtwo/fuzzyjoin/master.svg)](https://codecov.io/github/dgrtwo/fuzzyjoin?branch=master)
The fuzzyjoin package is a variation on dplyr's join operations that matches rows not only on values that are exactly equal between columns, but also on inexact matches. This allows matching on:
* Numeric values that are within some tolerance (`difference_inner_join`)
* Strings that are similar in Levenshtein/cosine/Jaccard distance, or [other metrics](http://finzi.psych.upenn.edu/library/stringdist/html/stringdist-metrics.html) from the [stringdist](https://cran.r-project.org/package=stringdist) package (`stringdist_inner_join`)
* A regular expression in one column matching to another (`regex_inner_join`)
* Euclidean or Manhattan distance across multiple columns (`distance_inner_join`)
* Geographic distance based on longitude and latitude (`geo_inner_join`)
* Intervals of (start, end) that overlap (`interval_inner_join`)
* Genomic intervals, which include both a chromosome ID and (start, end) pairs, that overlap (`genome_inner_join`)
One relevant use case is classifying freeform text data (such as survey responses) against a finite set of options.
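As a minimal sketch of the first kind of match listed above, the chunk below joins two small made-up tables on numeric values that fall within a tolerance. The `readings` and `setpoints` tables and their column names are hypothetical; only `difference_inner_join` and its `max_dist` and `distance_col` arguments come from the package.

```{r}
library(dplyr)
library(fuzzyjoin)

# Hypothetical data: sensor readings and the nominal targets they should be near
readings <- tibble(sensor = c("a", "b", "c"), temp = c(20.1, 25.4, 30.0))
setpoints <- tibble(label = c("cool", "warm"), target = c(20, 25))

# Keep pairs where |temp - target| <= 0.5 and record the gap in a new column
near <- readings %>%
  difference_inner_join(setpoints, by = c(temp = "target"),
                        max_dist = 0.5, distance_col = "gap")
near
```

Sensor `c` has no setpoint within 0.5, so it is dropped, as with any inner join.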
The package also includes:
* For each of `regex_`, `stringdist_`, `difference_`, `distance_`, `geo_`, and `interval_`, variations for the six dplyr "join" operations. For example:
  * `regex_inner_join` (include only rows with matches in each)
  * `regex_left_join` (include all rows of left table)
  * `regex_right_join` (include all rows of right table)
  * `regex_full_join` (include all rows in each table)
  * `regex_semi_join` (filter left table for rows with matches)
  * `regex_anti_join` (filter left table for rows without matches)
* A general wrapper (`fuzzy_join`) that allows you to define your own custom fuzzy matching function.
* The option to include the calculated distance as a column in your output, using the `distance_col` argument
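To illustrate the general wrapper, here is a small sketch that passes a custom predicate to `fuzzy_join`. The data and the `within_5_percent` helper are invented for this example; the assumption is only that `match_fun` receives two equal-length vectors of candidate values and returns a logical vector.

```{r}
library(dplyr)
library(fuzzyjoin)

# Hypothetical data: measured values and nominal reference levels
measured <- tibble(id = 1:4, value = c(9.8, 20.5, 31, 55))
reference <- tibble(level = c("low", "mid", "high"), target = c(10, 20, 30))

# Custom predicate (illustrative): TRUE when value is within 5% of target
within_5_percent <- function(value, target) {
  abs(value - target) / target <= 0.05
}

matched <- measured %>%
  fuzzy_join(reference, by = c(value = "target"),
             match_fun = within_5_percent, mode = "inner")
matched
```

A relative tolerance like this cannot be expressed with `difference_inner_join`, whose `max_dist` is an absolute difference, which is the kind of case the wrapper is meant for.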
### Installation
Install from CRAN with:
```{r eval = FALSE}
install.packages("fuzzyjoin")
```
You can also install the development version from GitHub using [devtools](https://cran.r-project.org/package=devtools):
```{r eval = FALSE}
devtools::install_github("dgrtwo/fuzzyjoin")
```
### Example of `stringdist_inner_join`: Correcting misspellings against a dictionary
Often you find yourself with a set of words that you want to combine with a "dictionary". It could be a literal dictionary (as in this case) or a domain-specific category system, but either way you want to allow for small differences in spelling or punctuation.
The fuzzyjoin package comes with a set of common misspellings ([from Wikipedia](https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines)):
```{r}
library(dplyr)
library(fuzzyjoin)
data(misspellings)
misspellings
```
```{r words}
# use the dictionary of words from the qdapDictionaries package,
# which is based on the Nettalk corpus.
library(qdapDictionaries)
words <- as_tibble(DICTIONARY)
words
```
As an example, we'll pick 1000 of these words (you could try it on all of them though), and use `stringdist_inner_join` to join them against our dictionary.
```{r sub_misspellings}
set.seed(2016)
sub_misspellings <- misspellings %>%
  sample_n(1000)
```
```{r joined, dependson = c("words", "sub_misspellings")}
joined <- sub_misspellings %>%
  stringdist_inner_join(words, by = c(misspelling = "word"), max_dist = 1)
```
By default, `stringdist_inner_join` uses optimal string alignment (a restricted form of Damerau–Levenshtein distance), and we're setting a maximum distance of 1 for a join. Notice that they've been joined in cases where `misspelling` is close to (but not equal to) `word`:
```{r dependson = "joined"}
joined
```
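To make the choice of metric concrete, the following sketch compares a few distances using the stringdist package directly. The example strings are arbitrary; the method names `"osa"`, `"lv"`, and `"dl"` are stringdist's labels for optimal string alignment, Levenshtein, and full Damerau-Levenshtein.

```{r}
library(stringdist)

# Swapping two adjacent letters is one edit under OSA,
# but two substitutions under plain Levenshtein
osa_swap <- stringdist("teh", "the", method = "osa")
lv_swap <- stringdist("teh", "the", method = "lv")

# OSA never edits the same substring twice, so it can exceed the
# unrestricted Damerau-Levenshtein distance
osa_ca <- stringdist("ca", "abc", method = "osa")
dl_ca <- stringdist("ca", "abc", method = "dl")

c(osa_swap = osa_swap, lv_swap = lv_swap, osa_ca = osa_ca, dl_ca = dl_ca)
```

With `max_dist = 1`, this is why a transposition such as "teh" still joins to "the" under the default method, while it would not under `method = "lv"`.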
#### Classification accuracy
Note that there are some redundancies: misspellings that are close to multiple items in the dictionary. These end up with one row per "guess" in the output. How many words did we classify?
```{r dependson = "joined"}
joined %>%
  count(misspelling, correct)
```
So we found a match in the dictionary for about half of the misspellings. In how many of the ones we classified did we get at least one of our guesses right?
```{r dependson = "joined"}
which_correct <- joined %>%
  group_by(misspelling, correct) %>%
  summarize(guesses = n(), one_correct = any(correct == word))
which_correct
# fraction of classified misspellings with at least one correct guess
mean(which_correct$one_correct)
# number uniquely correct (out of the original 1000)
sum(which_correct$guesses == 1 & which_correct$one_correct)
```
Not bad.
#### Reporting distance in the joined output
If you want to include the distance as a column in your output, you can use the `distance_col` argument. For example, we may be interested in how many words were *two* letters apart.
```{r joined_dists, dependson = "sub_misspellings"}
joined_dists <- sub_misspellings %>%
  stringdist_inner_join(words, by = c(misspelling = "word"), max_dist = 2,
                        distance_col = "distance")
joined_dists
```
Note the extra `distance` column, which in this case will always be less than or equal to 2. We could then pick the closest match for each, and examine how many of our closest matches were 1 or 2 away:
```{r, dependson = "joined_dists"}
closest <- joined_dists %>%
  group_by(misspelling) %>%
  # desc() reverses the ranking, so top_n() keeps the smallest distance (ties included)
  top_n(1, desc(distance)) %>%
  ungroup()
closest
closest %>%
  count(distance)
```
#### Other joining functions
Note that `stringdist_inner_join` is not the only function we can use. If we're interested in including the words that we *couldn't* classify, we could have used `stringdist_left_join`:
```{r left_joined, dependson = c("words", "sub_misspellings")}
left_joined <- sub_misspellings %>%
  stringdist_left_join(words, by = c(misspelling = "word"), max_dist = 1)
left_joined
left_joined %>%
  filter(is.na(word))
```
(To get *just* the ones without matches immediately, we could have used `stringdist_anti_join`.) If we increase our distance threshold, we'll increase the fraction with a correct guess, but also get more false positive guesses:
```{r left_joined2, dependson = c("words", "sub_misspellings")}
left_joined2 <- sub_misspellings %>%
  stringdist_left_join(words, by = c(misspelling = "word"), max_dist = 2)
left_joined2
left_joined2 %>%
  filter(is.na(word))
```
Most of the missing words here simply aren't in our dictionary.
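For reference, here is a small sketch of the anti-join variant on toy data, which returns the unmatched rows directly instead of filtering a left join for `NA`. The `typed` and `dict` tables are made up for the example.

```{r}
library(dplyr)
library(fuzzyjoin)

# Hypothetical inputs: a few typed words and a tiny dictionary
typed <- tibble(misspelling = c("recieve", "adress", "zzxqv"))
dict <- tibble(word = c("receive", "address", "dress"))

# Rows of `typed` with no dictionary word within distance 1
unmatched <- typed %>%
  stringdist_anti_join(dict, by = c(misspelling = "word"), max_dist = 1)
unmatched
```

Only "zzxqv" is returned: "recieve" is one transposition from "receive", and "adress" is one edit from both "address" and "dress".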
You can try other distance thresholds, other dictionaries, and other distance metrics (see [stringdist-metrics](http://finzi.psych.upenn.edu/library/stringdist/html/stringdist-metrics.html) for more). This function is especially useful on a domain-specific dataset, such as free-form survey input that is likely to be close to one of a handful of responses.
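The survey use case might look like the sketch below. The `responses` and `choices` tables are invented for illustration; `ignore_case` and `distance_col` are arguments of `stringdist_inner_join`.

```{r}
library(dplyr)
library(fuzzyjoin)

# Hypothetical free-form survey answers and the allowed choices
responses <- tibble(respondent = 1:5,
                    answer = c("yes", "Yess", "no", "noo", "mabye"))
choices <- tibble(option = c("yes", "no", "maybe"))

# Match each answer to any choice within one edit, ignoring case
classified <- responses %>%
  stringdist_inner_join(choices, by = c(answer = "option"),
                        max_dist = 1, ignore_case = TRUE,
                        distance_col = "distance")
classified
```

With a short list of choices, a small `max_dist` is usually enough to absorb typos without mapping an answer to more than one option, though that depends on how similar the options are to each other.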
### Example of `regex_inner_join`: Classifying text based on regular expressions
Consider the book Pride and Prejudice, by Jane Austen, which we can access through the [janeaustenr package](https://cran.r-project.org/package=janeaustenr).
We could split the book up into "passages" of about 50 lines each.
```{r passages}
library(dplyr)
library(stringr)
library(janeaustenr)
passages <- tibble(text = prideprejudice) %>%
  group_by(passage = 1 + row_number() %/% 50) %>%
  summarize(text = paste(text, collapse = " "))
passages
```
Suppose we wanted to divide the passages based on which character's name is mentioned in each. Characters' names may differ in how they are presented, so we construct a regular expression for each and pair it with that character's name.
```{r characters}
characters <- readr::read_csv(
"character,character_regex
Elizabeth,Elizabeth
Darcy,Darcy
Mr. Bennet,Mr. Bennet
Mrs. Bennet,Mrs. Bennet
Jane,Jane
Mary,Mary
Lydia,Lydia
Kitty,Kitty
Wickham,Wickham
Mr. Collins,Collins
Lady Catherine de Bourgh,de Bourgh
Mr. Gardiner,Mr. Gardiner
Mrs. Gardiner,Mrs. Gardiner
Charlotte Lucas,(Charlotte|Lucas)
")
```
Notice that for each character, we've defined a regular expression (sometimes allowing ambiguity, sometimes not) for detecting their name. Suppose we want to "classify" passages based on whether this regex is present.
With fuzzyjoin's `regex_inner_join` function, we do:
```{r character_passages, dependson = c("passages", "characters")}
character_passages <- passages %>%
  regex_inner_join(characters, by = c(text = "character_regex"))
```
This combines the two data frames based on cases where the `passages$text` column is matched by the `characters$character_regex` column. (Note that the dataset with the text column must always come first.) This results in:
```{r dependson = "character_passages"}
character_passages %>%
  select(passage, character, text)
```
This shows that Mr. Bennet's name appears in passages 1, 2, 4, and 6, while Charlotte Lucas's appears in 3. Notice that having fuzzy-joined the datasets, some passages will end up duplicated (those with multiple names in them), while it's possible others will be missing entirely (those without names).
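The duplication and dropping of rows can be seen on a toy example. The `notes` and `patterns` tables below are hypothetical stand-ins for the passages and characters; the sketch contrasts `regex_inner_join` with `regex_left_join`, which keeps text rows that match no pattern.

```{r}
library(dplyr)
library(fuzzyjoin)

# Hypothetical snippets of text and the regexes to look for
notes <- tibble(id = 1:3,
                text = c("Darcy met Elizabeth",
                         "A quiet morning",
                         "Elizabeth wrote a letter"))
patterns <- tibble(character = c("Elizabeth", "Darcy"),
                   character_regex = c("Elizabeth", "Darcy"))

# Inner join: note 1 appears twice (two names), note 2 disappears (no names)
inner <- notes %>%
  regex_inner_join(patterns, by = c(text = "character_regex"))

# Left join: note 2 is kept, with NA in the pattern columns
left <- notes %>%
  regex_left_join(patterns, by = c(text = "character_regex"))

inner %>% count(id)
left %>% filter(is.na(character))
```

Whether the missing rows matter depends on the analysis; counting mentions per character works on the inner join, while measuring how much of the text mentions nobody needs the left join.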
We could ask which characters are mentioned in the most passages:
```{r dependson = "character_passages"}
character_passages %>%
  count(character, sort = TRUE)
```
The data is also well suited to discovering which characters appear in scenes together, and to clustering them to find groupings of characters (like in [this analysis](http://varianceexplained.org/r/love-actually-network/)).
```{r character_passages_matrix, dependson = "character_passages", fig.width = 6, fig.height = 6}
passage_character_matrix <- character_passages %>%
  group_by(passage) %>%
  filter(n() > 1) %>%
  reshape2::acast(character ~ passage, fun.aggregate = length, fill = 0)
passage_character_matrix <- passage_character_matrix / rowSums(passage_character_matrix)
h <- hclust(dist(passage_character_matrix, method = "manhattan"))
plot(h)
```
Other options for further analysis of this fuzzy-joined dataset include doing sentiment analysis on text surrounding each character's name, [similar to Julia Silge's analysis here](http://juliasilge.com/blog/You-Must-Allow-Me/).
### Future Work
A few things I'd like to work on:
* **Shortcuts on string distance matching**: If two strings differ in length by more than 1 character, the method is `osa`, and `max_dist` is 1, there is no need to compare them at all, since their distance must be greater than 1.
* **More examples**: I've used this package in other powerful ways, but on proprietary data. I'm interested in ideas for use cases that can be provided as vignettes.
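The first idea above could be sketched outside the package as a length-based pre-filter. This is only an illustration of the reasoning, not how fuzzyjoin is implemented; the word lists are made up, and the filter relies on the fact that an edit distance can never be smaller than the difference in string lengths.

```{r}
library(stringdist)

# All candidate pairs between two hypothetical word lists
pairs <- expand.grid(misspelling = c("teh", "recieve"),
                     word = c("the", "receive", "a", "thermometer"),
                     stringsAsFactors = FALSE)
max_dist <- 1

# Cheap lower bound: distance >= |nchar(a) - nchar(b)|
len_gap <- abs(nchar(pairs$misspelling) - nchar(pairs$word))
candidates <- pairs[len_gap <= max_dist, ]

# Only the surviving pairs need the more expensive comparison
candidates$distance <- stringdist(candidates$misspelling, candidates$word,
                                  method = "osa")
matches <- candidates[candidates$distance <= max_dist, ]
matches
```

Here the pre-filter discards six of the eight pairs before any string distance is computed, and the two that remain are the same matches a full comparison would find.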
### Code of Conduct
Please note that this project is released with a [Contributor Code of Conduct](https://github.com/dgrtwo/fuzzyjoin/blob/master/CONDUCT.md). By participating in this project you agree to abide by its terms.