-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME.Rmd
239 lines (176 loc) · 9.33 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
dpi = 92,
fig.retina = 2,
out.width = "100%"
)
```
# vocabular2
<!-- badges: start -->
[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://www.tidyverse.org/lifecycle/#experimental)
<!-- [![CRAN status](https://www.r-pkg.org/badges/version/vocabr)](https://CRAN.R-project.org/package=vocabr) -->
<!-- badges: end -->
The goal of vocabular2 is to compare vocabularies on a set of metrics.
There's currently no clear development path for the package. It may become usable in the future, but for now it's not adviced to use the code for your projects.
I haven't spent enough time thinking about the meaningfulness of the metrics to recommend them. They were simply intuitive to me at 4am on some exam-stressed winter night. It's also very possible that they are in the literature under different names. :)
## Installation
You can install the development version with:
``` r
devtools::install_github("ludvigolsen/vocabular2")
```
## Main functions
* `compare_vocabs()`
* `get_doc_metrics()`
* `stack_doc_metrics()`
## Simple Example
Note: By default, negative values are set to 0 for most of the metrics (not TD-IDF and TF-IRF).
See the metric formulas below the example.
### Attach packages
```{r message=FALSE, warning=FALSE}
library(vocabular2)
library(tm)
library(tidyverse)
library(knitr)
```
### Load the included 'hamlet' dataset
```{r}
# The included dataset with Hamlet lines
# Extracted from https://www.opensourceshakespeare.org/
hamlet %>% head(5)
# Collect the lines for each character
data <- hamlet %>%
dplyr::group_by(Character) %>%
dplyr::summarise(txt = paste0(Line, collapse = " "))
data
# Assign each text to a variable
# This could be done in a loop if we had a lot of texts
claudius <- data[1, "txt"][[1]]
gertrude <- data[2, "txt"][[1]]
hamlet <- data[3, "txt"][[1]] # note: overwrites the dataset
horatio <- data[4, "txt"][[1]]
ophelia <- data[5, "txt"][[1]]
```
### Count the terms
```{r warning=FALSE}
# Create a term-count tibble for each document
count_terms <- function(t){
docs <- Corpus(VectorSource(t))
# do things like removing stopwords, lemmatization, etc.
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removePunctuation, preserve_intra_word_dashes = TRUE)
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing=TRUE)
d <- tibble::tibble(Word = names(v), Count=v)
d
}
claudius_tc <- count_terms(claudius)
gertrude_tc <- count_terms(gertrude)
hamlet_tc <- count_terms(hamlet)
horatio_tc <- count_terms(horatio)
ophelia_tc <- count_terms(ophelia)
```
### Compare the vocabularies
This is where the metrics are calculated. We get a column per document with a nested tibble containing the metrics.
```{r}
scores <- compare_vocabs(tc_dfs = list("claudius" = claudius_tc,
"gertrude" = gertrude_tc,
"hamlet" = hamlet_tc,
"horatio" = horatio_tc,
"ophelia" = ophelia_tc))
scores
```
### Extract the metrics for Claudius
```{r}
get_doc_metrics(scores, "claudius") %>%
arrange(desc(REL_TF_NRTF)) %>%
head(10) %>%
kable()
```
### Extract the metrics for Gertrude
```{r}
get_doc_metrics(scores, "gertrude") %>%
arrange(desc(REL_TF_NRTF)) %>%
head(10) %>%
kable()
```
### Extract the metrics for Hamlet
```{r}
get_doc_metrics(scores, "hamlet") %>%
arrange(desc(REL_TF_NRTF)) %>%
head(10) %>%
kable()
```
### Extract the metrics for Horatio
```{r}
get_doc_metrics(scores, "horatio") %>%
arrange(desc(REL_TF_NRTF)) %>%
head(10) %>%
kable()
```
### Extract the metrics for Ophelia
```{r}
get_doc_metrics(scores, "ophelia") %>%
arrange(desc(REL_TF_NRTF)) %>%
head(10) %>%
kable()
```
### Extract and stack metrics for all documents
```{r}
stack_doc_metrics(scores)
```
## Metrics
### TF-IDF and TF-IRF (Term Frequency - Inverse Rest Frequency)
These are highly correlated (>0.999).
<!-- We will only see the equations in GitHub.
Get the url at
https://www.codecogs.com/latex/eqneditor.php-->
<!--$$ tf(t,d)=\frac{f_{t,d}}{\sum_{t'}^{d}f_{t',d}} $$-->
![equation](https://latex.codecogs.com/svg.latex?%5Cdpi%7B300%7D%20%5Cfn_cm%20tf%28t%2Cd%29%3D%5Cfrac%7Bf_%7Bt%2Cd%7D%7D%7B%5Csum_%7Bt%27%7D%5E%7Bd%7Df_%7Bt%27%2Cd%7D%7D)
<!--$$ idf(t,D)=\log{\frac{|D|}{1+|\{d \in D:t \in d\}|}} $$-->
![equation](https://latex.codecogs.com/svg.latex?%5Cdpi%7B300%7D%20%5Cfn_cm%20idf%28t%2CD%29%3D%5Clog%7B%5Cfrac%7B%7CD%7C%7D%7B1+%7C%7Bd%20%5Cin%20D%3At%20%5Cin%20d%7D%7C%7D%7D)
<!--$$ irf(t,d,D)=\log{\frac{|D|-1}{1+|\{d \in D:t \in d \land d' \not = d \}|}} $$-->
![equation](https://latex.codecogs.com/svg.latex?%5Cdpi%7B300%7D%20%5Cfn_cm%20irf%28t%2Cd%2CD%29%3D%5Clog%7B%5Cfrac%7B%7CD%7C-1%7D%7B1+%7C%5C%7Bd%20%5Cin%20D%3At%20%5Cin%20d%20%5Cland%20d%27%20%5Cnot%20%3D%20d%20%5C%7D%7C%7D%7D)
<!--$$ tfidf(t,d,D) = tf(t,d) \cdot idf(t,D) $$-->
![equation](https://latex.codecogs.com/svg.latex?%5Cdpi%7B300%7D%20%5Cfn_cm%20tfidf%28t%2Cd%2CD%29%20%3D%20tf%28t%2Cd%29%20%5Ccdot%20idf%28t%2CD%29)
<!--$$ tfirf(t,d,D) = tf(t,d) \cdot irf(t,d,D) $$-->
![equation](https://latex.codecogs.com/svg.latex?%5Cdpi%7B300%7D%20%5Cfn_cm%20tfirf%28t%2Cd%2CD%29%20%3D%20tf%28t%2Cd%29%20%5Ccdot%20irf%28t%2Cd%2CD%29)
### TF-RTF (Term Frequency - Rest Term Frequency)
TF-RTF is positive when the term frequency is higher in the current document than the sum of the term frequencies in the rest of the corpus.
<!--$$ rtf(t,d,D) = \sum_{d' \not = d}^{D}tf(t,d') $$-->
![equation](https://latex.codecogs.com/svg.latex?%5Cdpi%7B300%7D%20%5Cfn_cm%20rtf%28t%2Cd%2CD%29%20%3D%20%5Csum_%7Bd%27%20%5Cnot%20%3D%20d%7D%5E%7BD%7Dtf%28t%2Cd%27%29)
<!--$$ tfrtf(t,d,D) = tf(t,d)-rtf(t,d,D) $$-->
![equation](https://latex.codecogs.com/svg.latex?%5Cdpi%7B300%7D%20%5Cfn_cm%20tfrtf%28t%2Cd%2CD%29%20%3D%20tf%28t%2Cd%29-rtf%28t%2Cd%2CD%29)
### TF-NRTF (Term Frequency - Normalized Rest Term Frequency)
As our selected TF function ensures that frequencies add up to 1 document-wise, the NRTF (Normalized Rest Term Frequency) is simply the average term frequency in the other documents, instead of the sum as in RTF.
TF-NRTF is positive when the term frequency is higher in the current document than the average term frequency in the rest of the corpus.
<!--$$ nrtf(t,d,D) = \frac{rtf(t,d,D)}{|D|-1} $$-->
![equation](https://latex.codecogs.com/svg.latex?%5Cdpi%7B300%7D%20%5Cfn_cm%20nrtf%28t%2Cd%2CD%29%20%3D%20%5Cfrac%7Brtf%28t%2Cd%2CD%29%7D%7B%7CD%7C-1%7D)
<!--$$ tfnrtf(t,d,D) = tf(t,d)-nrtf(t,d,D) $$-->
![equation](https://latex.codecogs.com/svg.latex?%5Cdpi%7B300%7D%20%5Cfn_cm%20tfnrtf%28t%2Cd%2CD%29%20%3D%20tf%28t%2Cd%29-nrtf%28t%2Cd%2CD%29)
### TF-MRTF (Term Frequency - Maximum Rest Term Frequency)
Instead of the normalized/average rest term frequency, we instead use the maximum rest term frequency.
TF-MRTF is positive when the term frequency is higher in the current document than the maximum term frequency in the rest of the corpus.
<!--$$ Mrtf(t,d,D) = \max{\{tf(t,d'):d' \in D \land d' \not = d\}} $$-->
![equation](https://latex.codecogs.com/svg.latex?%5Cdpi%7B300%7D%20%5Cfn_cm%20Mrtf%28t%2Cd%2CD%29%20%3D%20%5Cmax%7B%5C%7Btf%28t%2Cd%27%29%3Ad%27%20%5Cin%20D%20%5Cland%20d%27%20%5Cnot%20%3D%20d%5C%7D%7D)
<!--$$ tfMrtf(t,d,D) = tf(t,d)-Mrtf(t,d,D) $$-->
![equation](https://latex.codecogs.com/svg.latex?%5Cdpi%7B300%7D%20%5Cfn_cm%20tfMrtf%28t%2Cd%2CD%29%20%3D%20tf%28t%2Cd%29-Mrtf%28t%2Cd%2CD%29)
### Relative TF-NRTF (Relative Term Frequency - Normalized Rest Term Frequency)
Where the TF-NRTF tend to be dominated by highly frequent words, the Relative TF-NRTF instead uses the relative distance to the NRTF. As that would likely be dominated by very infrequent words, we multiply it by the term frequency.
<!--$$ \epsilon(t,d,D) = \frac{1}{\sum_{d' \not = d}^{D}f_{t,d'}} $$-->
![equation](https://latex.codecogs.com/svg.latex?%5Cdpi%7B300%7D%20%5Cfn_cm%20%5Cepsilon%28t%2Cd%2CD%29%20%3D%20%5Cfrac%7B1%7D%7B%5Csum_%7Bd%27%20%5Cnot%20%3D%20d%7D%5E%7BD%7Df_%7Bt%2Cd%27%7D%7D)
<!--$$ rel\_tfnrtf(t,d,D) = tf(t,d)^{\beta}\frac{tfnrtf(t,d,D)}{\log(1 + nrtf(t,d,D) + \epsilon(t,d,D))} $$-->
![equation](https://latex.codecogs.com/svg.latex?%5Cdpi%7B300%7D%20%5Cfn_cm%20rel%5C_tfnrtf%28t%2Cd%2CD%29%20%3D%20tf%28t%2Cd%29%5E%7B%5Cbeta%7D%5Cfrac%7Btfnrtf%28t%2Cd%2CD%29%7D%7B%5Clog%281%20+%20nrtf%28t%2Cd%2CD%29%20+%20%5Cepsilon%28t%2Cd%2CD%29%29%7D)
Epsilon (ε) is added to avoid zero-division. It is calculated to resemble +1 smoothing in the rest population.
The beta (β) exponentiator allows us to control the influence of the term frequency. By setting it to 0, we simply get the relative difference (log scaled).
### Relative TF-MRTF (Relative Term Frequency - Maximum Rest Term Frequency)
Similar to Relative TF-NRTF but for MRTF instead.
<!--$$ rel\_tfMrtf(t,d,D) = tf(t,d)^{\beta}\frac{tfMrtf(t,d,D)}{\log(1 + Mrtf(t,d,D) + \epsilon(t,d,D))} $$-->
![equation](https://latex.codecogs.com/svg.latex?%5Cdpi%7B300%7D%20%5Cfn_cm%20rel%5C_tfMrtf%28t%2Cd%2CD%29%20%3D%20tf%28t%2Cd%29%5E%7B%5Cbeta%7D%5Cfrac%7BtfMrtf%28t%2Cd%2CD%29%7D%7B%5Clog%281%20+%20Mrtf%28t%2Cd%2CD%29%20+%20%5Cepsilon%28t%2Cd%2CD%29%29%7D)