Skip to content

Commit

Permalink
Initial commit 🎉
Browse files Browse the repository at this point in the history
  • Loading branch information
juliasilge committed Jun 13, 2019
0 parents commit e117ba8
Show file tree
Hide file tree
Showing 17 changed files with 683 additions and 0 deletions.
5 changes: 5 additions & 0 deletions .Rbuildignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
^.*\.Rproj$
^\.Rproj\.user$
^LICENSE\.md$
^README\.Rmd$
^CODE_OF_CONDUCT\.md$
4 changes: 4 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
.Rproj.user
.Rhistory
.RData
.Ruserdata
25 changes: 25 additions & 0 deletions CODE_OF_CONDUCT.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
# Contributor Code of Conduct

As contributors and maintainers of this project, we pledge to respect all people who
contribute through reporting issues, posting feature requests, updating documentation,
submitting pull requests or patches, and other activities.

We are committed to making participation in this project a harassment-free experience for
everyone, regardless of level of experience, gender, gender identity and expression,
sexual orientation, disability, personal appearance, body size, race, ethnicity, age, or religion.

Examples of unacceptable behavior by participants include the use of sexual language or
imagery, derogatory comments or personal attacks, trolling, public or private harassment,
insults, or other unprofessional conduct.

Project maintainers have the right and responsibility to remove, edit, or reject comments,
commits, code, wiki edits, issues, and other contributions that are not aligned to this
Code of Conduct. Project maintainers who do not follow the Code of Conduct may be removed
from the project team.

Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by
opening an issue or contacting one or more of the project maintainers.

This Code of Conduct is adapted from the Contributor Covenant
(https://www.contributor-covenant.org), version 1.0.0, available at
https://contributor-covenant.org/version/1/0/0/.
27 changes: 27 additions & 0 deletions DESCRIPTION
Original file line number Diff line number Diff line change
@@ -0,0 +1,27 @@
Package: tidylo
Type: Package
Title: Tidy Log Odds Ratio Weighted by Uninformative Prior
Version: 0.1.0.900
Authors@R: c(person("Tyler", "Schnoebelen", role = c("aut"),
email = "tylersemailhere"),
person("Julia", "Silge", role = c("aut", "cre"),
email = "[email protected]",
comment = c(ORCID = "0000-0002-3671-836X")))
Description: Calculate the log odds ratio, weighted by an uninformative
Dirichlet prior, using tidy data principles.
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: TRUE
URL: http://github.com/juliasilge/tidylo
BugReports: http://github.com/juliasilge/tidylo/issues
Imports: rlang,
dplyr
Suggests:
knitr,
rmarkdown,
tidytext,
janeaustenr,
ggplot2,
testthat (>= 2.1.0)
VignetteBuilder: knitr
RoxygenNote: 6.1.1
2 changes: 2 additions & 0 deletions LICENSE
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
YEAR: 2019
COPYRIGHT HOLDER: Julia Silge and Tyler Schnoebelen
21 changes: 21 additions & 0 deletions LICENSE.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
# MIT License

Copyright (c) 2019 Julia Silge and Tyler Schnoebelen

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
14 changes: 14 additions & 0 deletions NAMESPACE
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
# Generated by roxygen2: do not edit by hand

export(bind_log_odds)
importFrom(dplyr,count)
importFrom(dplyr,group_by)
importFrom(dplyr,group_vars)
importFrom(dplyr,left_join)
importFrom(dplyr,mutate)
importFrom(dplyr,rename)
importFrom(dplyr,ungroup)
importFrom(rlang,as_name)
importFrom(rlang,enquo)
importFrom(rlang,is_empty)
importFrom(rlang,sym)
77 changes: 77 additions & 0 deletions R/bind_log_odds.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,77 @@
#' Bind the weighted log odds to a tidy dataset
#'
#' Calculate and bind the log odds ratio, weighted by an uninformative Dirichlet
#' prior, of a tidy dataset to the dataset itself. The weighted log odds ratio
#' is added as a column. This functions supports non-standard evaluation through
#' the tidyeval framework.
#'
#' @param tbl A tidy dataset with one row per item and feature
#' @param item Column of items for identifying differences, such as words or
#' bigrams with text data
#' @param feature Column of features between which to compare items, such as
#' documents for text data
#' @param n Column containing item-feature counts
#'
#' @details The arguments \code{item}, \code{feature}, and \code{n}
#' are passed by expression and support \link[rlang]{quasiquotation};
#' you can unquote strings and symbols. Grouping is preserved but ignored.
#'
#'
#' The dataset must have exactly one row per document-term combination for
#' this calculation to succeed. Read Monroe, Colaresi, and Quinn (2017) for
#' more on the weighted log odds ratio.
#'
#' @source <https://doi.org/10.1093/pan/mpn018>
#'
#' @examples
#'
#' library(dplyr)
#'
#' gear_counts <- mtcars %>%
#' count(vs, gear)
#'
#' gear_counts
#'
#' # find the number of gears most characteristic of each engine shape `vs`
#' gear_counts %>%
#' bind_log_odds(gear, vs, n)
#'
#' @importFrom rlang enquo as_name is_empty sym
#' @importFrom dplyr count left_join mutate rename group_by ungroup group_vars
#' @export

bind_log_odds <- function(tbl, item, feature, n) {
item <- enquo(item)
feature <- enquo(feature)
n_col <- enquo(n)

## groups are preserved but ignored
grouping <- group_vars(tbl)
tbl <- ungroup(tbl)

freq1_df <- count(tbl, !!item, wt = !!n_col)
freq1_df <- rename(freq1_df, freq1 = n)

freq2_df <- count(tbl, !!feature, wt = !!n_col)
freq2_df <- rename(freq2_df, freq2 = n)

df_joined <- left_join(tbl, freq1_df, by = as_name(item))
df_joined <- mutate(df_joined, freqnotthem = freq1 - !!n_col)
df_joined <- mutate(df_joined, total = sum(!!n_col))
df_joined <- left_join(df_joined, freq2_df, by = as_name(feature))
df_joined <- mutate(df_joined,
freq2notthem = total - freq2,
l1them = (!!n_col + freq1) / ((total + freq2) - (!!n_col + freq1)),
l2notthem = (freqnotthem + freq1) / ((total + freq2notthem) - (freqnotthem + freq1)),
sigma2 = 1/(!!n_col + freq1) + 1/(freqnotthem + freq1),
log_odds = (log(l1them) - log(l2notthem)) / sqrt(sigma2))

tbl$log_odds <- df_joined$log_odds

if (!is_empty(grouping)) {
tbl <- group_by(tbl, !!sym(grouping))
}

tbl
}

4 changes: 4 additions & 0 deletions R/globals.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
globalVariables(c(
"freq1", "freq2", "freqnotthem", "freq2notthem", "l1them", "l2notthem",
"sigma2", "total"
))
96 changes: 96 additions & 0 deletions README.Rmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,96 @@

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
suppressPackageStartupMessages(library(ggplot2))
theme_set(theme_light())
```

# tidylo: Tidy Log Odds Ratio Weighted by Uninformative Prior

**Authors:** [Julia Silge](https://juliasilge.com/), [Tyler Schnoebelen](https://www.letslanguage.org/)<br/>
**License:** [MIT](https://opensource.org/licenses/MIT)


<!-- badges: start -->
<!-- badges: end -->

How can we measure how the usage or frequency of some **feature**, such as words, differs across some group or **set**, such as documents? One option is to use the log odds ratio, but the log odds ratio alone does not account for sampling variability; we haven't counted every feature the same number of times so how do we know which differences are meaningful?

Enter the **weighted log odds**, which tidylo provides an implementation for, using tidy data principles. In particular, here we use the method outlined in [Monroe, Colaresi, and Quinn (2017)](https://doi.org/10.1093/pan/mpn018) to weight the log odds ratio by an uninformative Dirichlet prior.

## Installation

~You can install the released version of tidylo from [CRAN](https://CRAN.R-project.org) with:~

```{r eval=FALSE}
install.packages("tidylo")
```


Or you can install the development version from GitHub with [remotes](https://github.com/r-lib/remotes):

```{r, eval=FALSE}
library(remotes)
install_github("juliasilge/tidylo")
```

## Example

Using weighted log odds is a great approach for text analysis when we want to measure how word usage differs across a set of documents. Let's explore the [six published, completed novels of Jane Austen](https://github.com/juliasilge/janeaustenr) and use the [tidytext](https://github.com/juliasilge/tidytext) package to count up the bigrams (sequences of two adjacent words) in each novel. This weighted log odds approach would work equally well for single words.

```{r}
library(dplyr)
library(janeaustenr)
library(tidytext)
tidy_bigrams <- austen_books() %>%
unnest_tokens(bigram, text, token="ngrams", n = 2)
bigram_counts <- tidy_bigrams %>%
count(book, bigram, sort = TRUE)
bigram_counts
```

Now let's use the `bind_log_odds()` function from the tidylo package to find the weighted log odds for each bigram. What are the highest log odds bigrams for these books?

```{r}
library(tidylo)
bigram_log_odds <- bigram_counts %>%
bind_log_odds(bigram, book, n)
bigram_log_odds %>%
arrange(-log_odds)
```

The highest log odds bigrams (bigrams more likely to come from each book, compared to the others) involve proper nouns. We can make a visualization as well.

```{r bigram_plot, fig.width=10, fig.height=7}
library(ggplot2)
bigram_log_odds %>%
group_by(book) %>%
top_n(10) %>%
ungroup %>%
mutate(bigram = reorder(bigram, log_odds)) %>%
ggplot(aes(bigram, log_odds, fill = book)) +
geom_col(show.legend = FALSE) +
facet_wrap(~book, scales = "free") +
coord_flip() +
labs(x = NULL)
```

### Community Guidelines

This project is released with a
[Contributor Code of Conduct](https://github.com/juliasilge/tidylo/blob/master/CONDUCT.md).
By contributing to this project, you agree to abide by its terms. Feedback, bug reports (and fixes!), and feature requests are welcome; file issues or seek support [here](http://github.com/juliasilge/tidylos/issues).

Loading

0 comments on commit e117ba8

Please sign in to comment.