Initial commit 🎉

juliasilge · Jun 13, 2019 · e117ba8 · e117ba8
commit e117ba8
Show file tree

Hide file tree

Showing 17 changed files with 683 additions and 0 deletions.
diff --git a/.Rbuildignore b/.Rbuildignore
@@ -0,0 +1,5 @@
+^.*\.Rproj$
+^\.Rproj\.user$
+^LICENSE\.md$
+^README\.Rmd$
+^CODE_OF_CONDUCT\.md$
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,4 @@
+.Rproj.user
+.Rhistory
+.RData
+.Ruserdata
diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md
@@ -0,0 +1,25 @@
+# Contributor Code of Conduct
+
+As contributors and maintainers of this project, we pledge to respect all people who 
+contribute through reporting issues, posting feature requests, updating documentation,
+submitting pull requests or patches, and other activities.
+
+We are committed to making participation in this project a harassment-free experience for
+everyone, regardless of level of experience, gender, gender identity and expression,
+sexual orientation, disability, personal appearance, body size, race, ethnicity, age, or religion.
+
+Examples of unacceptable behavior by participants include the use of sexual language or
+imagery, derogatory comments or personal attacks, trolling, public or private harassment,
+insults, or other unprofessional conduct.
+
+Project maintainers have the right and responsibility to remove, edit, or reject comments,
+commits, code, wiki edits, issues, and other contributions that are not aligned to this 
+Code of Conduct. Project maintainers who do not follow the Code of Conduct may be removed 
+from the project team.
+
+Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by 
+opening an issue or contacting one or more of the project maintainers.
+
+This Code of Conduct is adapted from the Contributor Covenant 
+(https://www.contributor-covenant.org), version 1.0.0, available at 
+https://contributor-covenant.org/version/1/0/0/.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -0,0 +1,27 @@
+Package: tidylo
+Type: Package
+Title: Tidy Log Odds Ratio Weighted by Uninformative Prior
+Version: 0.1.0.900
+Authors@R: c(person("Tyler", "Schnoebelen", role = c("aut"),
+             email = "tylersemailhere"),
+             person("Julia", "Silge", role = c("aut", "cre"), 
+             email = "[email protected]", 
+             comment = c(ORCID = "0000-0002-3671-836X")))
+Description: Calculate the log odds ratio, weighted by an uninformative 
+    Dirichlet prior, using tidy data principles.
+License: MIT + file LICENSE
+Encoding: UTF-8
+LazyData: TRUE
+URL: http://github.com/juliasilge/tidylo
+BugReports: http://github.com/juliasilge/tidylo/issues
+Imports: rlang,
+    dplyr
+Suggests: 
+    knitr,
+    rmarkdown,
+    tidytext,
+    janeaustenr,
+    ggplot2,
+    testthat (>= 2.1.0)
+VignetteBuilder: knitr
+RoxygenNote: 6.1.1
diff --git a/LICENSE b/LICENSE
@@ -0,0 +1,2 @@
+YEAR: 2019
+COPYRIGHT HOLDER: Julia Silge and Tyler Schnoebelen
diff --git a/LICENSE.md b/LICENSE.md
@@ -0,0 +1,21 @@
+# MIT License
+
+Copyright (c) 2019 Julia Silge and Tyler Schnoebelen
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/NAMESPACE b/NAMESPACE
@@ -0,0 +1,14 @@
+# Generated by roxygen2: do not edit by hand
+
+export(bind_log_odds)
+importFrom(dplyr,count)
+importFrom(dplyr,group_by)
+importFrom(dplyr,group_vars)
+importFrom(dplyr,left_join)
+importFrom(dplyr,mutate)
+importFrom(dplyr,rename)
+importFrom(dplyr,ungroup)
+importFrom(rlang,as_name)
+importFrom(rlang,enquo)
+importFrom(rlang,is_empty)
+importFrom(rlang,sym)
diff --git a/R/bind_log_odds.R b/R/bind_log_odds.R
@@ -0,0 +1,77 @@
+#' Bind the weighted log odds to a tidy dataset
+#'
+#' Calculate and bind the log odds ratio, weighted by an uninformative Dirichlet
+#' prior, of a tidy dataset to the dataset itself. The weighted log odds ratio
+#' is added as a column. This functions supports non-standard evaluation through
+#' the tidyeval framework.
+#'
+#' @param tbl A tidy dataset with one row per item and feature
+#' @param item Column of items for identifying differences, such as words or
+#' bigrams with text data
+#' @param feature Column of features between which to compare items, such as
+#' documents for text data
+#' @param n Column containing item-feature counts
+#'
+#' @details The arguments \code{item}, \code{feature}, and \code{n}
+#' are passed by expression and support \link[rlang]{quasiquotation};
+#' you can unquote strings and symbols. Grouping is preserved but ignored.
+#'
+#'
+#' The dataset must have exactly one row per document-term combination for
+#' this calculation to succeed. Read Monroe, Colaresi, and Quinn (2017) for
+#' more on the weighted log odds ratio.
+#'
+#' @source <https://doi.org/10.1093/pan/mpn018>
+#'
+#' @examples
+#'
+#' library(dplyr)
+#'
+#' gear_counts <- mtcars %>%
+#'   count(vs, gear)
+#'
+#' gear_counts
+#'
+#' # find the number of gears most characteristic of each engine shape `vs`
+#' gear_counts %>%
+#'   bind_log_odds(gear, vs, n)
+#'
+#' @importFrom rlang enquo as_name is_empty sym
+#' @importFrom dplyr count left_join mutate rename group_by ungroup group_vars
+#' @export
+
+bind_log_odds <- function(tbl, item, feature, n) {
+    item <- enquo(item)
+    feature <- enquo(feature)
+    n_col <- enquo(n)
+
+    ## groups are preserved but ignored
+    grouping <- group_vars(tbl)
+    tbl <- ungroup(tbl)
+
+    freq1_df <- count(tbl, !!item, wt = !!n_col)
+    freq1_df <- rename(freq1_df, freq1 = n)
+
+    freq2_df <- count(tbl, !!feature, wt = !!n_col)
+    freq2_df <- rename(freq2_df, freq2 = n)
+
+    df_joined <- left_join(tbl, freq1_df, by = as_name(item))
+    df_joined <- mutate(df_joined, freqnotthem = freq1 - !!n_col)
+    df_joined <- mutate(df_joined, total = sum(!!n_col))
+    df_joined <- left_join(df_joined, freq2_df, by = as_name(feature))
+    df_joined <- mutate(df_joined,
+                        freq2notthem = total - freq2,
+                        l1them = (!!n_col + freq1) / ((total + freq2) - (!!n_col + freq1)),
+                        l2notthem = (freqnotthem + freq1) / ((total + freq2notthem) - (freqnotthem + freq1)),
+                        sigma2 = 1/(!!n_col + freq1) + 1/(freqnotthem + freq1),
+                        log_odds = (log(l1them) - log(l2notthem)) / sqrt(sigma2))
+
+    tbl$log_odds <- df_joined$log_odds
+
+    if (!is_empty(grouping))  {
+        tbl <- group_by(tbl, !!sym(grouping))
+        }
+
+    tbl
+}
+
diff --git a/R/globals.R b/R/globals.R
@@ -0,0 +1,4 @@
+globalVariables(c(
+    "freq1", "freq2", "freqnotthem", "freq2notthem", "l1them", "l2notthem",
+    "sigma2", "total"
+))
diff --git a/README.Rmd b/README.Rmd
@@ -0,0 +1,96 @@
+
+<!-- README.md is generated from README.Rmd. Please edit that file -->
+
+```{r, include = FALSE}
+knitr::opts_chunk$set(
+  collapse = TRUE,
+  comment = "#>",
+  fig.path = "man/figures/README-",
+  out.width = "100%"
+)
+suppressPackageStartupMessages(library(ggplot2))
+theme_set(theme_light())
+```
+
+# tidylo: Tidy Log Odds Ratio Weighted by Uninformative Prior
+
+**Authors:** [Julia Silge](https://juliasilge.com/), [Tyler Schnoebelen](https://www.letslanguage.org/)<br/>
+**License:** [MIT](https://opensource.org/licenses/MIT)
+
+
+<!-- badges: start -->
+<!-- badges: end -->
+
+How can we measure how the usage or frequency of some **feature**, such as words, differs across some group or **set**, such as documents? One option is to use the log odds ratio, but the log odds ratio alone does not account for sampling variability; we haven't counted every feature the same number of times so how do we know which differences are meaningful? 
+
+Enter the **weighted log odds**, which tidylo provides an implementation for, using tidy data principles. In particular, here we use the method outlined in [Monroe, Colaresi, and Quinn (2017)](https://doi.org/10.1093/pan/mpn018) to weight the log odds ratio by an uninformative Dirichlet prior.
+
+## Installation
+
+~You can install the released version of tidylo from [CRAN](https://CRAN.R-project.org) with:~
+
+```{r eval=FALSE}
+install.packages("tidylo")
+```
+
+
+Or you can install the development version from GitHub with [remotes](https://github.com/r-lib/remotes):
+
+```{r, eval=FALSE}
+library(remotes)
+install_github("juliasilge/tidylo")
+```
+
+## Example
+
+Using weighted log odds is a great approach for text analysis when we want to measure how word usage differs across a set of documents. Let's explore the [six published, completed novels of Jane Austen](https://github.com/juliasilge/janeaustenr) and use the [tidytext](https://github.com/juliasilge/tidytext) package to count up the bigrams (sequences of two adjacent words) in each novel. This weighted log odds approach would work equally well for single words.
+
+```{r}
+library(dplyr)
+library(janeaustenr)
+library(tidytext)
+
+tidy_bigrams <- austen_books() %>%
+     unnest_tokens(bigram, text, token="ngrams", n = 2)
+
+bigram_counts <- tidy_bigrams %>%
+     count(book, bigram, sort = TRUE)
+
+bigram_counts
+```
+
+Now let's use the `bind_log_odds()` function from the tidylo package to find the weighted log odds for each bigram. What are the highest log odds bigrams for these books?
+
+```{r}
+library(tidylo)
+
+bigram_log_odds <- bigram_counts %>%
+  bind_log_odds(bigram, book, n) 
+
+bigram_log_odds %>%
+  arrange(-log_odds)
+```
+
+The highest log odds bigrams (bigrams more likely to come from each book, compared to the others) involve proper nouns. We can make a visualization as well.
+
+```{r bigram_plot, fig.width=10, fig.height=7}
+library(ggplot2)
+
+bigram_log_odds %>%
+  group_by(book) %>%
+  top_n(10) %>%
+  ungroup %>%
+  mutate(bigram = reorder(bigram, log_odds)) %>%
+  ggplot(aes(bigram, log_odds, fill = book)) +
+  geom_col(show.legend = FALSE) +
+  facet_wrap(~book, scales = "free") +
+  coord_flip() +
+  labs(x = NULL)
+```
+
+### Community Guidelines
+
+This project is released with a
+[Contributor Code of Conduct](https://github.com/juliasilge/tidylo/blob/master/CONDUCT.md).
+By contributing to this project, you agree to abide by its terms. Feedback, bug reports (and fixes!), and feature requests are welcome; file issues or seek support [here](http://github.com/juliasilge/tidylos/issues).
+
Original file line number	Diff line number	Diff line change
		@@ -0,0 +1,2 @@
		YEAR: 2019
		COPYRIGHT HOLDER: Julia Silge and Tyler Schnoebelen