diff --git a/.Rbuildignore b/.Rbuildignore new file mode 100644 index 0000000..9f1f8b7 --- /dev/null +++ b/.Rbuildignore @@ -0,0 +1,5 @@ +^.*\.Rproj$ +^\.Rproj\.user$ +^LICENSE\.md$ +^README\.Rmd$ +^CODE_OF_CONDUCT\.md$ diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..5b6a065 --- /dev/null +++ b/.gitignore @@ -0,0 +1,4 @@ +.Rproj.user +.Rhistory +.RData +.Ruserdata diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md new file mode 100644 index 0000000..0343527 --- /dev/null +++ b/CODE_OF_CONDUCT.md @@ -0,0 +1,25 @@ +# Contributor Code of Conduct + +As contributors and maintainers of this project, we pledge to respect all people who +contribute through reporting issues, posting feature requests, updating documentation, +submitting pull requests or patches, and other activities. + +We are committed to making participation in this project a harassment-free experience for +everyone, regardless of level of experience, gender, gender identity and expression, +sexual orientation, disability, personal appearance, body size, race, ethnicity, age, or religion. + +Examples of unacceptable behavior by participants include the use of sexual language or +imagery, derogatory comments or personal attacks, trolling, public or private harassment, +insults, or other unprofessional conduct. + +Project maintainers have the right and responsibility to remove, edit, or reject comments, +commits, code, wiki edits, issues, and other contributions that are not aligned to this +Code of Conduct. Project maintainers who do not follow the Code of Conduct may be removed +from the project team. + +Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by +opening an issue or contacting one or more of the project maintainers. + +This Code of Conduct is adapted from the Contributor Covenant +(https://www.contributor-covenant.org), version 1.0.0, available at +https://contributor-covenant.org/version/1/0/0/. 
diff --git a/DESCRIPTION b/DESCRIPTION new file mode 100644 index 0000000..077a20e --- /dev/null +++ b/DESCRIPTION @@ -0,0 +1,27 @@ +Package: tidylo +Type: Package +Title: Tidy Log Odds Ratio Weighted by Uninformative Prior +Version: 0.1.0.900 +Authors@R: c(person("Tyler", "Schnoebelen", role = c("aut"), + email = "tylersemailhere"), + person("Julia", "Silge", role = c("aut", "cre"), + email = "julia.silge@gmail.com", + comment = c(ORCID = "0000-0002-3671-836X"))) +Description: Calculate the log odds ratio, weighted by an uninformative + Dirichlet prior, using tidy data principles. +License: MIT + file LICENSE +Encoding: UTF-8 +LazyData: TRUE +URL: http://github.com/juliasilge/tidylo +BugReports: http://github.com/juliasilge/tidylo/issues +Imports: rlang, + dplyr +Suggests: + knitr, + rmarkdown, + tidytext, + janeaustenr, + ggplot2, + testthat (>= 2.1.0) +VignetteBuilder: knitr +RoxygenNote: 6.1.1 diff --git a/LICENSE b/LICENSE new file mode 100644 index 0000000..90fb6f5 --- /dev/null +++ b/LICENSE @@ -0,0 +1,2 @@ +YEAR: 2019 +COPYRIGHT HOLDER: Julia Silge and Tyler Schnoebelen diff --git a/LICENSE.md b/LICENSE.md new file mode 100644 index 0000000..3d55dec --- /dev/null +++ b/LICENSE.md @@ -0,0 +1,21 @@ +# MIT License + +Copyright (c) 2019 Julia Silge and Tyler Schnoebelen + +Permission is hereby granted, free of charge, to any person obtaining a copy +of this software and associated documentation files (the "Software"), to deal +in the Software without restriction, including without limitation the rights +to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +copies of the Software, and to permit persons to whom the Software is +furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all +copies or substantial portions of the Software. 
+ +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +SOFTWARE. diff --git a/NAMESPACE b/NAMESPACE new file mode 100644 index 0000000..a568b08 --- /dev/null +++ b/NAMESPACE @@ -0,0 +1,14 @@ +# Generated by roxygen2: do not edit by hand + +export(bind_log_odds) +importFrom(dplyr,count) +importFrom(dplyr,group_by) +importFrom(dplyr,group_vars) +importFrom(dplyr,left_join) +importFrom(dplyr,mutate) +importFrom(dplyr,rename) +importFrom(dplyr,ungroup) +importFrom(rlang,as_name) +importFrom(rlang,enquo) +importFrom(rlang,is_empty) +importFrom(rlang,sym) diff --git a/R/bind_log_odds.R b/R/bind_log_odds.R new file mode 100644 index 0000000..687800d --- /dev/null +++ b/R/bind_log_odds.R @@ -0,0 +1,77 @@ +#' Bind the weighted log odds to a tidy dataset +#' +#' Calculate and bind the log odds ratio, weighted by an uninformative Dirichlet +#' prior, of a tidy dataset to the dataset itself. The weighted log odds ratio +#' is added as a column. This function supports non-standard evaluation through +#' the tidyeval framework. +#' +#' @param tbl A tidy dataset with one row per item and feature +#' @param item Column of items for identifying differences, such as words or +#' bigrams with text data +#' @param feature Column of features between which to compare items, such as +#' documents for text data +#' @param n Column containing item-feature counts +#' +#' @details The arguments \code{item}, \code{feature}, and \code{n} +#' are passed by expression and support \link[rlang]{quasiquotation}; +#' you can unquote strings and symbols. Grouping is preserved but ignored. 
+#' +#' +#' The dataset must have exactly one row per item-feature combination for +#' this calculation to succeed. Read Monroe, Colaresi, and Quinn (2008) for +#' more on the weighted log odds ratio. +#' +#' @source \url{https://doi.org/10.1093/pan/mpn018} +#' +#' @examples +#' +#' library(dplyr) +#' +#' gear_counts <- mtcars %>% +#' count(vs, gear) +#' +#' gear_counts +#' +#' # find the number of gears most characteristic of each engine shape `vs` +#' gear_counts %>% +#' bind_log_odds(gear, vs, n) +#' +#' @importFrom rlang enquo as_name is_empty sym +#' @importFrom dplyr count left_join mutate rename group_by ungroup group_vars +#' @export + +bind_log_odds <- function(tbl, item, feature, n) { + item <- enquo(item) + feature <- enquo(feature) + n_col <- enquo(n) + + ## groups are preserved but ignored + grouping <- group_vars(tbl) + tbl <- ungroup(tbl) + + freq1_df <- count(tbl, !!item, wt = !!n_col) + freq1_df <- rename(freq1_df, freq1 = n) + + freq2_df <- count(tbl, !!feature, wt = !!n_col) + freq2_df <- rename(freq2_df, freq2 = n) + + df_joined <- left_join(tbl, freq1_df, by = as_name(item)) + df_joined <- mutate(df_joined, freqnotthem = freq1 - !!n_col) + df_joined <- mutate(df_joined, total = sum(!!n_col)) + df_joined <- left_join(df_joined, freq2_df, by = as_name(feature)) + df_joined <- mutate(df_joined, + freq2notthem = total - freq2, + l1them = (!!n_col + freq1) / ((total + freq2) - (!!n_col + freq1)), + l2notthem = (freqnotthem + freq1) / ((total + freq2notthem) - (freqnotthem + freq1)), + sigma2 = 1/(!!n_col + freq1) + 1/(freqnotthem + freq1), + log_odds = (log(l1them) - log(l2notthem)) / sqrt(sigma2)) + + tbl$log_odds <- df_joined$log_odds + + if (!is_empty(grouping)) { + ## restore all original grouping variables, not just the first + tbl <- group_by(tbl, !!!rlang::syms(grouping)) + } + + tbl +} + diff --git a/R/globals.R b/R/globals.R new file mode 100644 index 0000000..49be6f9 --- /dev/null +++ b/R/globals.R @@ -0,0 +1,4 @@ +globalVariables(c( + "freq1", "freq2", "freqnotthem", "freq2notthem", "l1them", "l2notthem", + "sigma2", "total" +)) diff --git 
a/README.Rmd b/README.Rmd new file mode 100644 index 0000000..2dc5e0b --- /dev/null +++ b/README.Rmd @@ -0,0 +1,96 @@ + + + +```{r, include = FALSE} +knitr::opts_chunk$set( + collapse = TRUE, + comment = "#>", + fig.path = "man/figures/README-", + out.width = "100%" +) +suppressPackageStartupMessages(library(ggplot2)) +theme_set(theme_light()) +``` + +# tidylo: Tidy Log Odds Ratio Weighted by Uninformative Prior + +**Authors:** [Julia Silge](https://juliasilge.com/), [Tyler Schnoebelen](https://www.letslanguage.org/)
+**License:** [MIT](https://opensource.org/licenses/MIT) + + + + +How can we measure how the usage or frequency of some **feature**, such as words, differs across some group or **set**, such as documents? One option is to use the log odds ratio, but the log odds ratio alone does not account for sampling variability; we haven't counted every feature the same number of times, so how do we know which differences are meaningful? + +Enter the **weighted log odds**, which the tidylo package implements using tidy data principles. In particular, here we use the method outlined in [Monroe, Colaresi, and Quinn (2008)](https://doi.org/10.1093/pan/mpn018) to weight the log odds ratio by an uninformative Dirichlet prior. + +## Installation + +~~You can install the released version of tidylo from [CRAN](https://CRAN.R-project.org) with:~~ + +```{r eval=FALSE} +install.packages("tidylo") +``` + + +Or you can install the development version from GitHub with [remotes](https://github.com/r-lib/remotes): + +```{r, eval=FALSE} +library(remotes) +install_github("juliasilge/tidylo") +``` + +## Example + +Using weighted log odds is a great approach for text analysis when we want to measure how word usage differs across a set of documents. Let's explore the [six published, completed novels of Jane Austen](https://github.com/juliasilge/janeaustenr) and use the [tidytext](https://github.com/juliasilge/tidytext) package to count up the bigrams (sequences of two adjacent words) in each novel. This weighted log odds approach would work equally well for single words. + +```{r} +library(dplyr) +library(janeaustenr) +library(tidytext) + +tidy_bigrams <- austen_books() %>% + unnest_tokens(bigram, text, token = "ngrams", n = 2) + +bigram_counts <- tidy_bigrams %>% + count(book, bigram, sort = TRUE) + +bigram_counts +``` + +Now let's use the `bind_log_odds()` function from the tidylo package to find the weighted log odds for each bigram. What are the highest log odds bigrams for these books? 
+ +```{r} +library(tidylo) + +bigram_log_odds <- bigram_counts %>% + bind_log_odds(bigram, book, n) + +bigram_log_odds %>% + arrange(-log_odds) +``` + +The highest log odds bigrams (bigrams more likely to come from each book, compared to the others) involve proper nouns. We can make a visualization as well. + +```{r bigram_plot, fig.width=10, fig.height=7} +library(ggplot2) + +bigram_log_odds %>% + group_by(book) %>% + top_n(10) %>% + ungroup %>% + mutate(bigram = reorder(bigram, log_odds)) %>% + ggplot(aes(bigram, log_odds, fill = book)) + + geom_col(show.legend = FALSE) + + facet_wrap(~book, scales = "free") + + coord_flip() + + labs(x = NULL) +``` + +### Community Guidelines + +This project is released with a +[Contributor Code of Conduct](https://github.com/juliasilge/tidylo/blob/master/CODE_OF_CONDUCT.md). +By contributing to this project, you agree to abide by its terms. Feedback, bug reports (and fixes!), and feature requests are welcome; file issues or seek support [here](http://github.com/juliasilge/tidylo/issues). + diff --git a/README.md b/README.md new file mode 100644 index 0000000..865cf6d --- /dev/null +++ b/README.md @@ -0,0 +1,123 @@ + + + + + +# tidylo: Tidy Log Odds Ratio Weighted by Uninformative Prior + +**Authors:** [Julia Silge](https://juliasilge.com/), [Tyler Schnoebelen](https://www.letslanguage.org/)
+**License:** [MIT](https://opensource.org/licenses/MIT) + + + + +How can we measure how the usage or frequency of some **feature**, such as words, differs across some group or **set**, such as documents? One option is to use the log odds ratio, but the log odds ratio alone does not account for sampling variability; we haven't counted every feature the same number of times, so how do we know which differences are meaningful? + +Enter the **weighted log odds**, which the tidylo package implements using tidy data principles. In particular, here we use the method outlined in [Monroe, Colaresi, and Quinn (2008)](https://doi.org/10.1093/pan/mpn018) to weight the log odds ratio by an uninformative Dirichlet prior. + +## Installation + +~~You can install the released version of tidylo from [CRAN](https://CRAN.R-project.org) with:~~ + + +```r +install.packages("tidylo") +``` + + +Or you can install the development version from GitHub with [remotes](https://github.com/r-lib/remotes): + + +```r +library(remotes) +install_github("juliasilge/tidylo") +``` + +## Example + +Using weighted log odds is a great approach for text analysis when we want to measure how word usage differs across a set of documents. Let's explore the [six published, completed novels of Jane Austen](https://github.com/juliasilge/janeaustenr) and use the [tidytext](https://github.com/juliasilge/tidytext) package to count up the bigrams (sequences of two adjacent words) in each novel. This weighted log odds approach would work equally well for single words. 
+ + +```r +library(dplyr) +library(janeaustenr) +library(tidytext) + +tidy_bigrams <- austen_books() %>% + unnest_tokens(bigram, text, token = "ngrams", n = 2) + +bigram_counts <- tidy_bigrams %>% + count(book, bigram, sort = TRUE) + +bigram_counts +#> # A tibble: 328,495 x 3 +#> book bigram n +#> <fct> <chr> <int> +#> 1 Mansfield Park of the 748 +#> 2 Mansfield Park to be 643 +#> 3 Emma to be 607 +#> 4 Mansfield Park in the 578 +#> 5 Emma of the 566 +#> 6 Pride & Prejudice of the 464 +#> 7 Emma it was 448 +#> 8 Emma in the 446 +#> 9 Pride & Prejudice to be 443 +#> 10 Sense & Sensibility to be 436 +#> # … with 328,485 more rows +``` + +Now let's use the `bind_log_odds()` function from the tidylo package to find the weighted log odds for each bigram. What are the highest log odds bigrams for these books? + + +```r +library(tidylo) + +bigram_log_odds <- bigram_counts %>% + bind_log_odds(bigram, book, n) + +bigram_log_odds %>% + arrange(-log_odds) +#> # A tibble: 328,495 x 4 +#> book bigram n log_odds +#> <fct> <chr> <int> <dbl> +#> 1 Mansfield Park sir thomas 287 14.8 +#> 2 Pride & Prejudice mr darcy 243 14.5 +#> 3 Emma mr knightley 269 14.3 +#> 4 Sense & Sensibility mrs jennings 199 13.2 +#> 5 Emma mrs weston 229 13.2 +#> 6 Persuasion captain wentworth 170 13.0 +#> 7 Mansfield Park miss crawford 215 12.8 +#> 8 Persuasion mr elliot 147 12.1 +#> 9 Emma mr elton 190 12.0 +#> 10 Mansfield Park mr crawford 162 11.1 +#> # … with 328,485 more rows +``` + +The highest log odds bigrams (bigrams more likely to come from each book, compared to the others) involve proper nouns. We can make a visualization as well. 
+ + +```r +library(ggplot2) + +bigram_log_odds %>% + group_by(book) %>% + top_n(10) %>% + ungroup %>% + mutate(bigram = reorder(bigram, log_odds)) %>% + ggplot(aes(bigram, log_odds, fill = book)) + + geom_col(show.legend = FALSE) + + facet_wrap(~book, scales = "free") + + coord_flip() + + labs(x = NULL) +#> Selecting by log_odds +``` + +![plot of chunk bigram_plot](man/figures/README-bigram_plot-1.png) + +### Community Guidelines + +This project is released with a +[Contributor Code of Conduct](https://github.com/juliasilge/tidylo/blob/master/CODE_OF_CONDUCT.md). +By contributing to this project, you agree to abide by its terms. Feedback, bug reports (and fixes!), and feature requests are welcome; file issues or seek support [here](http://github.com/juliasilge/tidylo/issues). + diff --git a/man/bind_log_odds.Rd b/man/bind_log_odds.Rd new file mode 100644 index 0000000..7a9e12c --- /dev/null +++ b/man/bind_log_odds.Rd @@ -0,0 +1,52 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/bind_log_odds.R +\name{bind_log_odds} +\alias{bind_log_odds} +\title{Bind the weighted log odds to a tidy dataset} +\source{ +\url{https://doi.org/10.1093/pan/mpn018} +} +\usage{ +bind_log_odds(tbl, item, feature, n) +} +\arguments{ +\item{tbl}{A tidy dataset with one row per item and feature} + +\item{item}{Column of items for identifying differences, such as words or +bigrams with text data} + +\item{feature}{Column of features between which to compare items, such as +documents for text data} + +\item{n}{Column containing item-feature counts} +} +\description{ +Calculate and bind the log odds ratio, weighted by an uninformative Dirichlet +prior, of a tidy dataset to the dataset itself. The weighted log odds ratio +is added as a column. This function supports non-standard evaluation through +the tidyeval framework. +} +\details{ +The arguments \code{item}, \code{feature}, and \code{n} +are passed by expression and support \link[rlang]{quasiquotation}; +you can unquote strings and symbols. Grouping is preserved but ignored. 
+ + +The dataset must have exactly one row per item-feature combination for +this calculation to succeed. Read Monroe, Colaresi, and Quinn (2008) for +more on the weighted log odds ratio. +} +\examples{ + +library(dplyr) + +gear_counts <- mtcars \%>\% + count(vs, gear) + +gear_counts + +# find the number of gears most characteristic of each engine shape `vs` +gear_counts \%>\% + bind_log_odds(gear, vs, n) + +} diff --git a/man/figures/README-bigram_plot-1.png b/man/figures/README-bigram_plot-1.png new file mode 100644 index 0000000..20f023e Binary files /dev/null and b/man/figures/README-bigram_plot-1.png differ diff --git a/tests/testthat.R b/tests/testthat.R new file mode 100644 index 0000000..6f29050 --- /dev/null +++ b/tests/testthat.R @@ -0,0 +1,4 @@ +library(testthat) +library(tidylo) + +test_check("tidylo") diff --git a/tests/testthat/test-bind-log-odds.R b/tests/testthat/test-bind-log-odds.R new file mode 100644 index 0000000..b8f05d6 --- /dev/null +++ b/tests/testthat/test-bind-log-odds.R @@ -0,0 +1,102 @@ +context("Weighted log odds calculation") + +suppressPackageStartupMessages(library(dplyr)) + +w <- tibble( + document = rep(1:2, each = 5), + word = c( + "the", "quick", "brown", "fox", "jumped", + "over", "the", "lazy", "brown", "dog" + ), + frequency = c( + 1, 1, 1, 1, 2, + 1, 1, 1, 1, 2 + ) +) + +test_that("Can calculate weighted log odds", { + result <- w %>% + bind_log_odds(word, document, frequency) + + expect_equal( + select(w, document, word, frequency), + select(result, document, word, frequency) + ) + + expect_is(result, "tbl_df") + expect_is(result$log_odds, "numeric") + expect_equal(sum(result$log_odds[c(2, 4:6, 8, 10)] > 0), 6) + + # preserves but ignores groups + result2 <- w %>% + group_by(document) %>% + bind_log_odds(word, document, frequency) + + expect_equal(length(groups(result2)), 1) + expect_equal(as.character(groups(result2)[[1]]), "document") +}) + + +test_that("Weighted log odds works when the feature is a number", { + z <- 
dplyr::tibble( + id = rep(c(2, 3), each = 3), + word = c("an", "interesting", "text", "a", "boring", "text"), + n = c(1, 1, 3, 1, 2, 1) + ) + + result <- bind_log_odds(z, word, id, n) + expect_false(any(is.na(result))) + expect_equal(sum(result$log_odds[1:5] > 0), 5) + expect_lt(result$log_odds[6], 0) +}) + + +test_that("Weighted log odds with tidyeval works", { + + w <- tibble( + document = rep(1:2, each = 5), + word = c( + "the", "quick", "brown", "fox", "jumped", + "over", "the", "lazy", "brown", "dog" + ), + frequency = c( + 1, 1, 1, 1, 2, + 1, 1, 1, 1, 2 + ) + ) + termvar <- quo(word) + documentvar <- quo(document) + countvar <- quo(frequency) + + result <- w %>% + bind_log_odds(!!termvar, !!documentvar, !!countvar) + + termvar <- sym("word") + documentvar <- sym("document") + countvar <- sym("frequency") + + result2 <- w %>% + bind_log_odds(!!termvar, !!documentvar, !!countvar) + + + expect_equal( + select(w, document, word, frequency), + select(result, document, word, frequency) + ) + + expect_equal( + select(w, document, word, frequency), + select(result2, document, word, frequency) + ) + + expect_is(result, "tbl_df") + expect_is(result$log_odds, "numeric") + expect_equal(sum(result$log_odds[c(2, 4:6, 8, 10)] > 0), 6) + + result3 <- w %>% + group_by(document) %>% + bind_log_odds(!!termvar, !!documentvar, !!countvar) + + expect_equal(length(groups(result3)), 1) + expect_equal(as.character(groups(result3)[[1]]), "document") +}) diff --git a/tidylo.Rproj b/tidylo.Rproj new file mode 100644 index 0000000..a648ce1 --- /dev/null +++ b/tidylo.Rproj @@ -0,0 +1,20 @@ +Version: 1.0 + +RestoreWorkspace: Default +SaveWorkspace: Default +AlwaysSaveHistory: Default + +EnableCodeIndexing: Yes +UseSpacesForTab: Yes +NumSpacesForTab: 4 +Encoding: UTF-8 + +RnwWeave: Sweave +LaTeX: pdfLaTeX + +AutoAppendNewline: Yes +StripTrailingWhitespace: Yes + +BuildType: Package +PackageUseDevtools: Yes +PackageInstallArgs: --no-multiarch --with-keep.source diff --git 
a/vignettes/tidy_log_odds.Rmd b/vignettes/tidy_log_odds.Rmd new file mode 100644 index 0000000..d53d2d5 --- /dev/null +++ b/vignettes/tidy_log_odds.Rmd @@ -0,0 +1,107 @@ +--- +title: "Tidy Log Odds" +author: "Julia Silge and Tyler Schnoebelen" +date: "`r Sys.Date()`" +output: rmarkdown::html_vignette +vignette: > + %\VignetteIndexEntry{Tidy Log Odds} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + +```{r, include = FALSE} +knitr::opts_chunk$set( + warning = FALSE, message = FALSE, + collapse = TRUE, + comment = "#>") + +suppressPackageStartupMessages(library(ggplot2)) +theme_set(theme_light()) +``` + +## A motivating example: what words are important to a text? + +There are multiple ways to measure which words (or bigrams, or other units of text) are important in a text. You can count words, or [measure tf-idf](https://www.tidytextmining.com/tfidf.html). This package implements a different approach for measuring which words are important to a text, a **weighted log odds**. + +A log odds ratio is a way of expressing probabilities, and we can weight a log odds ratio so that it handles words and documents with very different counts more gracefully. In particular, we use the method outlined in [Monroe, Colaresi, and Quinn (2008)](https://doi.org/10.1093/pan/mpn018) to weight the log odds ratio by an uninformative Dirichlet prior. + +What does this mean? It means that by weighting in this way, we take into account the sampling error in our measurements and acknowledge that we are more certain when we've counted something a lot of times and less certain when we've counted something only a few times. When weighting by a prior in this way, we focus on differences that are more likely to be real, given the evidence that we have. + +Let's look at just such an example. 
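For reference, the weighting described above can be written down compactly. The following is our own sketch of the estimator from Monroe, Colaresi, and Quinn (paraphrased notation, not the paper's exact presentation): for item $w$ with count $y_{iw}$ in group $i$, group total $n_i$, and Dirichlet prior parameters $\alpha_w$ summing to $\alpha_0$, the weighted log odds compares group $i$ against another group $j$:

```latex
% Prior-smoothed log odds ratio of item w, group i versus group j
\hat{\omega}_{iw}
  = \log \frac{y_{iw} + \alpha_w}{n_i + \alpha_0 - y_{iw} - \alpha_w}
  - \log \frac{y_{jw} + \alpha_w}{n_j + \alpha_0 - y_{jw} - \alpha_w}

% Approximate variance of that estimate
\sigma^2\bigl(\hat{\omega}_{iw}\bigr)
  \approx \frac{1}{y_{iw} + \alpha_w} + \frac{1}{y_{jw} + \alpha_w}

% The reported weighted log odds is the z-score
\zeta_{iw} = \frac{\hat{\omega}_{iw}}{\sqrt{\sigma^2\bigl(\hat{\omega}_{iw}\bigr)}}
```

Dividing by the standard deviation is what downweights rarely counted items: a large raw log odds ratio based on only a few counts has a large variance, so its $\zeta$ score shrinks toward zero.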
+ +## Jane Austen and bigrams + +Let's explore the [six published, completed novels of Jane Austen](https://github.com/juliasilge/janeaustenr) and use the [tidytext](https://github.com/juliasilge/tidytext) package to count up the bigrams (sequences of two adjacent words) in each novel. This weighted log odds approach would work equally well for single words. + +```{r bigram_counts} +library(dplyr) +library(janeaustenr) +library(tidytext) + +tidy_bigrams <- austen_books() %>% + unnest_tokens(bigram, text, token = "ngrams", n = 2) + +bigram_counts <- tidy_bigrams %>% + count(book, bigram, sort = TRUE) + +bigram_counts +``` + +Notice that we haven't removed stop words, or filtered out rarely used words. We have done very little pre-processing of this text data. + +Now let's use the `bind_log_odds()` function from the tidylo package to find the weighted log odds for each bigram. What are the highest log odds bigrams for these books? + +```{r bigram_log_odds, dependson="bigram_counts"} +library(tidylo) + +bigram_log_odds <- bigram_counts %>% + bind_log_odds(bigram, book, n) + +bigram_log_odds %>% + arrange(-log_odds) +``` + +The highest log odds bigrams (bigrams more likely to come from each book, compared to the others) involve proper nouns. We can make a visualization as well. + +```{r bigram_plot, dependson="bigram_log_odds", fig.width=10, fig.height=7} +library(ggplot2) + +bigram_log_odds %>% + group_by(book) %>% + top_n(10) %>% + ungroup %>% + mutate(bigram = reorder(bigram, log_odds)) %>% + ggplot(aes(bigram, log_odds, fill = book)) + + geom_col(show.legend = FALSE) + + facet_wrap(~book, scales = "free") + + coord_flip() + + labs(x = NULL) +``` + +These bigrams have the highest log odds for each book. + +Why might you choose weighted log odds over tf-idf? 
TODO for Tyler + +## Counting things other than words + +Text analysis is a main motivator for this implementation of weighted log odds, but this is a general approach for measuring how much more likely one item (any kind of item, not just a word or bigram) is than another to be associated with some set of features (any kind of feature, not just a document or book). + +To demonstrate this, let's look at everybody's favorite data about cars. What do we know about the relationship between number of gears and engine shape `vs`? + +```{r gear_counts} +gear_counts <- mtcars %>% + count(vs, gear) + +gear_counts +``` + +Now we can use `bind_log_odds()` to find the weighted log odds ratio for each number of gears and engine shape. + +```{r dependson="gear_counts"} +gear_counts %>% + bind_log_odds(gear, vs, n) +``` + +For engine shape `vs = 0`, having three gears has the highest log odds, while for engine shape `vs = 1`, having four gears has the highest log odds. This dataset is small enough that you can look at the count data and see how this is working. + +More importantly, this approach is useful not only in the motivating example of text data but also more generally, whenever you have counts within some kind of groups and you want to find which items are most characteristic of each group, compared to the other groups.
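As a final illustration, the arithmetic that `bind_log_odds()` performs can be reproduced without any tidyverse dependencies. Here is a minimal base-R sketch of the same steps, using the same toy counts as the package's unit tests (the data frame and variable names are ours, for illustration only):

```r
# Toy item-feature counts: two features, three items each
df <- data.frame(
  feature = c("A", "A", "A", "B", "B", "B"),
  item    = c("an", "interesting", "text", "a", "boring", "text"),
  n       = c(1, 1, 3, 1, 2, 1)
)

# Marginal totals, which bind_log_odds() also uses as the prior counts
freq1 <- ave(df$n, df$item, FUN = sum)    # total count of each item
freq2 <- ave(df$n, df$feature, FUN = sum) # total count within each feature
total <- sum(df$n)                        # grand total

freqnotthem  <- freq1 - df$n   # this item's count outside this feature
freq2notthem <- total - freq2  # all counts outside this feature

# Prior-smoothed odds for "this feature" and for "all other features"
l1them    <- (df$n + freq1) / ((total + freq2) - (df$n + freq1))
l2notthem <- (freqnotthem + freq1) /
  ((total + freq2notthem) - (freqnotthem + freq1))

# Variance of the log odds ratio, then the z-scored weighted log odds
sigma2 <- 1 / (df$n + freq1) + 1 / (freqnotthem + freq1)
df$log_odds <- (log(l1them) - log(l2notthem)) / sqrt(sigma2)

df
```

The signs come out as the unit tests expect: the first five rows get positive scores, and the final row, where "text" appears only once, gets a negative score.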