vignette about mwe. closes #73
EmilHvitfeldt committed Jul 30, 2020
1 parent 32bd18e commit a883d38
Showing 2 changed files with 140 additions and 0 deletions.
1 change: 1 addition & 0 deletions .Rbuildignore
@@ -17,3 +17,4 @@
^\.github/workflows/R-CMD-check\.yaml$
^\.github/workflows/pr-commands\.yaml$
^CODE_OF_CONDUCT\.md$
^vignettes/articles$
139 changes: 139 additions & 0 deletions vignettes/Multi-word-expressions.Rmd
@@ -0,0 +1,139 @@
---
title: "Dealing with Multi-Word Expressions"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{dealing-with-multi-word-expressions}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```

```{r setup}
library(textrecipes)
```

textrecipes doesn't provide any steps that deal with multi-word expressions directly, but there are still ways you can handle them.
We start by creating a small dataset containing some common multi-word expressions.

```{r}
library(tibble)
example_data <- tibble(text = c("George Washington was a president of the United States",
                                "New York is a city in United States",
                                "Many city names are comprised of multiple words"))
example_data
```

## Compound multi-word expressions before tokenizing

One way to deal with this problem is to compound known multi-word expressions before tokenization so that the tokenizer doesn't split them. We can use `str_replace_all()` to do the compounding.

We start by creating a named vector of our known multi-word expressions. The names should be the expressions as they appear in the text, and the values should be the replacements.

```{r}
library(stringr)
mwe <- c("George Washington" = "George_Washington",
         "United States" = "United_States",
         "New York" = "New_York")
```
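
As a quick check, we can apply the replacement directly to the text column and confirm that the expressions are compounded before any recipe is involved.

```{r}
# Illustrative check: the known expressions are now joined with underscores,
# so a default tokenizer will keep each of them as a single token.
str_replace_all(example_data$text, mwe)
```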

Then we can use `str_replace_all()` inside `step_mutate()` to conduct the replacements.

```{r}
rec_spec1 <- recipe(~ text, data = example_data) %>%
  step_mutate(text = str_replace_all(text, mwe)) %>%
  step_tokenize(text) %>%
  step_tf(text)

rec_spec1 %>%
  prep() %>%
  juice() %>%
  names()
```

You need to be careful when you create the replacements. The following tokenizer splits on non-alphanumeric characters, which includes the underscore we just inserted, so the multi-word expressions end up being split after all.

```{r}
# Splits on every non-alphanumeric character, including the underscore
alpha_num_tokenizer <- function(x) {
  str_split(x, "[^[:alnum:]]")
}

rec_spec2 <- recipe(~ text, data = example_data) %>%
  step_mutate(text = str_replace_all(text, mwe)) %>%
  step_tokenize(text, custom_token = alpha_num_tokenizer) %>%
  step_tf(text)

rec_spec2 %>%
  prep() %>%
  juice() %>%
  names()
```

In that case, the simplest remedy is to update the `mwe` vector so that the replacements survive the tokenizer.

```{r}
mwe <- c("George Washington" = "GeorgeWashington",
         "United States" = "UnitedStates",
         "New York" = "NewYork")

rec_spec3 <- recipe(~ text, data = example_data) %>%
  step_mutate(text = str_replace_all(text, mwe)) %>%
  step_tokenize(text, custom_token = alpha_num_tokenizer) %>%
  step_tf(text)

rec_spec3 %>%
  prep() %>%
  juice() %>%
  names()
```
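
If you are worried about an expression accidentally matching inside a longer phrase (for example, "New York" inside "New Yorker"), you can anchor the patterns at word boundaries. The `mwe_boundary` vector below is only an illustrative variation; the recipes above do not require it.

```{r}
# Illustrative variation: wrap each expression in word boundaries so partial
# matches inside longer words or phrases are left alone. The names are still
# regular expression patterns.
mwe_boundary <- setNames(unname(mwe), paste0("\\b", names(mwe), "\\b"))
str_replace_all(example_data$text, mwe_boundary)
```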

## Create a custom tokenizer

Other packages offer methods to detect certain multi-word expressions. The [spacyr](http://spacyr.quanteda.io/) package, for example, offers a way to extract named entities.
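
Note that spacyr relies on a working spaCy installation and an initialized language model. If you have not set that up yet, something along these lines is typically needed first (not evaluated here; the model name is just an example).

```{r, eval=FALSE}
# One-time setup (assumption: the small English model is wanted)
spacy_install()
spacy_initialize(model = "en_core_web_sm")
```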

```{r}
library(spacyr)
spacy_tokens <- spacy_parse(example_data$text, entity = TRUE, lemma = FALSE)
spacy_tokens
```

The `entity_consolidate()` function can be used to combine entities that span multiple tokens into single tokens.

```{r}
entity_consolidate(spacy_tokens)
```

We can then create a function that works with `step_tokenize()`.

```{r}
spacy_entity <- function(x) {
  # Parse the text and collapse multi-token entities into single tokens
  tokens <- spacy_parse(x, entity = TRUE, lemma = FALSE)
  tokens <- entity_consolidate(tokens)

  # Split the tokens back into one character vector per document
  token_list <- split(tokens$token, tokens$doc_id)

  # spacy_parse() names the documents "text1", "text2", ...; strip the prefix
  # and use the remaining numbers to restore the original document order
  names(token_list) <- gsub("text", "", names(token_list))
  res <- unname(token_list[as.character(seq_along(x))])

  # Documents that produced no tokens are missing from the list;
  # fill them with empty character vectors
  empty <- lengths(res) == 0
  res[empty] <- lapply(seq_len(sum(empty)), function(x) character(0))
  res
}

spacy_entity(example_data$text)
```

We can now supply this custom tokenizer to be used seamlessly with `step_tokenize()`.

```{r}
rec_spec4 <- recipe(~ text, data = example_data) %>%
  step_tokenize(text, custom_token = spacy_entity) %>%
  step_tf(text)

rec_spec4 %>%
  prep() %>%
  juice() %>%
  names()
```
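
If you initialized spaCy yourself, you can optionally shut down the background process once you are done; this is just a cleanup sketch.

```{r, eval=FALSE}
# Optional cleanup: release the spaCy session started by spacy_initialize()
spacy_finalize()
```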
