Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add step_glove() #102

Open
EmilHvitfeldt opened this issue Oct 13, 2020 · 5 comments
Open

Add step_glove() #102

EmilHvitfeldt opened this issue Oct 13, 2020 · 5 comments
Labels
feature a feature request or enhancement new steps

Comments

@EmilHvitfeldt
Copy link
Member

use glove() from https://github.com/EmilHvitfeldt/wordsalad

@juliasilge juliasilge added the feature a feature request or enhancement label Nov 13, 2020
@jonthegeek
Copy link
Contributor

Combining {wordsalad} and step_word_embeddings should theoretically make this (and #103 and #104) pretty straightforward to implement! If someone works on this, let me know if you need any help! I'll try to dig into it soon if nobody beats me into it.

@jonthegeek
Copy link
Contributor

@EmilHvitfeldt The slightly hairy part of this is that ... is already used for selectors, so we'll need to explicitly include all of the wordsalad::glove() arguments. I'm picturing those after the dots, but before the other generic step_ arguments. Sound good? If so, I'll knock these out.

@EmilHvitfeldt
Copy link
Member Author

Yes I kinda took a break at these issues because the total amount of parameters is going to be huge since glove is 8 and then we need a couple to handle the aggregation.

I'm thinking the best way forward might be to do something like

step_glove <- function(recipe,
                       ...,
                       role = "predictor",
                       trained = FALSE,
                       columns = NULL,
                       tokenizer = text2vec::space_tokenizer,
                       dim = 10L,
                       window = 5L,
                       min_count = 5L,
                       n_iter = 10L,
                       x_max = 10L,
                       stopwords = character(),
                       aggregation = c("sum", "mean", "min", "max"),
                       aggregation_default = 0,
                       prefix = "glove",
                       skip = FALSE,
                       id = rand_id("glove")) {

But on the other hand, it would be nice if we could eliminate the tokenizer and stopwords and have step_glove receive a tokenlist.

It appears that spacing isn't going to be an issue when stopwords are removed.

library(wordsalad)

set.seed(1)
x <- glove(fairy_tales, x_max = 5, stopwords = "hello")

set.seed(1)

hello_fairy_tales <- stringr::str_replace_all(fairy_tales, " ", " hello ")
x_hello <- glove(hello_fairy_tales, x_max = 5, stopwords = "hello")

identical(x, x_hello)
#> [1] TRUE

Created on 2021-02-05 by the reprex package (v0.3.0)

If you feel up for it, I think having step_glove() receive a tokenlist would be the option we have. You would need to turn the tokenlist back into strings with a non-space separator (properly use \t) to allow and then have that being fed into wordsalad::glove()

@jonthegeek
Copy link
Contributor

That makes sense to me. This is what I'm looking at for the arguments, then:

step_glove <- function(recipe,
                       ...,
                       role = "predictor",
                       trained = FALSE,
                       columns = NULL,
                       dim = 10L,
                       window = 5L,
                       min_count = 5L,
                       n_iter = 10L,
                       x_max = 10L,
                       aggregation = c("sum", "mean", "min", "max"),
                       aggregation_default = 0,
                       prefix = "glove_embed",
                       skip = FALSE,
                       id = rand_id("glove")) {

I went with "glove_embed" for the prefix so we could (for example) easily select all "_embed" predictors. I don't have a specific case where that would be necessary, but I can imagine maybe it being a thing.

Now I need to walk through the logic and figure out which functions/methods we need to write especially for this. I think we can just have the main step_glove function, which would do the glove work then call step_word_embeddings_new, I think. Does that sound right? I haven't written a step in over a year now I think, and I get a little lost in the process.

@jonthegeek
Copy link
Contributor

Er wait, nope, because the selector hasn't done its job until deeper in, so we won't know what to call glove with. So I think it will share some unexported functions with step_word_embeddings, but it won't call step_word_embeddings() directly. Ok, I'm getting there...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement new steps
Projects
None yet
Development

No branches or pull requests

3 participants