
step_tfidf outputs sparsed data #180

Closed
PursuitOfDataScience opened this issue May 2, 2022 · 5 comments
@PursuitOfDataScience

I used textrecipes to make predictions on tweets.

When step_tfidf() or step_tf() is used and the recipe is baked, the generated dataset is extremely sparse, with tons of zeros, which breaks some algorithms such as naive_Bayes() and svm_linear(). I am wondering if textrecipes offers something that doesn't give me such a result? Thanks.

retweets_rec <- recipe(retweet ~ full_text,
                      data = retweets_train) %>%
  step_tokenize(full_text) %>%
  step_stopwords(full_text) %>%
  step_tokenfilter(full_text, max_tokens = 100) %>%
  step_tfidf(full_text) 

retweets_rec %>%
  prep() %>%
  bake(new_data = NULL) 
@EmilHvitfeldt
Member

Hello @PursuitOfDataScience !

If you want the calculations to stay the same, you could use a blueprint to make {recipes} output a sparse matrix. We have an example of how to do that here: https://smltar.com/appendixbaseline.html. The main thing to look out for is to make sure that the model engine can handle sparse data.
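A minimal sketch of that blueprint approach, assuming the {hardhat}, {workflows}, and {parsnip} packages are available; `retweets_rec` and `retweets_train` come from the example above, and the choice of svm_linear() is illustrative.

```r
library(hardhat)
library(workflows)

# Ask the recipe to hand the model a sparse dgCMatrix instead of a dense tibble
sparse_bp <- default_recipe_blueprint(composition = "dgCMatrix")

retweets_wf <- workflow() %>%
  add_recipe(retweets_rec, blueprint = sparse_bp) %>%
  add_model(parsnip::svm_linear(mode = "classification"))

# fit(retweets_wf, data = retweets_train)
```

The tf-idf values are unchanged; only the in-memory representation handed to the engine differs, so engines with sparse support avoid materializing all those zeros.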

@PursuitOfDataScience
Author

Thanks for the prompt feedback. It seems that even when an algorithm can handle the sparsity, the many zeros produced by step_tfidf() make the classification difficult. It would be nice to have each row carry more distinct values, yet it seems to me that neither tf-idf nor tf can achieve that, though it still depends on the provided text data. Is there any other function related to step_tf() or step_tfidf() that you can think of? Maybe something like weighted log-odds.

@EmilHvitfeldt
Member

You are going to have a hard time no matter what you do when working with tweets, since such short texts provide little information. The number of zeros shouldn't make a difference here. You could try to use word embeddings. There have been requests for weighted log-odds (#140); the main thing I haven't figured out yet is whether the trained log-odds contain enough information to let us reapply the transformation to new data, which would be required in a recipe step.
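The word-embeddings route can be sketched as below; `glove_embeddings` is a hypothetical tibble of pre-trained vectors (a `tokens` column followed by numeric dimension columns), e.g. obtained from a source such as textdata::embedding_glove6b().

```r
library(textrecipes)

retweets_emb_rec <- recipe(retweet ~ full_text, data = retweets_train) %>%
  step_tokenize(full_text) %>%
  step_stopwords(full_text) %>%
  step_word_embeddings(full_text, embeddings = glove_embeddings)

# Each tweet is reduced to a dense, fixed-length vector by aggregating
# its tokens' embedding vectors, so the zero-heavy tf-idf output goes away.
```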

No matter what you do, you most likely don't have much information, and I would encourage you to look into the preprocessing of the text: how the tokenization is performed, how punctuation, emojis, etc. are handled.
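For illustration, a sketch of controlling those choices in step_tokenize(), assuming the default "tokenizers" engine; `lowercase` and `strip_punct` are passed through to tokenizers::tokenize_words(), and `token = "tweets"` switches to the tweet-aware tokenizer that preserves @mentions and #hashtags.

```r
library(textrecipes)

rec <- recipe(retweet ~ full_text, data = retweets_train) %>%
  step_tokenize(
    full_text,
    token   = "tweets",
    options = list(lowercase = TRUE, strip_punct = FALSE)
  )
```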

@PursuitOfDataScience
Author

Thanks!

@github-actions

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators May 18, 2022