
step_tfidf outputs sparsed data #180

Closed
PursuitOfDataScience opened this issue May 2, 2022 · 5 comments
@PursuitOfDataScience

I used textrecipes to make predictions on tweets.

When step_tfidf() or step_tf() is used and the recipe is baked, the generated dataset is extremely sparse, with tons of zeros, which breaks some algorithms such as naive_Bayes() and svm_linear(). I am wondering if textrecipes offers something that doesn't give me such a result? Thanks.

retweets_rec <- recipe(retweet ~ full_text,
                      data = retweets_train) %>%
  step_tokenize(full_text) %>%
  step_stopwords(full_text) %>%
  step_tokenfilter(full_text, max_tokens = 100) %>%
  step_tfidf(full_text) 

retweets_rec %>%
  prep() %>%
  bake(new_data = NULL) 
@EmilHvitfeldt
Member

Hello @PursuitOfDataScience !

If you want the calculations to stay the same, you could use a blueprint to make {recipes} output a sparse matrix. We have an example of how to do that here: https://smltar.com/appendixbaseline.html. The main thing to look out for is to make sure that the model engine can handle sparse data.
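A minimal sketch of that blueprint approach, assuming the {hardhat}, {workflows}, and {parsnip} packages are available; `retweets_rec` and `retweets_train` come from the example above, and the choice of svm_linear() is illustrative.

```r
library(hardhat)
library(workflows)

# Ask the recipe to hand the model a sparse dgCMatrix instead of a dense tibble
sparse_bp <- default_recipe_blueprint(composition = "dgCMatrix")

retweets_wf <- workflow() %>%
  add_recipe(retweets_rec, blueprint = sparse_bp) %>%
  add_model(parsnip::svm_linear(mode = "classification"))

# fit(retweets_wf, data = retweets_train)
```

The tf-idf values are unchanged; only the in-memory representation handed to the engine differs, so engines with sparse support avoid materializing all those zeros.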

@PursuitOfDataScience
Author

Thanks for the prompt feedback. It seems that even when an algorithm can handle the sparsity, the many zeros produced by step_tfidf() make the classification difficult. It would be nice to have each row carry more distinct values, yet it seems to me that neither tf-idf nor tf can achieve that, though it still depends on the provided text data. Is there any other function related to step_tf() or step_tfidf() that you can think of? Maybe something like weighted log-odds.

@EmilHvitfeldt
Member

You are going to have a hard time no matter what you do when working with tweets, since such short texts provide little information. The number of zeros shouldn't make a difference here. You could try to use word embeddings. There have been requests for weighted log-odds (#140); the main thing I haven't figured out yet is whether the trained log-odds contain enough information to let us reapply the transformation to new data, which would be required in a recipe step.
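The word-embeddings route can be sketched as below; `glove_embeddings` is a hypothetical tibble of pre-trained vectors (a `tokens` column followed by numeric dimension columns), e.g. obtained from a source such as textdata::embedding_glove6b().

```r
library(textrecipes)

retweets_emb_rec <- recipe(retweet ~ full_text, data = retweets_train) %>%
  step_tokenize(full_text) %>%
  step_stopwords(full_text) %>%
  step_word_embeddings(full_text, embeddings = glove_embeddings)

# Each tweet is reduced to a dense, fixed-length vector by aggregating
# its tokens' embedding vectors, so the zero-heavy tf-idf output goes away.
```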

No matter what you do, you most likely don't have much information, and I would encourage you to look into the preprocessing of the text: how the tokenization is performed, how punctuation, emojis, etc. are handled.
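For illustration, a sketch of controlling those choices in step_tokenize(), assuming the default "tokenizers" engine; `lowercase` and `strip_punct` are passed through to tokenizers::tokenize_words(), and `token = "tweets"` switches to the tweet-aware tokenizer that preserves @mentions and #hashtags.

```r
library(textrecipes)

rec <- recipe(retweet ~ full_text, data = retweets_train) %>%
  step_tokenize(
    full_text,
    token   = "tweets",
    options = list(lowercase = TRUE, strip_punct = FALSE)
  )
```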

@PursuitOfDataScience
Author

Thanks!

@github-actions

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators May 18, 2022