step_tfidf outputs sparsed data #180
Comments
Hello @PursuitOfDataScience! If you want the calculations to stay the same, you could use a blueprint to make {recipes} output a sparse matrix. We have an example of how to do that here: https://smltar.com/appendixbaseline.html. The main thing to look out for is that the model engine can handle sparse data.
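For readers who want to see roughly what the blueprint approach looks like, here is a minimal sketch. It assumes the hardhat, recipes, and textrecipes packages are installed; the `tweets` data frame and its `text` and `label` columns are made up for illustration and are not from the original post.

```r
library(recipes)
library(textrecipes)
library(hardhat)

# Hypothetical toy data standing in for the real tweets
tweets <- data.frame(
  text = c(
    "good morning everyone",
    "bad traffic this morning",
    "good coffee",
    "bad coffee"
  ),
  label = factor(c("pos", "neg", "pos", "neg"))
)

# Same kind of preprocessing as in the question: tokenize, then tf-idf
rec <- recipe(label ~ text, data = tweets) %>%
  step_tokenize(text) %>%
  step_tfidf(text)

# Blueprint asking for the processed predictors as a sparse matrix
sparse_bp <- default_recipe_blueprint(composition = "dgCMatrix")

# mold() preps the recipe and applies it using the blueprint
processed <- mold(rec, tweets, blueprint = sparse_bp)

# Inspect the class of the processed predictors
class(processed$predictors)
```

In a workflow, the same blueprint can be passed as `add_recipe(rec, blueprint = sparse_bp)`. This only helps if the chosen model engine accepts sparse input.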
Thanks for the prompt feedback. It seems like even some algorithms can handle the sparsity, a lot of zeros resulted from
You are going to have a hard time no matter what you do when working with tweets, since the short texts provide little information. The number of zeroes shouldn't make a difference here. You could try to use word embeddings. There have been requests for #140; the main thing I haven't figured out yet is whether the trained log-odds contain information that allows us to reapply the transformation, which would be required in a recipe step. No matter what you do, you most likely don't have much information, and I would encourage you to look into the preprocessing of the text: how the tokenization is being performed, and how punctuation, emojis, etc. are handled.
Thanks!
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.
I used textrecipes to make predictions on tweets. When step_tfidf() or step_tf() is used and baked, the generated dataset is very sparse, with tons of zeros, which breaks some algorithms like naive_Bayes() and svm_linear(). I am wondering if textrecipes could offer something that doesn't give me such a result? Thanks.