func handle_numbers in /lib/metric/utils.py not used #4
Hello, both punctuation and numbers are not handled externally; the model just learns that those should remain the same, i.e. the model learns to copy them from the source.
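For illustration only: if an external post-processing step along the lines of handle_numbers were actually wired into the pipeline, it would amount to something like the sketch below, which just copies number-like source tokens into the prediction. The names (copy_numbers, NUMBER_RE) are placeholders rather than the repo's code; since the model already learns to copy numbers, such a step would be a no-op in practice.

```python
import re

# Placeholder pattern for "number-like" tokens; not taken from the repo.
NUMBER_RE = re.compile(r"^\d+([.,:]\d+)*$")

def copy_numbers(source_tokens, predicted_tokens):
    """Force number tokens to be copied verbatim from source to prediction.

    Because the seq2seq model already learns to copy numbers, running a step
    like this after decoding changes nothing, which is why it is not needed.
    """
    return [src if NUMBER_RE.match(src) else pred
            for src, pred in zip(source_tokens, predicted_tokens)]
```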
OK, thank you. I have one more question: I don't see the F1 of the hybrid model; instead there is only the F1 of the word and spelling models that I previously trained. This is the file hybrid_model/output: https://drive.google.com/file/d/1j3IXphSMfJAJBPhsRQRhxO1aZtLw8HtR/view?usp=sharing Thank you very much!
The second (word) evaluation is actually using the spelling model. Just to mention, the spelling model is only applied to UNK words, and only when it has high confidence, so you might first want to check whether you have UNKs in your dataset. If so, these lines are the precision, recall and F1 of the hybrid model (see the metric sketch below):
If it is the same as the basic word model, then make sure that the code passes through the function handle_unk and that the secondary model is actually used. You can also take a look at the hybrid Twitter model output log here.
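For reference, word-level precision, recall and F1 in lexical normalization are usually computed over the tokens that were changed, roughly as in the hedged sketch below. This is not the repo's evaluator, just the standard definition of the metric.

```python
def normalization_prf(sources, golds, predictions):
    """Precision/recall/F1 over normalized tokens.

    sources, golds, predictions: lists of token lists, aligned one-to-one.
    A token counts as correct when the system changed it and matches gold.
    """
    tp = fp = fn = 0
    for src, gold, pred in zip(sources, golds, predictions):
        for s, g, p in zip(src, gold, pred):
            if p != s:                 # system proposed a normalization
                if p == g:
                    tp += 1
                else:
                    fp += 1
            if g != s and p != g:      # gold normalization missed or wrong
                fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1
```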
So why, when I look in the paper, are the accuracy values of the hybrid and word models different? Also, since I know numbers and punctuation are limitations in NLP, can you give me more detail on why this model can handle such cases? Thank you very much.
First of all, make sure that the code passes through the function handle_unk and that the secondary model is actually used. I think there are two reasons you may not be using the spelling model: either there are no UNK words in your dataset, or the spelling model's confidence is never high enough, so its predictions are not used.
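As an illustration of that fallback logic, here is a hedged sketch; it is not the actual handle_unk in the repo, and the UNK symbol, spelling_model.predict API and threshold are assumed placeholders.

```python
UNK = "<unk>"  # assumed UNK symbol; the repo may use a different one

def handle_unk(source_tokens, word_predictions, spelling_model, threshold=0.5):
    """Replace UNK outputs of the word model with the character-level
    (spelling) model's prediction when it is confident enough; otherwise
    copy the source token unchanged."""
    output = []
    for src, pred in zip(source_tokens, word_predictions):
        if pred == UNK:
            # Assumed API: returns (candidate_word, confidence_score).
            candidate, confidence = spelling_model.predict(src)
            output.append(candidate if confidence >= threshold else src)
        else:
            output.append(pred)
    return output
```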
With respect to numbers and punctuation being limitations: this dataset does not have any punctuation or number normalizations, except for the apostrophe ("does'nt --> doesn't"). So the model learns that every time it sees a number or a punctuation mark, it needs to copy it. It all depends on the frequency of the error: words that are often normalized to the same word (unique mappings) are easy for the model to handle. On the other hand, words that have more than one normalization target (multiple mappings) and infrequent normalizations are tough to capture. Especially for cases that appear once or twice, we are still far from perfect.
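If you want to check how many of your normalizations are unique vs. multiple mappings, and how many are rare, a quick count like the sketch below can help. The function name and the "at most twice" threshold are just illustrative, not part of the repo.

```python
from collections import Counter, defaultdict

def mapping_stats(pairs):
    """pairs: iterable of (source_token, target_token) from the training data."""
    targets = defaultdict(Counter)
    for src, tgt in pairs:
        if src != tgt:                      # only count actual normalizations
            targets[src][tgt] += 1
    unique = sum(1 for t in targets.values() if len(t) == 1)
    multiple = len(targets) - unique
    rare = sum(1 for t in targets.values() if sum(t.values()) <= 2)
    return {"unique_mappings": unique,
            "multiple_mappings": multiple,
            "seen_at_most_twice": rare}
```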
OK, thank you. I think my secondary model was used, because I saw unkown.csv in hybrid_model. I'm currently training with Vietnamese data, but the word model achieves a low result of only 68% (on test data) after 20 epochs.
I have run with the LexNorm2015 dataset from the paper "Adapting Sequence to Sequence Models for Text Normalization in Social Media".
I see that numbers seem to be copied from input to target:
https://ibb.co/SwgmfTX
But I tried to find out where you used that function in the repo and couldn't find it. Can you explain it to me? Did you not update the repo near lines 47-48 in /lib/trainer/evaluator.py?
Similarly for punctuation: where in the code is punctuation handled?
Thank you, and I hope you can find an answer for me soon.