
Function handle_numbers in /lib/metric/utils.py is not used #4

Open
manhcntt21 opened this issue Feb 25, 2020 · 6 comments


@manhcntt21

manhcntt21 commented Feb 25, 2020

I ran the model on the LexNorm2015 dataset from the paper Adapting Sequence to Sequence Models for Text Normalization in Social Media.

I see that numbers seem to be copied from the input to the target:

https://ibb.co/SwgmfTX

But I tried to find where that function is used in the repo and couldn't find it. Can you explain this to me? Did you perhaps not update the repo near lines 47-48 in /lib/trainer/evalutor.py?

Similarly for punctuation: where does the code handle punctuation?

Thank you, and I hope you can find an answer for me soon.

@Isminoula
Owner

Isminoula commented Feb 25, 2020

Hello,

Both punctuation and numbers are not handled externally; the model simply learns that they should remain the same, i.e., it learns to copy them from the source.
I wrote this function because I thought I would need it, but after experimentation I saw that the model can handle copying in these cases on its own.
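One way to confirm this behavior on your own outputs is to check that number and punctuation tokens from the source reappear unchanged in the prediction. A minimal sketch (illustrative only, not code from this repo; it assumes the source and prediction are already aligned token-by-token):

```python
import re

def copy_check(source_tokens, predicted_tokens):
    """Return (source, prediction) pairs where a number or punctuation
    token was NOT copied verbatim from the source."""
    mismatches = []
    for src, pred in zip(source_tokens, predicted_tokens):
        is_number = re.fullmatch(r"\d+([.,]\d+)?", src)
        is_punct = re.fullmatch(r"[^\w\s]+", src)
        if (is_number or is_punct) and src != pred:
            mismatches.append((src, pred))
    return mismatches

# Numbers and punctuation are copied, so no mismatches are expected:
print(copy_check(["see", "u", "at", "5", "!"],
                 ["see", "you", "at", "5", "!"]))  # → []
```

If this list comes back non-empty on your LexNorm2015 predictions, the model really is mangling numbers or punctuation rather than copying them.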

@manhcntt21
Author

OK, thank you.

I have one more question.
After I trained the word and spelling models separately, I finally tested the hybrid:

python main.py -eval -logfolder -save_dir hybrid_model -gpu 0 -load_from word_model/model_50_word.pt -char_model spelling_model/model_50_spelling.pt -input hybrid -data_augm -noise_ratio 0.1 -lowercase -bos -eos -batch_size 32 -share

But I don't see the F1 of the hybrid; instead there is only the F1 of the word and spelling models that I trained previously.

This is the file hybrid_model/output:

https://drive.google.com/file/d/1j3IXphSMfJAJBPhsRQRhxO1aZtLw8HtR/view?usp=sharing

Thank you very much!

@Isminoula
Owner

Isminoula commented Feb 26, 2020

The second (word) evaluation is using the spelling model. Just to mention: the spelling model is only used for UNK words, and only when it has high confidence, so you might first want to check whether you have any UNKs in your dataset.

If so, these lines are the precision, recall and f1 with the hybrid model:

INFO:main:=======Eval on test set=============
INFO:eval:correct_norm:67381.0, total_norm:70181.0, total_nsw:72622.0
INFO:eval:precision:0.9601031618244255, recall:0.9278317865109746, f1:0.9436916591388137

If it is the same as the basic word model, then make sure that the code passes through the function handle_unk and check whether the secondary model is actually used.

You can also take a look at the hybrid twitter model output log here.
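As a side note, the precision, recall, and F1 in the log above follow directly from correct_norm, total_norm, and total_nsw, assuming the standard lexical-normalization definitions (precision over attempted normalizations, recall over all non-standard words):

```python
# Counts taken from the log lines above
correct_norm = 67381.0  # correctly normalized tokens
total_norm = 70181.0    # tokens the system attempted to normalize
total_nsw = 72622.0     # non-standard words that needed normalization

precision = correct_norm / total_norm
recall = correct_norm / total_nsw
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 4), round(recall, 4), round(f1, 4))
# → 0.9601 0.9278 0.9437
```

Recomputing these from your own counts is a quick way to check which model actually produced a given evaluation block.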

@manhcntt21
Author

manhcntt21 commented Feb 29, 2020

So why, when I look at the paper, are the accuracy values of the hybrid model and the word model different?
https://ibb.co/xmkVqZd ([paper](https://github.com/manhcntt21/TextNormSeq2Seq/blob/master/3234-Article-Text-6283-1-10-20190531.pdf))
According to your explanation they would be equal, so where am I going wrong?

Also, since I know numbers and punctuation are limitations in NLP, can you give me more detail on why this model can handle such cases? Thank you very much.

@Isminoula
Owner

First of all, make sure that the code passes through the function handle_unk and check whether the secondary model is actually used.

I think there are two reasons that you may not be using the spelling model:

  1. You have no UNK words: check if you actually have any UNK words in the word-level vocabulary.
  2. The errors in UNK words are very infrequent so the model has very low confidence: check which words become UNK and how many counts of these words you have in the spelling vocabulary.

With respect to numbers and punctuation being limitations: this dataset does not have any punctuation or number normalizations, except for the apostrophe ("does'nt --> doesn't"). So the model learns that every time it sees a number or a punctuation mark, it needs to copy it. It all depends on the frequency of the error: words that are always normalized to the same word (unique mappings) are easy for the model to handle. On the other hand, words that have more than one normalization target (multiple mappings) and infrequent normalizations are tough to capture. Especially for cases that appear once or twice, we are still far from perfect.
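The hybrid fallback described in this thread could be sketched roughly as follows. This is a hypothetical illustration with assumed names (hybrid_normalize, CONFIDENCE_THRESHOLD, the predict interfaces); the repo's actual handle_unk may be structured differently:

```python
UNK = "<unk>"
CONFIDENCE_THRESHOLD = 0.5  # assumed value; the actual threshold may differ

def hybrid_normalize(token, word_model, spelling_model):
    """Try the word-level model first; fall back to the character-level
    spelling model only when the word model outputs UNK and the
    spelling model is confident enough."""
    prediction = word_model.predict(token)
    if prediction == UNK:
        char_pred, confidence = spelling_model.predict(token)
        if confidence >= CONFIDENCE_THRESHOLD:
            return char_pred  # trust the spelling model's correction
        return token  # low confidence: keep the original token
    return prediction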

@manhcntt21
Author

OK, thank you.

I think my secondary model was used, because I saw unkown.csv in hybrid_model.

I'm currently training with Vietnamese data, but the word model achieves a low result of only 68% (on the test data) after 20 epochs.
