
Function handle_numbers in /lib/metric/utils.py is not used #4

Open
manhcntt21 opened this issue Feb 25, 2020 · 6 comments


@manhcntt21

manhcntt21 commented Feb 25, 2020

I ran the model on the LexNorm2015 dataset from the paper Adapting Sequence to Sequence Models for Text Normalization in Social Media.

I see that numbers seem to be copied from the input to the target:

https://ibb.co/SwgmfTX

But I tried to find where that function is used in the repo and couldn't find it. Can you explain this to me? Did you perhaps not update the repo near lines 47-48 in /lib/trainer/evalutor.py?

Similarly for punctuation: where does the code handle punctuation?

Thank you, and I hope you can find an answer for me soon.

@Isminoula
Owner

Isminoula commented Feb 25, 2020

Hello,

Both punctuation and numbers are not handled externally; the model simply learns that they should remain the same, i.e., it learns to copy them from the source.
I wrote this function because I thought I would need it, but after experimentation I saw that the model can handle copying in these cases on its own.
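One way to confirm this behavior on your own outputs is to check that number and punctuation tokens from the source reappear unchanged in the prediction. A minimal sketch (illustrative only, not code from this repo; it assumes the source and prediction are already aligned token-by-token):

```python
import re

def copy_check(source_tokens, predicted_tokens):
    """Return (source, prediction) pairs where a number or punctuation
    token was NOT copied verbatim from the source."""
    mismatches = []
    for src, pred in zip(source_tokens, predicted_tokens):
        is_number = re.fullmatch(r"\d+([.,]\d+)?", src)
        is_punct = re.fullmatch(r"[^\w\s]+", src)
        if (is_number or is_punct) and src != pred:
            mismatches.append((src, pred))
    return mismatches

# Numbers and punctuation are copied, so no mismatches are expected:
print(copy_check(["see", "u", "at", "5", "!"],
                 ["see", "you", "at", "5", "!"]))  # → []
```

If this list comes back non-empty on your LexNorm2015 predictions, the model really is mangling numbers or punctuation rather than copying them.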

@manhcntt21
Author

OK, thank you.

I have one more question.
After I trained the word and spelling models separately, I finally tested the hybrid:

python main.py -eval -logfolder -save_dir hybrid_model -gpu 0 -load_from word_model/model_50_word.pt -char_model spelling_model/model_50_spelling.pt -input hybrid -data_augm -noise_ratio 0.1 -lowercase -bos -eos -batch_size 32 -share

But I don't see the F1 of the hybrid; instead there is only the F1 of the word and spelling models that I trained previously.

This is the file hybrid_model/output:

https://drive.google.com/file/d/1j3IXphSMfJAJBPhsRQRhxO1aZtLw8HtR/view?usp=sharing

Thank you very much!

@Isminoula
Owner

Isminoula commented Feb 26, 2020

The second (word) evaluation is using the spelling model. Just to mention: the spelling model is only used for UNK words, and only when it has high confidence, so you might first want to check whether you have any UNKs in your dataset.

If so, these lines are the precision, recall and f1 with the hybrid model:

INFO:main:=======Eval on test set=============
INFO:eval:correct_norm:67381.0, total_norm:70181.0, total_nsw:72622.0
INFO:eval:precision:0.9601031618244255, recall:0.9278317865109746, f1:0.9436916591388137

If it is the same as the basic word model, then make sure that the code passes through the function handle_unk and check whether the secondary model is actually used.

You can also take a look at the hybrid twitter model output log here.
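As a side note, the precision, recall, and F1 in the log above follow directly from correct_norm, total_norm, and total_nsw, assuming the standard lexical-normalization definitions (precision over attempted normalizations, recall over all non-standard words):

```python
# Counts taken from the log lines above
correct_norm = 67381.0  # correctly normalized tokens
total_norm = 70181.0    # tokens the system attempted to normalize
total_nsw = 72622.0     # non-standard words that needed normalization

precision = correct_norm / total_norm
recall = correct_norm / total_nsw
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 4), round(recall, 4), round(f1, 4))
# → 0.9601 0.9278 0.9437
```

Recomputing these from your own counts is a quick way to check which model actually produced a given evaluation block.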

@manhcntt21
Author

manhcntt21 commented Feb 29, 2020

So why, when I look at the paper, are the accuracy values of the hybrid model and the word model different?
https://ibb.co/xmkVqZd ([paper](https://github.com/manhcntt21/TextNormSeq2Seq/blob/master/3234-Article-Text-6283-1-10-20190531.pdf))
According to your explanation they would be equal, so where am I going wrong?

Also, since I know numbers and punctuation are limitations in NLP, can you give me more detail on why this model can handle such cases? Thank you very much.

@Isminoula
Owner

First of all, make sure that the code passes through the function handle_unk and check whether the secondary model is actually used.

I think there are two reasons that you may not be using the spelling model:

  1. You have no UNK words: check if you actually have any UNK words in the word-level vocabulary.
  2. The errors in UNK words are very infrequent so the model has very low confidence: check which words become UNK and how many counts of these words you have in the spelling vocabulary.

With respect to numbers and punctuation being limitations: this dataset does not have any punctuation or number normalizations, except for the apostrophe ("does'nt --> doesn't"). So the model learns that every time it sees a number or a punctuation mark, it needs to copy it. It all depends on the frequency of the error: words that are always normalized to the same word (unique mappings) are easy for the model to handle. On the other hand, words that have more than one normalization target (multiple mappings) and infrequent normalizations are tough to capture. Especially for cases that appear once or twice, we are still far from perfect.
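The hybrid fallback described in this thread could be sketched roughly as follows. This is a hypothetical illustration with assumed names (hybrid_normalize, CONFIDENCE_THRESHOLD, the predict interfaces); the repo's actual handle_unk may be structured differently:

```python
UNK = "<unk>"
CONFIDENCE_THRESHOLD = 0.5  # assumed value; the actual threshold may differ

def hybrid_normalize(token, word_model, spelling_model):
    """Try the word-level model first; fall back to the character-level
    spelling model only when the word model outputs UNK and the
    spelling model is confident enough."""
    prediction = word_model.predict(token)
    if prediction == UNK:
        char_pred, confidence = spelling_model.predict(token)
        if confidence >= CONFIDENCE_THRESHOLD:
            return char_pred  # trust the spelling model's correction
        return token  # low confidence: keep the original token
    return prediction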

@manhcntt21
Author

OK, thank you.

I think my secondary model was used, because I saw unkown.csv in hybrid_model.

I'm currently training with Vietnamese data, but the word model achieves a low result of only 68% (on the test data) after 20 epochs.
