the data generated has too many <unk> #12

LNdoremi · 2022-03-31T10:07:42Z

Hi,
I am using your method to generate synthetic data for NER. The datasets I use are conll++ and conll03, but I found that the output data has over 10,000 <unk> tokens. Some of them are even given an NER tag.
I hope you could give me some tips on solving this issue.

Bosheng2020 · 2022-07-12T07:39:28Z

Hi, you can filter the generated data with some rules, e.g. remove generated samples that have invalid NER tags. You can also use an NER model to filter the generated data. Please refer to Section 2.4 in this paper: https://aclanthology.org/2021.acl-long.453.pdf. To reduce the number of <unk>, you can also adjust the criteria for replacing tokens with <unk>.
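
A minimal sketch of the rule-based filtering described above. It is not the repository's code. It assumes each generated sentence has already been parsed into a list of `(token, tag)` pairs, and the names `VALID_TAGS`, `keep_sentence`, `filter_generated` and the `max_unk_ratio` threshold are hypothetical choices for illustration.

```python
from typing import List, Tuple

# Assumed representation: one sentence = list of (token, tag) pairs.
Sentence = List[Tuple[str, str]]

# Assumed tag set: the CoNLL-2003 entity types in BIO format.
VALID_TAGS = {
    "O",
    "B-PER", "I-PER",
    "B-ORG", "I-ORG",
    "B-LOC", "I-LOC",
    "B-MISC", "I-MISC",
}
UNK = "<unk>"


def keep_sentence(sentence: Sentence, max_unk_ratio: float = 0.2) -> bool:
    """Return True if a generated sentence passes all filtering rules."""
    if not sentence:
        return False
    unk_count = 0
    for token, tag in sentence:
        # Rule 1: drop sentences that contain a tag outside the known set.
        if tag not in VALID_TAGS:
            return False
        if token == UNK:
            # Rule 2: drop sentences where <unk> carries an entity tag.
            if tag != "O":
                return False
            unk_count += 1
    # Rule 3: drop sentences that consist mostly of <unk>.
    return unk_count / len(sentence) <= max_unk_ratio


def filter_generated(sentences: List[Sentence],
                     max_unk_ratio: float = 0.2) -> List[Sentence]:
    """Keep only the generated sentences that pass keep_sentence."""
    return [s for s in sentences if keep_sentence(s, max_unk_ratio)]
```

The threshold of 0.2 is arbitrary; a stricter or looser value may suit your data better. Filtering with a trained NER model, as suggested in the paper, would be a separate step on top of these rules.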
