the data generated has too many <unk> #12

Open
LNdoremi opened this issue Mar 31, 2022 · 1 comment
Comments

@LNdoremi
Hi,
I am using your method to generate synthetic data for NER on the CoNLL++ and CoNLL03 datasets, but I found that the output data contains over 10,000 <unk> tokens. Some of them are even assigned an NER tag.
Could you give me some tips on solving this issue?

@Bosheng2020
Bosheng2020 commented Jul 12, 2022

Hi, you can filter the generated data with some rules, e.g. remove generated sentences that have invalid NER tags. You can also use an NER model to filter the generated data. Please refer to Section 2.4 in this paper: https://aclanthology.org/2021.acl-long.453.pdf. To reduce the number of <unk> tokens, you can also adjust the criteria for replacing tokens with <unk>.
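
A minimal sketch of what such rule-based filtering could look like. This is not the repository's code. It assumes each generated sentence has already been parsed into (token, tag) pairs, that the tag set is the CoNLL-2003 BIO scheme, and that unknown tokens appear literally as `<unk>`. The names `keep_sentence`, `filter_generated`, and the `max_unk_ratio` threshold are hypothetical.

```python
from typing import List, Tuple

# Assumed format: one sentence is a list of (token, tag) pairs.
Sentence = List[Tuple[str, str]]

# Assumed tag set: CoNLL-2003 entity types in the BIO scheme.
VALID_TAGS = {"O"} | {
    f"{prefix}-{etype}"
    for prefix in ("B", "I")
    for etype in ("PER", "ORG", "LOC", "MISC")
}
UNK = "<unk>"  # assumed surface form of the unknown token


def keep_sentence(sentence: Sentence, max_unk_ratio: float = 0.2) -> bool:
    """Return True if a generated sentence passes all filtering rules."""
    if not sentence:
        return False
    unk_count = 0
    for token, tag in sentence:
        # Rule 1: drop sentences that contain an invalid NER tag.
        if tag not in VALID_TAGS:
            return False
        if token == UNK:
            # Rule 2: drop sentences where <unk> carries an entity tag.
            if tag != "O":
                return False
            unk_count += 1
    # Rule 3: drop sentences dominated by <unk> (threshold is a guess).
    return unk_count / len(sentence) <= max_unk_ratio


def filter_generated(
    sentences: List[Sentence], max_unk_ratio: float = 0.2
) -> List[Sentence]:
    """Keep only the generated sentences that pass keep_sentence."""
    return [s for s in sentences if keep_sentence(s, max_unk_ratio)]
```

The threshold in rule 3 would need tuning on the actual output: a stricter value removes more noisy sentences but also shrinks the synthetic dataset.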
