Zero-Shot-Cross-Lingual-NER

Repository for the AAAI'20 paper "Zero-Resource Cross-Lingual Named Entity Recognition".

News

Dataset released
- The datasets for Finnish (fi) and Arabic (ar) have been updated. Please check here.
- The remaining datasets are the CoNLL datasets for en, es, de and nl. They are available from their respective sources.

Language short form

Language	Three letter	Standard
English	eng	en
Spanish	esp	es
Dutch	ned	nl
German	deu	de
Arabic	arb	ar

This repository uses the standard short forms of the languages. Note: CoNLL uses the three-letter short forms.
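As a convenience, the table above can be written as a small lookup. This is only a sketch: the names `CONLL_TO_STANDARD` and `to_standard` are hypothetical and not part of this repository, and the mapping covers only the five languages listed in the table.

```python
# Mapping copied from the table above: CoNLL three-letter code -> standard code.
# The dictionary and helper names are illustrative, not part of this repository.
CONLL_TO_STANDARD = {
    "eng": "en",
    "esp": "es",
    "ned": "nl",
    "deu": "de",
    "arb": "ar",
}


def to_standard(code: str) -> str:
    """Return the standard short form for a code listed in the table."""
    code = code.strip().lower()
    if code in CONLL_TO_STANDARD.values():
        return code  # already in standard form
    # Raises KeyError for codes that are not in the table.
    return CONLL_TO_STANDARD[code]
```

For example, `to_standard("ned")` returns `"nl"`, and a code that is already in standard form, such as `"en"`, is returned unchanged.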

Dataset

English (en) : CoNLL-2003 shared task.
Spanish (es) : CoNLL-2002 shared task.
Dutch (nl) : CoNLL-2002 shared task.
German (de) : CoNLL-2003 shared task.
Finnish (fi) : Adapted from this repository. The original data is not in a form that supports transfer learning experiments (from the English CoNLL NER dataset to the Finnish dataset), so we refactored the original source and manually corrected some tags for standardization.
Arabic (ar) : Adapted from here. The original data does not come with a proper train, dev, test split. It consists of 28 manually annotated Wikipedia articles. To create the splits, we randomly select sentences from each article and add them to the train, dev or test split. Split sizes: train (~90%), dev (~10%), test (~10%). A few tags and/or tokens were manually altered for standardization so that we can perform transfer learning experiments.
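The per-article split described for the Arabic data can be sketched roughly as follows. This is an illustration under assumptions, not the script used to build the released files: the function names, the seed, and the exact fractions are hypothetical. Here the dev and test fractions are parameters and the train split receives whatever remains of each article.

```python
import random


def split_article(sentences, rng, dev_frac=0.1, test_frac=0.1):
    """Randomly split one article's sentences into train, dev and test lists."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_frac)
    n_test = int(len(shuffled) * test_frac)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]  # remainder of the article
    return train, dev, test


def split_corpus(articles, dev_frac=0.1, test_frac=0.1, seed=0):
    """Split every article separately, then pool the pieces per split."""
    rng = random.Random(seed)  # assumed seed, for reproducibility only
    train, dev, test = [], [], []
    for sentences in articles:
        tr, dv, te = split_article(sentences, rng, dev_frac, test_frac)
        train.extend(tr)
        dev.extend(dv)
        test.extend(te)
    return train, dev, test
```

Splitting inside each article, rather than over the pooled corpus, means every article contributes sentences to all three splits whenever it is long enough.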

If you are using the refined Finnish NER dataset, please cite the following papers:

@inproceedings{bari19,
	Address     = {New York, USA},
	Author      = {M Saiful Bari and Shafiq Joty and Prathyusha Jwalapuram},
	Booktitle   = {Proceedings of the 34th AAAI Conference on Artificial Intelligence},
	Numpages    = {},
	Publisher   = {AAAI},
	Series      = {AAAI '20},
	Pages       = {xx--xx},
	Title       = {{Zero-Resource Cross-Lingual Named Entity Recognition}},
	Year        = {2020},
	url         = {}
}

@article{Ruokolainen_2019,
   title={A Finnish news corpus for named entity recognition},
   ISSN={1574-0218},
   url={http://dx.doi.org/10.1007/s10579-019-09471-7},
   DOI={10.1007/s10579-019-09471-7},
   journal={Language Resources and Evaluation},
   publisher={Springer Science and Business Media LLC},
   author={Ruokolainen, Teemu and Kauppinen, Pekka and Silfverberg, Miikka and Lindén, Krister},
   year={2019},
   month={Aug}
}

If you are using the refined Arabic NER dataset, please cite the following papers:

@inproceedings{bari19,
	Address     = {New York, USA},
	Author      = {M Saiful Bari and Shafiq Joty and Prathyusha Jwalapuram},
	Booktitle   = {Proceedings of the 34th AAAI Conference on Artificial Intelligence},
	Numpages    = {},
	Publisher   = {AAAI},
	Series      = {AAAI '20},
	Pages       = {xx--xx},
	Title       = {{Zero-Resource Cross-Lingual Named Entity Recognition}},
	Year        = {2020},
	url         = {}
}

@inproceedings{AQMAR,
 author = {Mohit, Behrang and Schneider, Nathan and Bhowmick, Rishav and Oflazer, Kemal and Smith, Noah A.},
 title = {Recall-oriented Learning of Named Entities in Arabic Wikipedia},
 booktitle = {EACL},
 series = {EACL '12},
 year = {2012},
 isbn = {978-1-937284-19-0},
 location = {Avignon, France},
 pages = {162--173},
 numpages = {12},
 url = {http://dl.acm.org/citation.cfm?id=2380816.2380839},
 acmid = {2380839},
 publisher = {ACL},
 address = {Stroudsburg, PA, USA},
}
