Towards a Better Understanding of Variations in Zero-Shot Neural Machine Translation Performance, EMNLP 2023
This is the repository for the EMNLP 2023 paper "Towards a Better Understanding of Variations in Zero-Shot Neural Machine Translation Performance"; see the paper for details.
Resource (size) | Germanic | Romance | Slavic | Indo-Aryan | Afro-Asiatic |
---|---|---|---|---|---|
High (5M) | de, nl | fr, es | ru, cs | hi, bn | ar, he |
Medium (1M) | sv, da | it, pt | pl, bg | kn, mr | mt, ha |
Low (100k) | af, lb | ro, oc | uk, sr | sd, gu | ti, am |
Extremely-Low (50k) | no, is | ast, ca | be, bs | ne, ur | kab, so |
EC40 is a multilingual neural machine translation (MNMT) training dataset built to support the study of MNMT and zero-shot NMT. It contains 66 million English-centric sentences covering 40 languages (excluding English) across 5 language families, sampled from the OPUS corpus.
Features:
- Wide Resource Spectrum:
  - Ranging from High (5M) to Medium (1M), Low (100K), and Extremely-Low (50K) resources.
- Linguistic Diversity:
  - Each language family is represented at every resource level with two languages, highlighting a balanced and inclusive sampling approach.
- As a Benchmark:
  - In total, there are 80 English-centric directions for training and 1,640 directions (including all supervised and ZS directions) for evaluation. Therefore, the EC40 dataset also serves as a benchmark to study multilingual and zero-shot MT.
- Multi-parallel Utilization:
  - We make use of Ntrex-128 and Flores-200 as our validation and test datasets, respectively, because of their unique multiparallel characteristics, allowing for further analyses.
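The direction counts quoted above follow directly from the language table. As a minimal sketch (the variable names are illustrative and not part of the repository), the 80 supervised and 1,640 evaluation directions can be enumerated like this:

```python
from itertools import permutations

# Language codes copied from the EC40 table, grouped by resource level.
EC40 = {
    "high": ["de", "nl", "fr", "es", "ru", "cs", "hi", "bn", "ar", "he"],
    "medium": ["sv", "da", "it", "pt", "pl", "bg", "kn", "mr", "mt", "ha"],
    "low": ["af", "lb", "ro", "oc", "uk", "sr", "sd", "gu", "ti", "am"],
    "extremely_low": ["no", "is", "ast", "ca", "be", "bs", "ne", "ur", "kab", "so"],
}

languages = [lang for level in EC40.values() for lang in level]
all_languages = ["en"] + languages

# Every ordered (source, target) pair with source != target.
all_directions = list(permutations(all_languages, 2))

# Supervised directions involve English on one side; the rest are zero-shot.
supervised = [(s, t) for s, t in all_directions if "en" in (s, t)]
zero_shot = [(s, t) for s, t in all_directions if "en" not in (s, t)]

print(len(languages), len(supervised), len(zero_shot), len(all_directions))
# 40 80 1560 1640
```

In other words, 41 languages (40 plus English) give 41 × 40 = 1,640 ordered pairs, of which 2 × 40 = 80 are English-centric and the remaining 1,560 are zero-shot.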
We highly recommend using EC40 this way, through the pre-processed Fairseq data, unless you (1) do not use Fairseq or (2) want to change the SPM dictionary.
- Download EC40 Fairseq train-val set data-bin. We provide our pre-processed training-validation sets in Fairseq bin format, which is the easiest way to reproduce.
- Download EC40 Fairseq test set data-bin. We provide the pre-processed test set in Fairseq bin format for direct evaluation as well.
- Download trained SPM Dictionary and Model. You can download our trained SentencePiece dictionary and model (also included under `ZS-NMT-Variations/get-val-test-data/spm_dict`).
- Training scripts. Scripts for training baseline models on EC40.
- Evaluation scripts. Scripts for evaluating baseline models on both Supervised and Zero-Shot directions.
- Baseline Model Checkpoints. We also provide checkpoints of baseline models.
Please install cyrtranslit first with `pip install cyrtranslit`; it is used to build the test set.
Clone this repo with `git clone https://github.com/Smu-Tan/ZS-NMT-Variations.git`, then run the scripts under this directory.
To use "Plain" EC40, we provide the simplified procedure below:
- Download the Plain EC40 dataset and prepare the val & test sets.
- Train your own SPM dict and model. (Otherwise, go to the section "Use EC40 as a Benchmark".)
- Build the sharded dataset.
- Download EC40 Dataset (Plain). Here "Plain" means the data has not been processed with BPE; all files are in txt format. EC40 is ready to use: we carefully pre-processed it, so there is no need to run additional preprocessing such as deduplication or Moses normalization.
- Prepare Validation and test set (Plain). We provide the scripts for building the "Plain" validation and test sets from Ntrex-128 and Flores-200. Note: we merged the Flores-200 dev and dev-test splits to form the final test set.
- Script of Building Sharded Dataset. A script template showing how to build the sharded dataset from the "Plain" dataset. You can skip this step if you use Huggingface or tools other than Fairseq.
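To illustrate what building a sharded dataset involves, here is a minimal sketch that splits aligned source and target lines into shards while keeping each sentence pair together. It is not the repository's script template: the helper name `shard_parallel`, the round-robin assignment, and the shard count are assumptions chosen for illustration, and the actual binarization with Fairseq is left out.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]


def shard_parallel(src: Sequence[str], tgt: Sequence[str], num_shards: int) -> List[List[Pair]]:
    """Distribute aligned (source, target) lines over num_shards shards, round-robin."""
    if len(src) != len(tgt):
        raise ValueError("source and target must have the same number of lines")
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    pairs = list(zip(src, tgt))
    # Shard i receives pairs i, i + num_shards, i + 2 * num_shards, ...
    return [pairs[i::num_shards] for i in range(num_shards)]


# Toy corpus standing in for one English-centric direction (hypothetical data).
src = [f"en sentence {i}" for i in range(10)]
tgt = [f"de sentence {i}" for i in range(10)]
shards = shard_parallel(src, tgt, num_shards=4)
print([len(s) for s in shards])  # [3, 3, 2, 2]
```

Round-robin assignment keeps shard sizes within one line of each other, which matters for the low-resource directions where a contiguous split could leave some shards nearly empty.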
If you want to reconstruct or replace the validation and test sets, follow the pre-processing guidance below.
Please install cyrtranslit first with `pip install cyrtranslit`; it is used to build the test set.
Clone this repo with `git clone https://github.com/Smu-Tan/ZS-NMT-Variations.git`, then run the scripts under this directory.
- Step 1: Prepare Validation and test set. We provide the scripts for building the validation and test sets from Ntrex-128 and Flores-200. Follow this step if you want to use EC40 as a benchmark (with its original SentencePiece dictionary). Note: we merged the Flores-200 dev and dev-test splits to form the final test set.
- Step 2: Copy the val set to fairseq-data-bin-sharded. This step makes sure the val set is present alongside the training data, as Fairseq training expects.
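The two steps can be sketched roughly as follows. This is an illustration rather than the repository's scripts: the helper names `merge_splits` and `copy_valid_to_shards` are hypothetical, and the layout (one sub-directory per shard, validation files named `valid.*`) is an assumption about how the sharded data directory is organized.

```python
import shutil
from pathlib import Path
from typing import Dict, List


def merge_splits(dev: Dict[str, List[str]], devtest: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Concatenate dev and dev-test per language, keeping lines aligned across languages."""
    if dev.keys() != devtest.keys():
        raise ValueError("both splits must cover the same languages")
    merged = {lang: dev[lang] + devtest[lang] for lang in dev}
    # Multi-parallel data must have the same number of lines in every language.
    if len({len(lines) for lines in merged.values()}) > 1:
        raise ValueError("line counts differ across languages")
    return merged


def copy_valid_to_shards(valid_dir: Path, sharded_root: Path) -> int:
    """Copy every valid.* file into each shard sub-directory; return the number of copies."""
    copied = 0
    # Assumption: each shard is a sub-directory of sharded_root.
    for shard_dir in sorted(p for p in sharded_root.iterdir() if p.is_dir()):
        for valid_file in sorted(valid_dir.glob("valid.*")):
            shutil.copy2(valid_file, shard_dir / valid_file.name)
            copied += 1
    return copied


# Toy multi-parallel splits (hypothetical data).
dev = {"en": ["a", "b"], "de": ["A", "B"]}
devtest = {"en": ["c"], "de": ["C"]}
test_set = merge_splits(dev, devtest)
print(test_set["de"])  # ['A', 'B', 'C']
```

Because dev and dev-test are concatenated in the same order for every language, line *i* of the merged test set is still the same sentence in all languages, which preserves the multi-parallel property used for zero-shot evaluation.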
Please cite both our paper (tan2023towards) and OPUS (tiedemann2012parallel) if you use only the EC40 training dataset.
@inproceedings{tan2023towards,
title={Towards a Better Understanding of Variations in Zero-Shot Neural Machine Translation Performance},
author={Tan, Shaomu and Monz, Christof},
booktitle={Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing},
pages={13553--13568},
year={2023}
}
@inproceedings{tiedemann2012parallel,
title={Parallel data, tools and interfaces in OPUS.},
author={Tiedemann, J{\"o}rg},
booktitle={LREC},
volume={2012},
pages={2214--2218},
year={2012},
organization={Citeseer}
}
Please also cite Ntrex-128 and Flores-200 if you use the same validation and test datasets.
@inproceedings{federmann2022ntrex,
title={NTREX-128--news test references for MT evaluation of 128 languages},
author={Federmann, Christian and Kocmi, Tom and Xin, Ying},
booktitle={Proceedings of the First Workshop on Scaling Up Multilingual Evaluation},
pages={21--24},
year={2022}
}
@article{costa2022no,
title={No language left behind: Scaling human-centered machine translation},
author={Costa-juss{\`a}, Marta R and Cross, James and {\c{C}}elebi, Onur and Elbayad, Maha and Heafield, Kenneth and Heffernan, Kevin and Kalbassi, Elahe and Lam, Janice and Licht, Daniel and Maillard, Jean and others},
journal={arXiv preprint arXiv:2207.04672},
year={2022}
}
- EC40 is sampled from the OPUS corpus. We thank Jörg Tiedemann and the other researchers who contributed to OPUS.
- We thank the researchers who contributed to Ntrex-128 and Flores-200.