Add HalOmi code
avidale committed Nov 21, 2023
1 parent 35a39b8 commit 27a7d18
Showing 12 changed files with 1,726 additions and 0 deletions.
399 changes: 399 additions & 0 deletions demo/halomi/LICENSE

Large diffs are not rendered by default.

215 changes: 215 additions & 0 deletions demo/halomi/README.md
@@ -0,0 +1,215 @@
# HalOmi dataset

HalOmi is a small corpus of sentence translations between 9 languages,
obtained with an NLLB-200 model and manually annotated for translation hallucinations and omissions.

It is intended for benchmarking methods that detect hallucinations and omissions at the sentence and word levels.

The dataset is described and applied in the paper [HalOmi: A Manually Annotated Benchmark for Multilingual Hallucination and Omission Detection in Machine Translation](https://arxiv.org/abs/2305.11746).

The dataset includes the following languages:

- High-resource ones:
- arb_Arab (Modern Standard Arabic)
- deu_Latn (German)
- eng_Latn (English)
- rus_Cyrl (Russian)
- spa_Latn (Spanish)
- zho_Hans (Mandarin)
- Lower-resource ones:
- kas_Deva (Kashmiri)
- mni_Beng (Manipuri, also known as Meitei)
- yor_Latn (Yoruba)

For each of the 8 non-English languages, the dataset includes translations to and from English.
Additionally, there is a zero-shot translation direction between Spanish and Yoruba.

The dataset is intended as a test set for benchmarking hallucination and omission detection methods.
We recommend using only the subset of natural translations for this purpose.
We do not provide train/test splits because the dataset is small.
If the evaluated method requires tuning some parameters, we recommend using cross-validation.
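
Below is a minimal sketch of that protocol, assuming pandas and scikit-learn are available and that `halomi_full.tsv` (documented below) sits directly under the `data` directory after unpacking the archive:

```python
import pandas as pd
from sklearn.model_selection import KFold

# Keep only the natural (non-perturbed) translations recommended for benchmarking.
df = pd.read_csv("data/halomi_full.tsv", sep="\t", keep_default_na=False)
natural = df[df["perturbation"] == "natural"].reset_index(drop=True)

# If the evaluated detector has tunable parameters (e.g. a decision threshold),
# estimate them with cross-validation rather than a fixed train/test split.
splitter = KFold(n_splits=5, shuffle=True, random_state=0)
for tune_idx, eval_idx in splitter.split(natural):
    tune_split, eval_split = natural.iloc[tune_idx], natural.iloc[eval_idx]
    # ... fit the threshold on tune_split, report the metric on eval_split ...
```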

The code for reproducing all the predicted scores and for computing evaluation metrics on this dataset
is released in the current directory and described below.

The code and most of the data are licensed under [CC-BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/).
However, portions of the dataset are available under separate license terms:
text sourced from [FLORES-200](https://github.com/facebookresearch/flores/tree/main/flores200),
[Jigsaw Toxic Comment Classification](https://www.kaggle.com/competitions/jigsaw-multilingual-toxic-comment-classification/),
and [Wikipedia](https://dumps.wikimedia.org/)
are licensed under [CC-BY-SA](https://creativecommons.org/licenses/by-sa/4.0/).

The dataset can be downloaded as a zip archive [from this url](https://dl.fbaipublicfiles.com/nllb/halomi_release_v2.zip).

## An example evaluation script

To reproduce the evaluation of all detection methods, please install the packages from `requirements.txt`,
then download and unpack the dataset (it will be extracted into the `data` directory):

```bash
pip install -r requirements.txt
wget https://dl.fbaipublicfiles.com/nllb/halomi_release_v2.zip
unzip halomi_release_v2.zip
```

Then you can run the script `reproduce_evaluation.py`. The following output is expected:

```
Reproducing sentence-level scores...
Direction-wise mean score for hallucination detection:
score_log_loss 0.796786
score_alti_mean 0.747689
score_alti_t_mean 0.579903
score_attn_ot 0.533654
score_comet_qe 0.748242
score_labse 0.780582
score_laser 0.753988
score_xnli 0.669939
score_blaser2_qe 0.825663
... (part of the output omitted)
Everything reproduced as expected!
```

Please note that:

- Due to a mistake found after releasing [the v1 version of the paper](https://arxiv.org/abs/2305.11746v1),
  the scores for omission detection differ slightly from the ones reported in that version;
  they correspond instead to the final version of the paper.
- After releasing the paper, we found that some of the `score_log_loss` values had been computed incorrectly.
  The original scores are stored in the `score_log_loss_legacy` column of the dataset
  for reproducing the exact numbers from the paper, while the `score_log_loss` column contains corrected scores,
  which correlate slightly better with human judgements.

## Dataset description

The dataset was created by translating open source sentences with an NLLB model, pre-selecting them with automatic translation quality metrics (e.g. BLEU), and then manually annotating them with sentence- and word-level labels of hallucinations and omissions.

For more details, please read the [accompanying paper](#Citations).

The dataset is released in 4 files:

- `halomi_core.tsv` - the main dataset with the results of human annotation.
- `halomi_full.tsv` - an extended version of the main dataset,
including more rows (perturbed translations) and more columns (normalized texts and labels, sentence-level predicted scores).
- `halomi_full_source_tokens.tsv` and `halomi_full_target_tokens.tsv` - token-level predictions and labels
for omissions and hallucinations, respectively.

### Data fields

`halomi_core.tsv` (the first 9 fields) and `halomi_full.tsv` (all fields):

| Field | Description |
| --------------------- | ------------------------------------------------------------------------------------------------------------------------------- |
| src_lang | Source language code |
| tgt_lang | Target language code |
| src_text | Source text (raw) |
| mt_text | Translated text (raw) |
| omit_spans | Source text, with the omitted parts enclosed in `<<<>>>` |
| hall_spans | Translation text, with the hallucinated parts enclosed in `<<<>>>` |
| class_hall | Human annotation of sentence-level hallucination degree |
| class_omit | Human annotation of sentence-level omission degree |
| data_source | Source of texts, `wiki` or `flores` |
| hall_mask | Character-level hallucination mask (w.r.t. `mt_text`) |
| omit_mask | Character-level omission mask (w.r.t. `src_text`) |
| src_text_normalized | `src_text` with sentencepiece `nmt_nfkc` normalization used in NLLB |
| mt_text_normalized | `mt_text` with sentencepiece `nmt_nfkc` normalization used in NLLB |
| hall_mask_normalized | Character-level hallucination mask (w.r.t. `mt_text_normalized`) |
| omit_mask_normalized | Character-level omission mask (w.r.t. `src_text_normalized`) |
| perturbation | Translation method, either `natural` or `perturbed` |
| direction | `src_lang` and `tgt_lang` joined by `-` |
| selection | Pre-selection method, one of `uniform`, `biased` and `worst` |
| score_log_loss | Sequence log-probability of the translation under the NLLB model |
| score_alti_mean | Average (over target tokens) total (over source tokens) source-target contribution computed with ALTI+ |
| score_alti_t_mean | Average (over target tokens) maximum (over source tokens) source-target contribution computed with ALTI+ |
| score_attn_ot | "Wass-Combo" score based on attention map of the NLLB model |
| score_comet_qe | Score predicted by the COMET-QE model |
| score_labse | Cosine similarity of LaBSE sentence embeddings |
| score_laser | Cosine similarity of LASER-3 sentence embeddings |
| score_xnli | A product of direct (source=>translation) and reverse (translation=>source) entailment probabilities, predicted by an XNLI model |
| score_sonar_cosine | Cosine similarity of SONAR sentence embeddings |
| score_blaser2_qe | Prediction of the BLASER 2.0-QE model based on SONAR embeddings |
| score_log_loss_legacy | A previous, incorrect version of `score_log_loss`, included for reproducibility |

Note: most of the `score_` columns (all except `score_log_loss` and `score_attn_ot`) are negated, so that lower values correspond to better estimated translation quality. The `score_comet_qe` column is additionally shifted by 1.
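
As an illustration of how these columns fit together, here is a hedged sketch of a sentence-level hallucination detection evaluation with ROC AUC per direction. The binarization of `class_hall` and the choice of `score_labse` as the detector are assumptions made for the example; `reproduce_evaluation.py` remains the authoritative reference.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

df = pd.read_csv("data/halomi_full.tsv", sep="\t", keep_default_na=False)
natural = df[df["perturbation"] == "natural"]

# Assumption: the lowest class_hall value means "no hallucination", so anything
# above it counts as a hallucination of some degree; check the actual label
# encoding before relying on this.
labels = (natural["class_hall"] > natural["class_hall"].min()).astype(int)

# Most score_ columns are oriented so that higher values indicate worse quality,
# so they can be passed to ROC AUC directly; score_labse is one such column.
# (Assumes every direction contains both positive and negative examples.)
aucs = {
    direction: roc_auc_score(labels[group.index], group["score_labse"])
    for direction, group in natural.groupby("direction")
}
print(pd.Series(aucs).mean())
```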

`halomi_full_source_tokens.tsv` and `halomi_full_target_tokens.tsv`:

| Field | Description |
| -------------------------- | ------------------------------------------------------------------------------------------------------ |
| token | Sentencepiece token from the NLLB model |
| row_id | Index of the sentence in the `halomi_full.tsv` file |
| direction | `src_lang` and `tgt_lang` joined by `-` |
| perturbation | Translation method, either `natural` or `perturbed` |
| label_mask | String of human labels (`1`=hallucinated/omitted, `0`=normal) for each character in the token |
| token_label | Human label for the whole token (`1`=hallucinated/omitted, `0`=normal) |
| token_weight | Length of token in characters (0 for added tokens) |
| start | Start position of the token in the corresponding sentence |
| end | End position of the token in the corresponding sentence |
| score_log_loss | Log probability of the token under the NLLB model |
| score_log_loss_contrastive | Log probability of the token under the NLLB model, minus its log probability conditioned on an empty source |
| score_alti_sum | Sum of ALTI+ contributions for the token |
| score_alti_max | Maximum of ALTI+ contributions for the token |

When reading the files, please make sure that the `"na"` tokens are parsed as strings, not as `NaN`.
Example code:

```Python
import os
import pandas as pd

data_root = "data"  # directory into which the dataset archive was unpacked
source_token_df = pd.read_csv(
    os.path.join(data_root, "halomi_full_source_tokens.tsv"),
    sep="\t",
    keep_default_na=False,
)
```
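
Building on the snippet above, here is a hedged sketch of a token-level omission detection evaluation on the source-token file. The choice of `score_alti_sum` as the detector and its sign convention are assumptions; verify them against `reproduce_evaluation.py`.

```python
from sklearn.metrics import roc_auc_score

# Keep natural translations and tokens that carry weight (added tokens have weight 0).
tokens = source_token_df[
    (source_token_df["perturbation"] == "natural")
    & (source_token_df["token_weight"].astype(int) > 0)
]

# Assumption: a low total ALTI+ contribution of a source token suggests that it
# was omitted from the translation, so the score is negated before computing
# ROC AUC against the human token labels, weighting tokens by character length.
auc = roc_auc_score(
    tokens["token_label"].astype(int),
    -tokens["score_alti_sum"].astype(float),
    sample_weight=tokens["token_weight"].astype(float),
)
print(auc)
```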

## On reproducing the dataset creation

### Translations

The translations were produced using the [nllb](https://github.com/facebookresearch/fairseq/tree/nllb) branch of the `fairseq` repo.
The script `example_translation_script.sh` shows all the parameters that we used to create the translations.

### Detection scores

The script `compute_detection_scores.py` reproduces most of the scores exactly. The exception is
`score_attn_ot`, which requires a large set of scored reference translations that we do not release.
In addition, our implementation of the attention-optimal-transport method is randomized (reproducible with a fixed seed),
but we did not set the seed when producing the original scores, so the reproduced values may differ slightly.

To run this script, you will have to:

- install the requirements from `requirements-detection.txt`;
- install `fairseq` at the [nllb](https://github.com/facebookresearch/fairseq/tree/nllb) branch;
- install [stopes](https://github.com/facebookresearch/stopes);
- from the [NLLB page](https://github.com/facebookresearch/fairseq/tree/nllb), download
the NLLB-200-600M checkpoint, dictionary and Sentencepiece model;
- rename the dictionary file to `dict.eng_Latn.txt` and put it in the directory that will later be used as `NLLB_DATA_DIR`.

Now you can run the script for reproducing the scores as follows:

```bash
python compute_detection_scores.py \
--data-root=data \
    --save-filename=scores_reproduction.tsv \
--nllb-data-dir={PATH TO THE DIRECTORY WITH dict.eng_Latn.txt} \
--nllb-spm-path={PATH TO THE NLLB SPM MODEL} \
--nllb-checkpoint={PATH TO THE NLLB PYTORCH MODEL} \
--attn-references-path={PATH TO THE ATTN-OT REFERENCES, IF YOU HAVE THEM} \
--internal --comet --labse --laser --xnli --sonar
```

## Citations

To refer to the dataset or evaluation results, please cite:

```bibtex
@article{dale2023halomi,
title={HalOmi: A Manually Annotated Benchmark for Multilingual Hallucination and Omission Detection in Machine Translation},
author={Dale, David and Voita, Elena and Lam, Janice and Hansanti, Prangthip and Ropers, Christophe and Kalbassi, Elahe and Gao, Cynthia and Barrault, Lo{\"\i}c and Costa-juss{\`a}, Marta R},
journal={arXiv preprint arXiv:2305.11746},
url={https://arxiv.org/abs/2305.11746},
year={2023}
}
```
5 changes: 5 additions & 0 deletions demo/halomi/attention_optimal_transport/__init__.py
@@ -0,0 +1,5 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
153 changes: 153 additions & 0 deletions demo/halomi/attention_optimal_transport/att_maps_compute.py
@@ -0,0 +1,153 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.

import argparse
import asyncio
import typing as tp
from dataclasses import dataclass
from pathlib import Path

import numpy as np
import pandas as pd
import torch
from omegaconf import OmegaConf
from tqdm.auto import tqdm

import stopes
from stopes.core.stopes_module import Requirements, StopesModule
from stopes.eval.alti.alti_metrics.alti_metrics_utils import binarize_pair
from stopes.eval.alti.alti_metrics.nllb_alti_detector import load_nllb_model


def get_attention_maps(data, alti_hub):
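    # For each source-translation pair: tokenize the pair with the NLLB
    # sentencepiece model, run a forward pass through the translation model,
    # and store the returned attention map averaged over its first dimension.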
maps = []
for row in tqdm(data.itertuples(), total=data.shape[0]):
src_lang, tgt_lang = row.direction.split("-")
st, pt, tt = binarize_pair(
alti_hub, row.src, row.mt, src_lang=src_lang, tgt_lang=tgt_lang
)

with torch.inference_mode():
logits, out = alti_hub.models[0].forward(
src_tokens=st.unsqueeze(0).to(alti_hub.device),
prev_output_tokens=tt.unsqueeze(0).to(alti_hub.device),
src_lengths=torch.tensor(st.shape).to(alti_hub.device),
)
maps.append(out["attn"][0][0].cpu().numpy().mean(0))
return maps


@dataclass
class ScoreConfig:
input_files: tp.List[Path]
output_dir: Path
model_data_dir: str
model_spm_path: str
model_checkpoint_path: str


class AttentionScoreModule(StopesModule):
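    # A stopes module sharded over the input files: each job-array task loads the
    # NLLB model on one GPU, computes attention maps for its TSV file, and saves
    # them to <output_dir>/<input file name>.attmaps.npy.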
def __init__(self, config):
super().__init__(config, ScoreConfig)
self.config: ScoreConfig = config

def requirements(self) -> Requirements:
return Requirements(gpus_per_node=1)

def array(self):
return self.config.input_files

def run(
self,
iteration_value: tp.Optional[tp.Any] = None,
iteration_index: int = 0,
) -> tp.Any:
if iteration_value is not None:
input_file = iteration_value
else:
input_file = self.config.input_files[0]
output_file = (
self.config.output_dir / input_file.with_suffix(".attmaps.npy").name
)

data = pd.read_csv(input_file, sep="\t")
alti_hub = load_nllb_model(
checkpoint=Path(self.config.model_checkpoint_path),
data_dir=Path(self.config.model_data_dir),
spm=Path(self.config.model_spm_path),
src_lang="eng_Latn",
tgt_lang="eng_Latn",
)
alti_hub.cuda()

maps = get_attention_maps(data, alti_hub)

np.save(output_file, maps)
return output_file

def validate(
self,
output: tp.Any,
iteration_value: tp.Optional[tp.Any] = None,
iteration_index: int = 0,
) -> bool:
return output.exists()


async def main(
input_dir,
output_dir,
model_data_dir,
model_spm_path,
model_checkpoint_path,
launcher_cluster,
    launcher_partition,
):
input_files = list(Path(input_dir).iterdir())
output_dir = Path(output_dir)
conf = OmegaConf.structured(
ScoreConfig(
input_files=input_files,
output_dir=output_dir,
model_data_dir=model_data_dir,
model_spm_path=model_spm_path,
model_checkpoint_path=model_checkpoint_path,
)
)
scorer = AttentionScoreModule(conf)
print(f"Processing {len(input_files)} files...")
print(input_files[:5])
launcher = stopes.core.Launcher(
log_folder="executor_logs",
cluster=launcher_cluster,
        partition=launcher_partition,
max_jobarray_jobs=1000,
)
shards = await launcher.schedule(scorer)
return shards


if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"--input-dir",
type=str,
help="directory with input tsv files (with src, mt and direction columns)",
)
parser.add_argument("--output-dir", type=str, help="directory for output files")
parser.add_argument("--model-data-dir", type=str, help="NLLB data directory")
parser.add_argument("--model-spm-path", type=str, help="NLLB tokenizer path")
parser.add_argument(
"--model-checkpoint-path", type=str, help="NLLB checkpoint path"
)
parser.add_argument(
"--launcher-cluster",
type=str,
help="launcher cluster, typically slurm or local",
)
parser.add_argument("--launcher-partiton", type=str, help="slurm partition")
args = parser.parse_args()
asyncio.run(main(**vars(args)))