Add HalOmi code
avidale committed Nov 21, 2023
1 parent 35a39b8 commit 27a7d18
Showing 12 changed files with 1,726 additions and 0 deletions.
399 changes: 399 additions & 0 deletions demo/halomi/LICENSE

Large diffs are not rendered by default.

215 changes: 215 additions & 0 deletions demo/halomi/README.md
@@ -0,0 +1,215 @@
# HalOmi dataset

HalOmi is a small corpus of sentence translations between 9 languages,
obtained with an NLLB-200 model and manually annotated for translation hallucinations and omissions.

It is intended for benchmarking methods that detect hallucinations and omissions at the sentence and word levels.

The dataset is described and applied in the paper [HalOmi: A Manually Annotated Benchmark for Multilingual Hallucination and Omission Detection in Machine Translation](https://arxiv.org/abs/2305.11746).

The dataset includes the following languages:

- High-resource ones:
- arb_Arab (Modern Standard Arabic)
- deu_Latn (German)
- eng_Latn (English)
- rus_Cyrl (Russian)
- spa_Latn (Spanish)
- zho_Hans (Mandarin)
- Lower-resource ones:
- kas_Deva (Kashmiri)
- mni_Beng (Manipuri, also known as Meitei)
- yor_Latn (Yoruba)

For each of the 8 non-English languages, the dataset includes translations to and from English.
Additionally, there is a zero-shot translation direction between Spanish and Yoruba.

The dataset is intended as a test set for benchmarking hallucination and omission detection methods.
We recommend using only the subset of natural translations for this purpose.
We do not provide train/test splits because the dataset is small.
If the evaluated method requires tuning some parameters, we recommend using cross-validation.
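
Below is a minimal sketch of that protocol, assuming pandas and scikit-learn are available and that `halomi_full.tsv` (documented below) sits directly under the `data` directory after unpacking the archive:

```python
import pandas as pd
from sklearn.model_selection import KFold

# Keep only the natural (non-perturbed) translations recommended for benchmarking.
df = pd.read_csv("data/halomi_full.tsv", sep="\t", keep_default_na=False)
natural = df[df["perturbation"] == "natural"].reset_index(drop=True)

# If the evaluated detector has tunable parameters (e.g. a decision threshold),
# estimate them with cross-validation rather than a fixed train/test split.
splitter = KFold(n_splits=5, shuffle=True, random_state=0)
for tune_idx, eval_idx in splitter.split(natural):
    tune_split, eval_split = natural.iloc[tune_idx], natural.iloc[eval_idx]
    # ... fit the threshold on tune_split, report the metric on eval_split ...
```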

The code for reproducing all the predicted scores and for computing evaluation metrics on this dataset
is released in the current directory and described below.

The code and most of the data are licensed under [CC-BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/).
However, portions of the dataset are available under separate license terms:
text sourced from [FLORES-200](https://github.com/facebookresearch/flores/tree/main/flores200),
[Jigsaw Toxic Comment Classification](https://www.kaggle.com/competitions/jigsaw-multilingual-toxic-comment-classification/),
and [Wikipedia](https://dumps.wikimedia.org/)
are licensed under [CC-BY-SA](https://creativecommons.org/licenses/by-sa/4.0/).

The dataset can be downloaded as a zip archive [from this url](https://dl.fbaipublicfiles.com/nllb/halomi_release_v2.zip).

## An example evaluation script

To reproduce the evaluation of all detection methods, please install the packages from `requirements.txt`,
then download and unpack the dataset (it will be extracted into the `data` directory):

```bash
pip install -r requirements.txt
wget https://dl.fbaipublicfiles.com/nllb/halomi_release_v2.zip
unzip halomi_release_v2.zip
```

Then you can run the script `reproduce_evaluation.py`. The following output is expected:

```
Reproducing sentence-level scores...
Direction-wise mean score for hallucination detection:
score_log_loss 0.796786
score_alti_mean 0.747689
score_alti_t_mean 0.579903
score_attn_ot 0.533654
score_comet_qe 0.748242
score_labse 0.780582
score_laser 0.753988
score_xnli 0.669939
score_blaser2_qe 0.825663
... (part of the output omitted)
Everything reproduced as expected!
```

Please note that:

- Due to a mistake found after releasing [the v1 version of the paper](https://arxiv.org/abs/2305.11746v1),
  the scores for omission detection differ slightly from the ones reported in that version;
  they correspond instead to the final version of the paper.
- After releasing the paper, we found that some of the `score_log_loss` values had been computed incorrectly.
  The original scores are stored in the `score_log_loss_legacy` column of the dataset
  for reproducing the exact numbers from the paper, while the `score_log_loss` column contains corrected scores,
  which correlate slightly better with human judgements.

## Dataset description

The dataset was created by translating open source sentences with an NLLB model, pre-selecting them with automatic translation quality metrics (e.g. BLEU), and then manually annotating them with sentence- and word-level labels of hallucinations and omissions.

For more details, please read the [accompanying paper](#Citations).

The dataset is released in 4 files:

- `halomi_core.tsv` - the main dataset with the results of human annotation.
- `halomi_full.tsv` - an extended version of the main dataset,
including more rows (perturbed translations) and more columns (normalized texts and labels, sentence-level predicted scores).
- `halomi_full_source_tokens.tsv` and `halomi_full_target_tokens.tsv` - token-level predictions and labels
for omissions and hallucinations, respectively.

### Data fields

`halomi_core.tsv` (the first 9 fields) and `halomi_full.tsv` (all fields):

| Field | Description |
| --------------------- | ------------------------------------------------------------------------------------------------------------------------------- |
| src_lang | Source language code |
| tgt_lang | Target language code |
| src_text | Source text (raw) |
| mt_text | Translated text (raw) |
| omit_spans | Source text, with the omitted parts enclosed in `<<<>>>` |
| hall_spans | Translation text, with the hallucinated parts enclosed in `<<<>>>` |
| class_hall | Human annotation of sentence-level hallucination degree |
| class_omit | Human annotation of sentence-level omission degree |
| data_source | Source of texts, `wiki` or `flores` |
| hall_mask | Character-level hallucination mask (w.r.t. `mt_text`) |
| omit_mask | Character-level omission mask (w.r.t. `src_text`) |
| src_text_normalized | `src_text` with sentencepiece `nmt_nfkc` normalization used in NLLB |
| mt_text_normalized | `mt_text` with sentencepiece `nmt_nfkc` normalization used in NLLB |
| hall_mask_normalized | Character-level hallucination mask (w.r.t. `mt_text_normalized`) |
| omit_mask_normalized | Character-level omission mask (w.r.t. `src_text_normalized`) |
| perturbation | Translation method, either `natural` or `perturbed` |
| direction | `src_lang` and `tgt_lang` joined by `-` |
| selection | Pre-selection method, one of `uniform`, `biased` and `worst` |
| score_log_loss | Sequence log-probability of the translation under the NLLB model |
| score_alti_mean | Average (over target tokens) total (over source tokens) source-target contribution computed with ALTI+ |
| score_alti_t_mean | Average (over target tokens) maximum (over source tokens) source-target contribution computed with ALTI+ |
| score_attn_ot | "Wass-Combo" score based on attention map of the NLLB model |
| score_comet_qe | Score predicted by the COMET-QE model |
| score_labse | Cosine similarity of LaBSE sentence embeddings |
| score_laser | Cosine similarity of LASER-3 sentence embeddings |
| score_xnli | A product of direct (source=>translation) and reverse (translation=>source) entailment probabilities, predicted by an XNLI model |
| score_sonar_cosine | Cosine similarity of SONAR sentence embeddings |
| score_blaser2_qe | Prediction of the BLASER 2.0-QE model based on SONAR embeddings |
| score_log_loss_legacy | A previous, incorrect version of `score_log_loss`, included for reproducibility |

Note: most of the `score_` columns (all except `score_log_loss` and `score_attn_ot`) are negated, so that lower values correspond to better estimated translation quality. The `score_comet_qe` column is additionally shifted by 1.
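
As an illustration of how these columns fit together, here is a hedged sketch of a sentence-level hallucination detection evaluation with ROC AUC per direction. The binarization of `class_hall` and the choice of `score_labse` as the detector are assumptions made for the example; `reproduce_evaluation.py` remains the authoritative reference.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

df = pd.read_csv("data/halomi_full.tsv", sep="\t", keep_default_na=False)
natural = df[df["perturbation"] == "natural"]

# Assumption: the lowest class_hall value means "no hallucination", so anything
# above it counts as a hallucination of some degree; check the actual label
# encoding before relying on this.
labels = (natural["class_hall"] > natural["class_hall"].min()).astype(int)

# Most score_ columns are oriented so that higher values indicate worse quality,
# so they can be passed to ROC AUC directly; score_labse is one such column.
# (Assumes every direction contains both positive and negative examples.)
aucs = {
    direction: roc_auc_score(labels[group.index], group["score_labse"])
    for direction, group in natural.groupby("direction")
}
print(pd.Series(aucs).mean())
```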

`halomi_full_source_tokens.tsv` and `halomi_full_target_tokens.tsv`:

| Field | Description |
| -------------------------- | ------------------------------------------------------------------------------------------------------ |
| token | Sentencepiece token from the NLLB model |
| row_id | Index of the sentence in the `halomi_full.tsv` file |
| direction | `src_lang` and `tgt_lang` joined by `-` |
| perturbation | Translation method, either `natural` or `perturbed` |
| label_mask | String of human labels (`1`=hallucinated/omitted, `0`=normal) for each character in the token |
| token_label | Human label for the whole token (`1`=hallucinated/omitted, `0`=normal) |
| token_weight | Length of token in characters (0 for added tokens) |
| start | Start position of the token in the corresponding sentence |
| end | End position of the token in the corresponding sentence |
| score_log_loss | Log probability of the token under the NLLB model |
| score_log_loss_contrastive | Log probability of the token under the NLLB model, minus its log probability conditioned on an empty source |
| score_alti_sum | Sum of ALTI+ contributions for the token |
| score_alti_max | Maximum of ALTI+ contributions for the token |

When reading the files, please make sure that the `"na"` tokens are parsed as strings, not as `NaN`.
Example code:

```Python
import os
import pandas as pd

data_root = "data"  # directory into which the dataset archive was unpacked
source_token_df = pd.read_csv(
    os.path.join(data_root, "halomi_full_source_tokens.tsv"),
    sep="\t",
    keep_default_na=False,
)
```
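
Building on the snippet above, here is a hedged sketch of a token-level omission detection evaluation on the source-token file. The choice of `score_alti_sum` as the detector and its sign convention are assumptions; verify them against `reproduce_evaluation.py`.

```python
from sklearn.metrics import roc_auc_score

# Keep natural translations and tokens that carry weight (added tokens have weight 0).
tokens = source_token_df[
    (source_token_df["perturbation"] == "natural")
    & (source_token_df["token_weight"].astype(int) > 0)
]

# Assumption: a low total ALTI+ contribution of a source token suggests that it
# was omitted from the translation, so the score is negated before computing
# ROC AUC against the human token labels, weighting tokens by character length.
auc = roc_auc_score(
    tokens["token_label"].astype(int),
    -tokens["score_alti_sum"].astype(float),
    sample_weight=tokens["token_weight"].astype(float),
)
print(auc)
```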

## On reproducing the dataset creation

### Translations

The translations were produced using the [nllb](https://github.com/facebookresearch/fairseq/tree/nllb) branch of the `fairseq` repo.
The script `example_translation_script.sh` shows all the parameters that we used to create the translations.

### Detection scores

The script `compute_detection_scores.py` reproduces most of the scores exactly. The exception is
`score_attn_ot`, which requires a large set of scored reference translations that we do not release.
In addition, our implementation of the attention-optimal-transport method is randomized (reproducible with a fixed seed),
but we did not set the seed when producing the original scores, so the reproduced values may differ slightly.

To run this script, you will have to:

- install the requirements from `requirements-detection.txt`;
- install `fairseq` at the [nllb](https://github.com/facebookresearch/fairseq/tree/nllb) branch;
- install [stopes](https://github.com/facebookresearch/stopes);
- from the [NLLB page](https://github.com/facebookresearch/fairseq/tree/nllb), download
the NLLB-200-600M checkpoint, dictionary and Sentencepiece model;
- rename the dictionary file to `dict.eng_Latn.txt` and put it in the directory that will later be used as `NLLB_DATA_DIR`.

Now you can run the script for reproducing the scores as follows:

```bash
python compute_detection_scores.py \
--data-root=data \
    --save-filename=scores_reproduction.tsv \
--nllb-data-dir={PATH TO THE DIRECTORY WITH dict.eng_Latn.txt} \
--nllb-spm-path={PATH TO THE NLLB SPM MODEL} \
--nllb-checkpoint={PATH TO THE NLLB PYTORCH MODEL} \
--attn-references-path={PATH TO THE ATTN-OT REFERENCES, IF YOU HAVE THEM} \
--internal --comet --labse --laser --xnli --sonar
```

## Citations

To refer to the dataset or evaluation results, please cite:

```bibtex
@article{dale2023halomi,
title={HalOmi: A Manually Annotated Benchmark for Multilingual Hallucination and Omission Detection in Machine Translation},
author={Dale, David and Voita, Elena and Lam, Janice and Hansanti, Prangthip and Ropers, Christophe and Kalbassi, Elahe and Gao, Cynthia and Barrault, Lo{\"\i}c and Costa-juss{\`a}, Marta R},
journal={arXiv preprint arXiv:2305.11746},
url={https://arxiv.org/abs/2305.11746},
year={2023}
}
```
5 changes: 5 additions & 0 deletions demo/halomi/attention_optimal_transport/__init__.py
@@ -0,0 +1,5 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
153 changes: 153 additions & 0 deletions demo/halomi/attention_optimal_transport/att_maps_compute.py
@@ -0,0 +1,153 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.

import argparse
import asyncio
import typing as tp
from dataclasses import dataclass
from pathlib import Path

import numpy as np
import pandas as pd
import torch
from omegaconf import OmegaConf
from tqdm.auto import tqdm

import stopes
from stopes.core.stopes_module import Requirements, StopesModule
from stopes.eval.alti.alti_metrics.alti_metrics_utils import binarize_pair
from stopes.eval.alti.alti_metrics.nllb_alti_detector import load_nllb_model


def get_attention_maps(data, alti_hub):
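    # For each source-translation pair: tokenize the pair with the NLLB
    # sentencepiece model, run a forward pass through the translation model,
    # and store the returned attention map averaged over its first dimension.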
maps = []
for row in tqdm(data.itertuples(), total=data.shape[0]):
src_lang, tgt_lang = row.direction.split("-")
st, pt, tt = binarize_pair(
alti_hub, row.src, row.mt, src_lang=src_lang, tgt_lang=tgt_lang
)

with torch.inference_mode():
logits, out = alti_hub.models[0].forward(
src_tokens=st.unsqueeze(0).to(alti_hub.device),
prev_output_tokens=tt.unsqueeze(0).to(alti_hub.device),
src_lengths=torch.tensor(st.shape).to(alti_hub.device),
)
maps.append(out["attn"][0][0].cpu().numpy().mean(0))
return maps


@dataclass
class ScoreConfig:
input_files: tp.List[Path]
output_dir: Path
model_data_dir: str
model_spm_path: str
model_checkpoint_path: str


class AttentionScoreModule(StopesModule):
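    # A stopes module sharded over the input files: each job-array task loads the
    # NLLB model on one GPU, computes attention maps for its TSV file, and saves
    # them to <output_dir>/<input file name>.attmaps.npy.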
def __init__(self, config):
super().__init__(config, ScoreConfig)
self.config: ScoreConfig = config

def requirements(self) -> Requirements:
return Requirements(gpus_per_node=1)

def array(self):
return self.config.input_files

def run(
self,
iteration_value: tp.Optional[tp.Any] = None,
iteration_index: int = 0,
) -> tp.Any:
if iteration_value is not None:
input_file = iteration_value
else:
input_file = self.config.input_files[0]
output_file = (
self.config.output_dir / input_file.with_suffix(".attmaps.npy").name
)

data = pd.read_csv(input_file, sep="\t")
alti_hub = load_nllb_model(
checkpoint=Path(self.config.model_checkpoint_path),
data_dir=Path(self.config.model_data_dir),
spm=Path(self.config.model_spm_path),
src_lang="eng_Latn",
tgt_lang="eng_Latn",
)
alti_hub.cuda()

maps = get_attention_maps(data, alti_hub)

np.save(output_file, maps)
return output_file

def validate(
self,
output: tp.Any,
iteration_value: tp.Optional[tp.Any] = None,
iteration_index: int = 0,
) -> bool:
return output.exists()


async def main(
input_dir,
output_dir,
model_data_dir,
model_spm_path,
model_checkpoint_path,
launcher_cluster,
    launcher_partition,
):
input_files = list(Path(input_dir).iterdir())
output_dir = Path(output_dir)
conf = OmegaConf.structured(
ScoreConfig(
input_files=input_files,
output_dir=output_dir,
model_data_dir=model_data_dir,
model_spm_path=model_spm_path,
model_checkpoint_path=model_checkpoint_path,
)
)
scorer = AttentionScoreModule(conf)
print(f"Processing {len(input_files)} files...")
print(input_files[:5])
launcher = stopes.core.Launcher(
log_folder="executor_logs",
cluster=launcher_cluster,
        partition=launcher_partition,
max_jobarray_jobs=1000,
)
shards = await launcher.schedule(scorer)
return shards


if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"--input-dir",
type=str,
help="directory with input tsv files (with src, mt and direction columns)",
)
parser.add_argument("--output-dir", type=str, help="directory for output files")
parser.add_argument("--model-data-dir", type=str, help="NLLB data directory")
parser.add_argument("--model-spm-path", type=str, help="NLLB tokenizer path")
parser.add_argument(
"--model-checkpoint-path", type=str, help="NLLB checkpoint path"
)
parser.add_argument(
"--launcher-cluster",
type=str,
help="launcher cluster, typically slurm or local",
)
parser.add_argument("--launcher-partiton", type=str, help="slurm partition")
args = parser.parse_args()
asyncio.run(main(**vars(args)))