Code and data for the ACL 2022 paper: https://arxiv.org/abs/2110.07198
* Python 3.6
* PyTorch >= 1.10.1
* Huggingface Transformers 4.13
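For example, the dependencies could be installed as follows (an illustrative command; exact versions may need adjusting for your CUDA setup):
> pip install torch==1.10.1 transformers==4.13.0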
- All models require train, dev and test files in pickle format as input. The specific format is:
  - The pickle file should be a list of dictionaries. Each dictionary has two keys, `pos` and `negs` (or `neg` for pairwise data). `pos` should contain the list of sentences from the positive or coherent document; `negs` should contain the list of negative documents (e.g. incoherent documents, permutations), which are in turn lists of sentences.
  - The `--data_type` argument should be set to `single` or `multiple` depending on the number of negatives in the dataset.
  - e.g. `single`: `[{'pos': ['sentence_1', 'sentence_2', 'sentence_3'], 'neg': ['sentence_2', 'sentence_1', 'sentence_3']}..{}]`
  - e.g. `multiple`: `[{'pos': ['sentence_1', 'sentence_2', 'sentence_3'], 'negs': [['sentence_3', 'sentence_2', 'sentence_1'], ['sentence_1', 'sentence_3', 'sentence_2'], ['sentence_2', 'sentence_3', 'sentence_1']]}..{}]`
You can look at the test sets provided in the `independent_test_sets` folder for the format.
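A minimal sketch of writing such a file yourself (the file names and sentences below are made up purely for illustration):

```python
import pickle

# 'multiple' format: one positive document and a list of negative documents per entry
multiple_data = [{
    'pos': ['The cat sat on the mat.', 'It purred softly.', 'Then it fell asleep.'],
    'negs': [
        ['Then it fell asleep.', 'It purred softly.', 'The cat sat on the mat.'],
        ['It purred softly.', 'The cat sat on the mat.', 'Then it fell asleep.'],
    ],
}]

# 'single' format: exactly one negative per entry, under the key 'neg'
single_data = [{
    'pos': ['The cat sat on the mat.', 'It purred softly.'],
    'neg': ['It purred softly.', 'The cat sat on the mat.'],
}]

with open('my_test_multiple.pkl', 'wb') as f:
    pickle.dump(multiple_data, f)
with open('my_test_single.pkl', 'wb') as f:
    pickle.dump(single_data, f)
```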
Download the pretrained model from: https://www.dropbox.com/sh/2q5s71zxc3o0tp6/AAA1TXbdR_xVBNSKXDkahFqma?dl=0. Navigate into the model folder you want to evaluate with (ensure this matches the pretrained model you downloaded). Run:
> CUDA_VISIBLE_DEVICES=x python eval.py --test_file [test.pkl] --data_type [single,multiple] --pretrained_model [model.pt]
The code will report the accuracy, calculated as the number of times the positive document was scored higher than the negative document. If you are comparing generated texts from two models, you can consistently assign one model's outputs as `pos` and the other's as `neg`, and obtain the percentage of times the model designated `pos` is judged more coherent than the model designated `neg`. Simply subtract the accuracy from 100 to get the vice-versa value.
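For example, a model-vs-model comparison file in the `single` format could be assembled like this (the variable names and generations are hypothetical placeholders; each document must already be split into a list of sentences):

```python
import pickle

# Hypothetical generations from two models, one document (list of sentences) per prompt
model_a_outputs = [['A generated first sentence.', 'A generated second sentence.']]
model_b_outputs = [['B generated first sentence.', 'B generated second sentence.']]

# Model A is consistently assigned to 'pos', model B to 'neg'
data = [{'pos': a, 'neg': b} for a, b in zip(model_a_outputs, model_b_outputs)]

with open('modelA_vs_modelB.pkl', 'wb') as f:
    pickle.dump(data, f)
```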
You can also pass any of the test files in the `independent_test_sets` folder, or provide your own test set based on the format described above. In case the comparison is not so straightforward (e.g., the `LMvLM` dataset), the code also saves the scores in a pickle dump called `temp-eval-dump`.
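The exact layout of the saved scores is determined by `eval.py`, so it is worth inspecting the dump before building any custom comparison; a minimal loading sketch:

```python
import pickle

with open('temp-eval-dump', 'rb') as f:
    scores = pickle.load(f)

# Inspect the structure before writing any custom aggregation over the scores
print(type(scores), len(scores))
first = scores[0] if isinstance(scores, (list, tuple)) else next(iter(scores.items()))
print(first)
```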
To evaluate the Krippendorff's alpha agreement for the `LMvLM` dataset, run the test set to save the scores first, and then run:
> python model_agreement.py LMvLM_Annotations.pkl temp-eval-dump
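`model_agreement.py` handles the agreement computation for the LMvLM annotations; for reference, here is a generic sketch of how Krippendorff's alpha can be computed for binary preference judgments with the `krippendorff` package (the ratings below are made up):

```python
import numpy as np
import krippendorff  # pip install krippendorff

# Rows = raters (e.g. annotators plus the model), columns = compared text pairs,
# values = which text was judged more coherent (0 or 1); np.nan marks a missing rating.
ratings = np.array([
    [1, 0, 1, 1, np.nan],
    [1, 0, 0, 1, 0],
    [1, 1, 0, 1, 0],
])

alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement='nominal')
print(f"Krippendorff's alpha: {alpha:.3f}")
```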
Navigate into the model folder that you want to train (pairwise, contrastive or our full hard negative model with momentum encoder). You can download the formatted INSteD training and dev sets for both the original intrusion and the permuted document tasks here: https://www.dropbox.com/sh/3zoxwzdt0il8x48/AABMPdIriqJp_-0gHw0k2dHfa?dl=0. Note that the WSJ dataset requires an LDC license, and we will be happy to share the WSJ permuted document dataset with you if you have it (please email the authors).
> CUDA_VISIBLE_DEVICES=x python train.py --train_file [train.pkl] --dev_file [dev.pkl]
Please refer to the `args.py` file for all other arguments that can be set. All hyperparameter defaults are set to the values used for the experiments in the paper.
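As a rough, generic sketch of the idea behind the hard-negative model with a momentum encoder (an illustration of MoCo-style momentum updates and a contrastive loss over positive/negative coherence scores, not the exact implementation in `train.py`):

```python
import torch
import torch.nn.functional as F

def momentum_update(encoder, momentum_encoder, m=0.999):
    # The momentum encoder tracks the main encoder via an exponential moving average
    with torch.no_grad():
        for p, p_m in zip(encoder.parameters(), momentum_encoder.parameters()):
            p_m.data.mul_(m).add_(p.data, alpha=1.0 - m)

def contrastive_loss(pos_scores, neg_scores):
    # pos_scores: (batch,) coherence scores of positive documents
    # neg_scores: (batch, num_negs) coherence scores of (hard) negative documents
    # Cross-entropy over [positive, negatives]; the positive document is class 0
    logits = torch.cat([pos_scores.unsqueeze(1), neg_scores], dim=1)
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)
```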
To evaluate the model on a test set, run:
> CUDA_VISIBLE_DEVICES=x python eval.py --test_file [test.pkl] --data_type [single,multiple] --pretrained_model [saved_checkpoint.pt]