This repository contains the code and pre-trained models for *DEnsity: Open-domain Dialogue Evaluation Metric using Density Estimation* (Findings of ACL 2023).
- 0. Preparation
- 1. How to use DEnsity for Evaluation?
- 2. How to Train DEnsity from Scratch?
- 3. How to Reproduce the Paper Result?
## 0. Preparation

### 1. Requirements
```
torch==1.7.1
transformers==4.12.3
scikit-learn
scipy
wget
```
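These can be installed with pip, e.g., `pip install torch==1.7.1 transformers==4.12.3 scikit-learn scipy wget`.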
### 2. Download pre-trained models

Pre-trained models (trained on DailyDialog and ConvAI2): Link
Place the downloaded files as below. The `.pck` files store the pre-computed feature mean and covariance statistics that DEnsity uses for its Mahalanobis-distance-based density estimation:
```
DEnsity/
    logs/
        dd/
            reranker.scl-temp0.1-coeff1.epoch10.lr5e-5/
                models/
                    bestmodel.pth
        convai2/
            reranker.scl-temp0.1-coeff1.epoch10.lr5e-5/
                models/
                    bestmodel.pth
    results/
        pickle_save_path/
            dd/
                maha.ref-train.reranker-reranker.scl-temp0.1-coeff1.epoch10.lr5e-5.positive.pck
            convai2/
                maha.ref-train.reranker-reranker.scl-temp0.1-coeff1.epoch10.lr5e-5.positive.pck
```
## 1. How to use DEnsity for Evaluation?

```python
from evaluators.model import DEnsity
from utils.utils import load_tokenizer_and_reranker

# Load the feature extractor trained on DailyDialog together with its
# pre-computed mean/covariance statistics.
lm_name = "bert-base-uncased"
model_path = "./logs/dd/reranker.scl-temp0.1-coeff1.epoch10.lr5e-5/models/bestmodel.pth"
mean_cov_pck_fname = "./results/pickle_save_path/dd/maha.ref-train.reranker-reranker.scl-temp0.1-coeff1.epoch10.lr5e-5.positive.pck"
tokenizer, model = load_tokenizer_and_reranker(lm_name, model_path)
evaluator = DEnsity(None, mean_cov_pck_fname, tokenizer, model)

# Score a conversation (a list of utterances) at the turn level or at the
# dialogue level.
conversation = ["How are you?", "I'm fine, thank you!", "That's great!!!!"]
turn_level_score = evaluator.evaluate(conversation, is_turn_level=True)  # -498.25882
dialogue_level_score = evaluator.evaluate(conversation, is_turn_level=False)  # -352.70813
```
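As a quick usage sketch building on the snippet above, DEnsity scores can be used to compare candidate responses for the same context. This assumes, as the example above suggests, that `evaluate` with `is_turn_level=True` scores the final utterance given the preceding ones, and that higher (less negative) scores indicate responses closer to the human-response distribution:

```python
# Hypothetical usage sketch: rank candidate replies to one context by their
# turn-level DEnsity score. Reuses `evaluator` from the snippet above.
context = ["How are you?"]
candidates = ["I'm fine, thank you!", "Potato potato potato."]

scored = [
    (cand, evaluator.evaluate(context + [cand], is_turn_level=True))
    for cand in candidates
]
# Higher score = judged more human-like (assumption based on the examples above).
for cand, score in sorted(scored, key=lambda x: x[1], reverse=True):
    print(f"{score:10.3f}  {cand}")
```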
## 2. How to Train DEnsity from Scratch?

The procedure below is an example of training our feature extractor (i.e., the response selection model) on the DailyDialog dataset.
### 1. Dataset Preparation

Download the DailyDialog dataset and place it as below.
```
DEnsity/
    data/
        dd/
            train/
                dialogues_train.txt
            validation/
                dialogues_validation.txt
```
### 2. Run Training

```bash
source scripts/train.sh
```
### 3. How to train on new datasets other than DailyDialog and ConvAI2?

Make a new dataset class (e.g., `MyDatasetforSelection(SelectionDataset)`). You can refer to the existing `ConvAI2forSelection()` class in `utils/dataset_util.py`; a rough sketch follows below.
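The outline below is only a hypothetical sketch: the overridden method name `_load_examples` and its expected return format are assumptions, so check how `ConvAI2forSelection()` implements the `SelectionDataset` interface in `utils/dataset_util.py` before copying it.

```python
# Hypothetical sketch -- the actual SelectionDataset hooks in
# utils/dataset_util.py may differ; mirror ConvAI2forSelection() instead.
from utils.dataset_util import SelectionDataset

class MyDatasetforSelection(SelectionDataset):
    def _load_examples(self, fname):  # assumed hook name
        """Read raw dialogues and return (context, response) pairs."""
        examples = []
        with open(fname) as f:
            for line in f:
                uttrs = [u.strip() for u in line.split("\t") if u.strip()]
                # Treat each utterance as a positive response to the
                # utterances that precede it.
                for i in range(1, len(uttrs)):
                    examples.append((uttrs[:i], uttrs[i]))
        return examples
```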
## 3. How to Reproduce the Paper Result?

### 1. Preprocessing evaluation datasets
```bash
# DailyDialogue-Zhao
# Download the human annotation file from [here](https://drive.google.com/drive/folders/1Y0Gzvxas3lukmTBdAI6cVC4qJ5QM0LBt) to `data/evaluation/dd/dd_annotations.json`.
python preprocess/preprocess_dd_zhao_annotation.py

# GRADE-DailyDialog and GRADE-ConvAI2
# Download the human annotation files from [here](https://github.com/li3cmz/GRADE/tree/main/evaluation).
python preprocess/preprocess_grade_annotation.py

# USR-ConvAI2
python preprocess/preprocess_usr_annotation.py

# Dialogue-level FED
preprocess/preprocess_fed_dialogue.ipynb  # Run notebook
```
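For the dialogue-level FED data, the processed file is expected at `./data/evaluation/fed_dialogue_overall/test_processed.json`, which the evaluation code below reads.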
### 2. Run Evaluation

Main results (turn-level evaluation in Table 1 of the paper):

```bash
source scripts/test.sh
```
To reproduce the dialogue-level evaluation results on the FED dataset, please use the Python code below.
```python
from tqdm import tqdm

from evaluators.model import DEnsity
from evaluators.eval_utils import get_correlation
from utils.utils import load_tokenizer_and_reranker, read_jsonl

# Load model
lm_name = "bert-base-uncased"
model_path = "./logs/dd/reranker.scl-temp0.1-coeff1.epoch10.lr5e-5/models/bestmodel.pth"
mean_cov_pck_fname = "./results/pickle_save_path/dd/maha.ref-train.reranker-reranker.scl-temp0.1-coeff1.epoch10.lr5e-5.positive.pck"
tokenizer, model = load_tokenizer_and_reranker(lm_name, model_path)
evaluator = DEnsity(None, mean_cov_pck_fname, tokenizer, model)

# Read dataset
fed_fname = "./data/evaluation/fed_dialogue_overall/test_processed.json"
fed_examples = read_jsonl(fed_fname)

# Run evaluation
model_scores = []
human_scores = []
tokenizer.truncation_side = "left"  # keep the most recent turns when a dialogue is too long
for el in tqdm(fed_examples):
    uttrs, score = el["history"], el["score"]
    density_score = evaluator.evaluate(uttrs, is_turn_level=False)
    model_scores.append(density_score)
    human_scores.append(score)

# Spearman correlation with human judgments (x100)
print(round(100 * get_correlation(human_scores, model_scores)["spearman-value"], 2))  # 43.34
```