Don’t be Contradicted with Anything! CI-ToD: Towards Benchmarking Consistency for Task-oriented Dialogue System
This repository contains the PyTorch implementation and the data of the paper: Don’t be Contradicted with Anything! CI-ToD: Towards Benchmarking Consistency for Task-oriented Dialogue System. Libo Qin, Tianbao Xie, Shijue Huang, Qiguang Chen, Xiao Xu, Wanxiang Che. EMNLP 2021. [PDF]
This code has been written using PyTorch >= 1.1. If you use any source code or datasets included in this toolkit in your work, please cite the following paper. The BibTeX entry is listed below:
@misc{qin2021dont,
  title={Don't be Contradicted with Anything! CI-ToD: Towards Benchmarking Consistency for Task-oriented Dialogue System},
  author={Libo Qin and Tianbao Xie and Shijue Huang and Qiguang Chen and Xiao Xu and Wanxiang Che},
  year={2021},
  eprint={2109.11292},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
Consistency identification has obtained remarkable success in open-domain dialogue, where it can be used to prevent inconsistent response generation. However, in contrast to the rapid development in open-domain dialogue, few efforts have been made in the task-oriented dialogue direction. In this paper, we argue that the consistency problem is more urgent in the task-oriented domain. To facilitate the research, we introduce CI-ToD, a novel dataset for Consistency Identification in Task-oriented Dialog system. In addition, we not only annotate a single label to enable the model to judge whether the system response is contradictory, but also provide more fine-grained labels (i.e., Dialogue History Inconsistency (HI), User Query Inconsistency (QI) and Knowledge Base Inconsistency (KBI), as shown in the figure below) to encourage the model to identify which sources lead to the inconsistency. Empirical results show that state-of-the-art methods only achieve a performance of 51.3%, which is far behind the human performance of 93.2%, indicating that there is ample room for improving consistency identification ability. Finally, we conduct exhaustive experiments and qualitative analysis to comprehend key challenges and provide guidance for future directions.
We construct the CI-ToD dataset based on the KVRET dataset. We release our dataset together with the code; you can find it under data.
The basic format of the dataset is as follows. Each record includes a multi-turn dialogue, a knowledge base, and the related inconsistency annotations (KBI, QI, HI):
[
    {
        "id": 74,
        "dialogue": [
            {
                "turn": "driver",
                "utterance": "i need to find out the date and time for my swimming_activity"
            },
            {
                "turn": "assistant",
                "utterance": "i have two which one i have one for the_14th at 6pm and one for the_12th at 7pm"
            }
        ],
        "scenario": {
            "kb": {
                "items": [
                    {
                        "date": "the_11th",
                        "time": "9am",
                        "event": "tennis_activity",
                        "agenda": "-",
                        "room": "-",
                        "party": "father"
                    },
                    {
                        "date": "the_18th",
                        "time": "2pm",
                        "event": "football_activity",
                        "agenda": "-",
                        "room": "-",
                        "party": "martha"
                    },
                    .......
                ]
            },
            "qi": "0",
            "hi": "0",
            "kbi": "0"
        },
        "HIPosition": []
    }
]
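To make the layout above concrete, the following sketch reads one record and pulls out the dialogue turns, the knowledge base items and the three labels. It is a minimal illustration, not the loader used by train.py: the helper name `parse_record` is hypothetical, and the inline sample is an abridged copy of the record above with a single KB item. Note that the `qi`, `hi` and `kbi` labels are stored as strings inside `scenario`, not at the top level of the record.

```python
import json

# Abridged copy of the record shown above (only one KB item kept).
SAMPLE = """
[
  {
    "id": 74,
    "dialogue": [
      {"turn": "driver",
       "utterance": "i need to find out the date and time for my swimming_activity"},
      {"turn": "assistant",
       "utterance": "i have two which one i have one for the_14th at 6pm and one for the_12th at 7pm"}
    ],
    "scenario": {
      "kb": {"items": [
        {"date": "the_11th", "time": "9am", "event": "tennis_activity",
         "agenda": "-", "room": "-", "party": "father"}
      ]},
      "qi": "0",
      "hi": "0",
      "kbi": "0"
    },
    "HIPosition": []
  }
]
"""


def parse_record(record):
    """Hypothetical helper: return (turns, kb_items, labels) for one record."""
    turns = [(t["turn"], t["utterance"]) for t in record["dialogue"]]
    scenario = record["scenario"]
    kb_items = scenario["kb"]["items"]
    # Labels are stored as strings under "scenario"; convert them to ints.
    labels = {name: int(scenario[name]) for name in ("qi", "hi", "kbi")}
    return turns, kb_items, labels


records = json.loads(SAMPLE)
turns, kb_items, labels = parse_record(records[0])
print(labels)
```

Keeping the labels in a small dictionary makes it easy to feed them to a multi-label classifier later, but any structure that keeps the three labels separate would work equally well.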
Dataset | QI | HI | KBI | SUM |
---|---|---|---|---|
calendar_train.json | 174 | 56 | 177 | 595 |
calendar_dev.json | 28 | 9 | 24 | 74 |
calendar_test.json | 23 | 8 | 21 | 74 |
navigate_train.json | 453 | 386 | 591 | 1110 |
navigate_dev.json | 55 | 41 | 69 | 139 |
navigate_test.json | 48 | 44 | 71 | 138 |
weather_new_train.json | 631 | 132 | 551 | 848 |
weather_new_dev.json | 81 | 14 | 66 | 106 |
weather_new_test.json | 72 | 12 | 69 | 106 |
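Counts like those in the table can be recomputed from a split file once it is loaded with `json.load`. The sketch below is a minimal example under two assumptions that the table itself does not state: each of the QI, HI and KBI columns counts records whose label is the string "1", and SUM is the number of records in the file. `split_stats` is a hypothetical helper, and the toy records are made up to exercise it; they are not dataset entries.

```python
def split_stats(records):
    """Hypothetical helper: count positive QI/HI/KBI labels and total records."""
    stats = {"QI": 0, "HI": 0, "KBI": 0, "SUM": 0}
    for record in records:
        scenario = record["scenario"]
        for name in ("qi", "hi", "kbi"):
            # Assumption: the string "1" marks an inconsistency of that type.
            if scenario[name] == "1":
                stats[name.upper()] += 1
        # Assumption: SUM is the total number of records in the split.
        stats["SUM"] += 1
    return stats


# Toy records (not real data) with only the fields the function reads.
toy = [
    {"scenario": {"qi": "1", "hi": "0", "kbi": "1"}},
    {"scenario": {"qi": "0", "hi": "0", "kbi": "1"}},
    {"scenario": {"qi": "0", "hi": "0", "kbi": "0"}},
]
stats = split_stats(toy)
print(stats)
```

If the numbers produced this way differ from the table, one of the two assumptions above is the first thing to check.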
Here is the model structure of the non-pre-trained model (a) and the pre-trained models (b and c).
We provide some pre-trained baselines on our proposed CI-ToD dataset. The packages we used are listed below:
-- scikit-learn==0.23.2
-- numpy=1.19.1
-- pytorch=1.1.0
-- fitlog==0.9.13
-- tqdm=4.49.0
-- sklearn==0.0
-- transformers==3.2.0
We highly recommend using Anaconda to manage your Python environment. If you do, you can run the following command directly in the terminal to create the environment:
conda env create -f py3.6pytorch1.1_.yaml
The script train.py is the main entry point of the project. You can run the experiments with the following command:
python -u train.py --cfg KBRetriver_DC/KBRetriver_DC_BERT.cfg
The parameters we use are configured under configure. If you need to adjust them, you can modify them in the relevant files or append parameters to the command.
Finally, you can check the results in the logs folder. You can also run the fitlog command to visualize the results:
fitlog log logs/
All experiments were run on a TITAN Xp except for BART, which was run on a Tesla V100 PCIe 32 GB. The results below may not be the best achievable; the parameters can be tuned further to obtain better results.
Baseline category | Baseline method | QI F1 | HI F1 | KBI F1 | Overall Acc |
---|---|---|---|---|---|
Non Pre-trained Model | ESIM (Chen et al., 2017) | 0.512 | 0.164 | 0.543 | 0.432 |
Non Pre-trained Model | Infersent (Romanov and Shivade, 2018) | 0.557 | 0.031 | 0.336 | 0.356 |
Non Pre-trained Model | RE2 (Yang et al., 2019) | 0.655 | 0.244 | 0.739 | 0.481 |
Pre-trained Model | BERT (Devlin et al., 2019) | 0.691 | 0.555 | 0.740 | 0.500 |
Pre-trained Model | RoBERTa (Liu et al., 2019) | 0.715 | 0.472 | 0.715 | 0.500 |
Pre-trained Model | XLNet (Yang et al., 2020) | 0.725 | 0.487 | 0.736 | 0.509 |
Pre-trained Model | Longformer (Beltagy et al., 2020) | 0.717 | 0.500 | 0.710 | 0.497 |
Pre-trained Model | BART (Lewis et al., 2020) | 0.744 | 0.510 | 0.761 | 0.513 |
Human | Human Performance | 0.962 | 0.805 | 0.920 | 0.932 |
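For readers who want to score their own predictions, the sketch below computes the four columns in plain Python. It assumes that each F1 is the binary F1 of the positive ("inconsistent") class for that label, and that Overall Acc is the fraction of examples whose QI, HI and KBI predictions are all correct. The table does not spell out either definition, so check them against the paper or the repository's evaluation code before comparing numbers. The function names and the toy labels are hypothetical.

```python
def binary_f1(gold, pred):
    """F1 of the positive class (label 1) for two equal-length 0/1 lists."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def evaluate(gold, pred):
    """gold/pred: equal-length lists of (qi, hi, kbi) tuples with 0/1 entries."""
    scores = {}
    for idx, name in enumerate(("QI F1", "HI F1", "KBI F1")):
        scores[name] = binary_f1([g[idx] for g in gold], [p[idx] for p in pred])
    # Assumption: an example counts as correct only if all three labels match.
    exact = sum(1 for g, p in zip(gold, pred) if tuple(g) == tuple(p))
    scores["Overall Acc"] = exact / len(gold)
    return scores


# Toy labels (not real model output) to exercise the functions.
gold = [(1, 0, 1), (0, 0, 0), (1, 1, 0), (0, 0, 1)]
pred = [(1, 0, 1), (0, 0, 1), (1, 0, 0), (0, 0, 1)]
scores = evaluate(gold, pred)
print(scores)
```

Under this definition a model can reach a high F1 on every individual label and still have a much lower overall accuracy, because one wrong label out of three makes the whole example count as wrong.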
If you submit papers that use these datasets, please consider sending a pull request to merge your results into the leaderboard. By submitting, you acknowledge that your results were obtained purely by training on the training sets and tuning on the dev sets (e.g., you evaluated on the test set only once).
Baseline method | QI F1 | HI F1 | KBI F1 | Overall Acc |
---|---|---|---|---|
ESIM (Chen et al., 2017) | 0.512 | 0.164 | 0.543 | 0.432 |
Infersent (Romanov and Shivade, 2018) | 0.557 | 0.031 | 0.336 | 0.356 |
RE2 (Yang et al., 2019) | 0.655 | 0.244 | 0.739 | 0.481 |
BERT (Devlin et al., 2019) | 0.691 | 0.555 | 0.740 | 0.500 |
RoBERTa (Liu et al., 2019) | 0.715 | 0.472 | 0.715 | 0.500 |
XLNet (Yang et al., 2020) | 0.725 | 0.487 | 0.736 | 0.509 |
Longformer (Beltagy et al., 2020) | 0.717 | 0.500 | 0.710 | 0.497 |
BART (Lewis et al., 2020) | 0.744 | 0.510 | 0.761 | 0.513 |
Human Performance | 0.962 | 0.805 | 0.920 | 0.932 |
Thanks to all the annotators, Lehan Wang, Ran Duan, Fuxuan Wei, Yudi Zhang and Weiyun Wang, for their patient annotation!
Thanks to our adviser Wanxiang Che for his support and guidance!