| Dataset | Link | Description |
| --- | --- | --- |
| SQuAD 1.0 & 2.0 | https://rajpurkar.github.io/SQuAD-explorer/ | The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage; in SQuAD 2.0, some questions are unanswerable from the passage. |
| Who-Did-What | https://tticnlp.github.io/who_did_what/leaderBoard.html | We have constructed a new "Who-did-What" dataset of over 200,000 fill-in-the-gap (cloze) multiple-choice reading comprehension problems drawn from the LDC English Gigaword newswire corpus. |
| MS MARCO | http://www.msmarco.org/ | The Microsoft MAchine Reading COmprehension dataset: a large-scale reading comprehension and question answering dataset whose questions are sampled from real, anonymized Bing search queries. |
| RACE | http://www.qizhexie.com/data/RACE_leaderboard | The RACE dataset is a large-scale ReAding Comprehension dataset collected from English examinations created for middle school and high school students. |
| MovieQA | http://movieqa.cs.toronto.edu/leaderboard/ | We introduce the MovieQA dataset, which aims to evaluate automatic story comprehension from both video and text. Each question comes with a set of five highly plausible answers, only one of which is correct. The questions can be answered using multiple sources of information: movie clips, plots, subtitles, and, for a subset, scripts and DVS (Descriptive Video Service). |
| HotpotQA | https://hotpotqa.github.io/ | A Dataset for Diverse, Explainable Multi-hop Question Answering. HotpotQA is a question answering dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems. |
| CoQA | https://stanfordnlp.github.io/coqa/ | A Conversational Question Answering Challenge. CoQA is a large-scale dataset for building conversational question answering systems: it measures the ability of machines to understand a text passage and answer a series of interconnected questions that appear in a conversation. |
| DREAM | https://dataset.org/dream/ | A Challenge Dataset and Models for Dialogue-Based Reading Comprehension. DREAM is a multiple-choice Dialogue-based REAding comprehension exaMination dataset. In contrast to existing reading comprehension datasets, DREAM is the first to focus on in-depth multi-turn, multi-party dialogue understanding. |
| QuAC | http://quac.ai/ | Question Answering in Context is a dataset for modeling, understanding, and participating in information-seeking dialog. Data instances consist of an interactive dialog between two crowd workers: (1) a student who poses a sequence of freeform questions to learn as much as possible about a hidden Wikipedia text, and (2) a teacher who answers the questions by providing short excerpts (spans) from the text. QuAC introduces challenges not found in existing machine comprehension datasets: its questions are often more open-ended, unanswerable, or only meaningful within the dialog context. |
| ReCoRD | https://sheng-z.github.io/ReCoRD-explorer/ | Reading Comprehension with Commonsense Reasoning Dataset (ReCoRD) is a large-scale reading comprehension dataset that requires commonsense reasoning. ReCoRD consists of queries automatically generated from CNN/Daily Mail news articles; the answer to each query is a text span from a summarizing passage of the corresponding news article. The goal of ReCoRD is to evaluate a machine's ability to perform commonsense reasoning in reading comprehension. ReCoRD is pronounced [ˈrɛkərd]. |
| Cosmos QA | https://wilburone.github.io/cosmos/ | Cosmos QA is a large-scale dataset of 35.6K problems that require commonsense-based reading comprehension, formulated as multiple-choice questions. It focuses on reading between the lines over a diverse collection of people's everyday narratives, asking questions about the likely causes or effects of events that require reasoning beyond the exact text spans in the context. |
| Natural Questions | https://ai.google.com/research/NaturalQuestions | To help spur development in open-domain question answering, we have created the Natural Questions (NQ) corpus, along with a challenge website based on this data. The NQ corpus contains questions from real users, and it requires QA systems to read and comprehend an entire Wikipedia article that may or may not contain the answer to the question. The inclusion of real user questions, and the requirement that solutions read an entire page to find the answer, make NQ a more realistic and challenging task than prior QA datasets. |
| DROP | https://allennlp.org/drop | With system performance on existing reading comprehension benchmarks nearing or surpassing human performance, we need a new, hard dataset that improves systems' capabilities to actually read paragraphs of text. DROP is a crowdsourced, adversarially-created, 96k-question benchmark in which a system must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting). These operations require a much more comprehensive understanding of the content of paragraphs than what was necessary for prior datasets. |
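
Several of these benchmarks distribute their data as span-annotated JSON, with SQuAD's schema being the most widely copied. As a concrete illustration, here is a minimal sketch that walks the official SQuAD v2.0 layout (`data` → `paragraphs` → `qas` → `answers`, with character-offset spans and an `is_impossible` flag for unanswerable questions). The file name `train-v2.0.json` matches the official download; treat the snippet as an assumption-laden example, not reference code for any particular leaderboard.

```python
import json

# Minimal walk over the SQuAD v2.0 JSON layout (assumed here):
# data -> articles -> paragraphs -> qas -> answers.
with open("train-v2.0.json", encoding="utf-8") as f:
    squad = json.load(f)

for article in squad["data"]:
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]
        for qa in paragraph["qas"]:
            question = qa["question"]
            # SQuAD 2.0 marks unanswerable questions; SQuAD 1.x lacks the key,
            # so .get() keeps this sketch compatible with both versions.
            if qa.get("is_impossible"):
                print(f"[unanswerable] {question}")
                continue
            for answer in qa["answers"]:
                start = answer["answer_start"]
                text = answer["text"]
                # Each answer is a character-level span into the context.
                assert context[start:start + len(text)] == text
                print(f"{question} -> {text!r} (char {start})")
        break  # one paragraph is enough for a demo
    break
```

The same traversal pattern applies, with renamed fields, to datasets that reuse SQuAD-style span annotation; multiple-choice sets such as RACE, DREAM, and Cosmos QA instead pair each question with a list of options and a gold label.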