Continuing to learn about the machine reading comprehension (MRC) space. Today I learned about the different datasets for each of the MRC tasks outlined in the previous blog post. I also learned about the different evaluation metrics used in MRC.

MRC Datasets

Cloze Test Datasets

  • CNN & Daily Mail
    • Evaluates a machine reading system by asking it to read a document and then predict which entity the placeholder in the bullet-point summary refers to.

  • Children’s Book Test (CBT)
    • Unlike CNN/DM, which is limited to named entities, CBT has 4 types of missing items: named entities, nouns, verbs, and prepositions

  • LAMBADA
    • The task is word prediction: predict the last word of the target sentence. Compared to CBT, LAMBADA requires more understanding of the wider context, because the target word is difficult to predict correctly from the target sentence alone

  • Who-did-What
    • An attempt to reduce syntactic similarity between questions and documents (avoid overlap). In the dataset, each sample is formed from two independent articles; one serves as the context and the other is used to generate questions. A limitation is that it only focuses on person-name entities

  • CLOTH
    • In contrast to the datasets above, this dataset is human-created by teachers! The questions come from English exams and are good for examining vocabulary, reasoning, and grammar

  • CliCR
    • Healthcare and medicine related. A large-scale cloze-style dataset that’s domain specific. Format of dataset is similar to CNN/DM. This dataset promotes practical applications such as clinical decision-making

Multiple-Choice Datasets

  • MCTest
    • 500 fictional stories. Each story has 4 questions with 4 candidate answers. Fictional stories were chosen to avoid the need for external knowledge. This dataset inspired the CBT and LAMBADA datasets

  • RACE
    • Similar to CLOTH. Unlike CNN/DM or CBT, this dataset has a great variety of passage types. RACE requires more reasoning because the questions and answers are human-generated. It is a large-scale dataset that supports the training of deep learning models

Span Extraction Datasets

  • SQuAD
    • Large dataset of 100,000+ questions, with answers selected from articles. High quality. A milestone of MRC, and the MRC competition built on it has led to much research from academia and industry

  • NewsQA
    • Similar to SQuAD, except that the articles come from CNN whereas SQuAD is based on Wikipedia. A key distinction is that some questions in NewsQA have no answer in the given context, which makes it more challenging and closer to reality. Led to the development of SQuAD v2.0, which includes unanswerable questions

  • TriviaQA
    • The uniqueness of this dataset lies in its construction methodology. Earlier datasets have strong dependencies between questions and the evidence needed to answer them, because the questions are written from the given articles. TriviaQA is created by gathering question-answer pairs from trivia and quiz-league sites; evidence to answer these questions is then searched for on other websites and Wikipedia. 650,000+ question-answer-evidence triples. High syntactic variability between questions and contexts

  • DuoRC
    • Similar to the Who-did-What dataset, questions and answers are generated from two different documents related to the same movie, one from Wikipedia and one from IMDb. The questions and the answers are created by different groups of crowd-workers. The dataset also has unanswerable questions

Free Answering Datasets

  • bAbI
    • Synthetic dataset. 20 tasks with a simulation of a classic text adventure game. Each task is independent and tests one aspect of text understanding and uses basic deduction and induction

  • MS MARCO
    • Another milestone in MRC after SQuAD. 4 strong features. Firstly, all questions are collected from real user queries. Secondly, each question has 10 related documents (from Bing search) as the context. Thirdly, labelled answers are generated by humans and are not restricted to spans of the context. Fourthly, there are multiple answers to each question and sometimes they even conflict

  • SearchQA
    • Very similar to TriviaQA. Major difference is that TriviaQA only has one document with evidence per QA pair whereas SearchQA has, on average, 49.6 related snippets per pair

  • NarrativeQA
    • Based on book stories and movie scripts. The authors search for related summaries on Wikipedia and generate QA pairs from those summaries

  • DuReader
    • Similar to MS MARCO. Questions and documents are collected from Baidu Search and Baidu Zhidao (QA community). Answers are human-generated. DuReader differs in that it provides new question types such as yes/no and opinion. These questions sometimes require summarizing over multiple documents

  • HotpotQA

Evaluation Metrics for MRC

  • Accuracy
    • Accuracy w.r.t ground truth answers
      • Common to evaluate cloze tests and multiple-choice tasks

    • Exact match (variant of accuracy)
      • Evaluate whether a predicted answer span matches the ground-truth sequence exactly or not

  • F1 Score
    • Span extraction task

    • In terms of MRC, both candidate and reference answers are treated as bags of tokens

    • This metric loosely measures the average overlap between the prediction and the ground-truth answer

  • ROUGE-L
    • Free answering task

    • Measures the longest common subsequence between the gold answers and the predicted answers. The more overlap, the higher the ROUGE-L score

  • BLEU
    • Free answering task

    • Measures the similarity between predicted answers and the ground truth, and also tests the readability of candidates
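
To make the metrics above more concrete, here is a minimal sketch of exact match, token-level F1, and ROUGE-L in Python. It is not the official evaluation script of any dataset. The function names, the whitespace tokenizer, and the `beta` weighting parameter are my own assumptions for illustration; real evaluation scripts usually also strip punctuation and articles and take the best score over several reference answers.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split is enough for this sketch.
    return text.lower().split()


def exact_match(prediction, reference):
    """Return 1.0 if the token sequences are identical, else 0.0."""
    return float(tokenize(prediction) == tokenize(reference))


def f1_score(prediction, reference):
    """Token-level F1, treating both answers as bags of tokens."""
    pred, ref = tokenize(prediction), tokenize(reference)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(prediction, reference, beta=1.0):
    """ROUGE-L F-measure based on the longest common subsequence.

    beta weights recall against precision; beta=1.0 is an assumption here.
    """
    pred, ref = tokenize(prediction), tokenize(reference)
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)
```

For example, with the made-up reference "the cat sat on the mat", the prediction "mat the on" gets the same F1 as "the cat sat" (all three predicted tokens appear in the reference), but a lower ROUGE-L, because ROUGE-L only rewards tokens that appear in the same order as in the reference.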


