There are 4 different MRC tasks!
The formal definition of MRC is:
Given the context C and question Q, machine reading comprehension (MRC) tasks ask the model to give the correct answer A to the question Q by learning the function F such that A = F(C, Q).
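The formalization A = F(C, Q) can be sketched as a function interface. A minimal illustration (the model and sentences below are made up for this example, not from the paper):

```python
# A = F(C, Q): an MRC model is any callable mapping (context, question) -> answer.
from typing import Callable

MRCModel = Callable[[str, str], str]  # F: (C, Q) -> A

def toy_model(context: str, question: str) -> str:
    # Trivial illustrative baseline: always return the first sentence
    # of the context, ignoring the question entirely.
    return context.split(".")[0].strip()

context = "MRC systems read a passage. They then answer questions about it."
question = "What do MRC systems read?"
print(toy_model(context, question))  # "MRC systems read a passage"
```

A real F would of course condition on Q; the point is only the shape of the mapping.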
We can categorise MRC into 4 different tasks (mainly based on the complexity of the answer the model is required to generate):
- Cloze Tests
- Multiple Choice
- Span Extraction
- Free Answering
Cloze tests are also known as gap-filling tests. They are commonly used in exams to evaluate students’ language proficiency. Questions are generated by removing words or entities from a passage; to answer, one fills in the blank. Cloze tests require understanding the context as well as selecting the right word or entity from the vocabulary.
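Constructing a cloze question is mechanically simple, which is why these datasets score well on construction. A hedged sketch (the helper and sentence are illustrative, not from the paper):

```python
# Build a cloze question by removing one target word from a passage.
def make_cloze(passage: str, target: str, blank: str = "_____"):
    """Replace the first occurrence of `target` with a blank.

    Returns (cloze question, gold answer)."""
    assert target in passage, "target word must occur in the passage"
    question = passage.replace(target, blank, 1)
    return question, target

q, a = make_cloze("The cat sat on the mat.", "mat")
print(q)  # "The cat sat on the _____."
print(a)  # "mat"
```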
Multiple choice is fairly self-explanatory: given the provided context, the model must select the correct answer to the question from a list of candidate answers.
Unlike cloze tests or multiple choice, the span extraction task requires the machine to extract a span of text (a contiguous subsequence) from the given context as the answer to the question.
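In SQuAD-style span extraction, the answer is typically encoded as start/end positions within the context. A minimal sketch of that answer format, assuming character-level offsets (`find_span` is an illustrative helper, not a model):

```python
# Encode a span-extraction answer as (start, end) character offsets.
def find_span(context: str, answer: str):
    """Return (start, end) offsets of `answer` in `context`, or None
    if the answer is not a contiguous span of the context."""
    start = context.find(answer)
    if start == -1:
        return None
    return start, start + len(answer)

context = "SQuAD was released by Stanford in 2016."
span = find_span(context, "Stanford")
print(span)
print(context[span[0]:span[1]])
```

Note that `find_span(context, "a university in California")` would return None: anything that is not a literal substring of the context cannot be an answer in this task.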
Free answering is the most complicated task. To answer questions, the machine needs to comprehend and reason across multiple spans of text and summarise the evidence. In free answering there are no restrictions on the answer form: the task focuses on using free-form text to answer questions.
Comparison of different tasks!
The paper compared the 4 MRC tasks across 5 dimensions:
- Construction – How easy it is to construct datasets for the task. The easier, the higher the score.
- Understanding – How well the task evaluates the machine’s understanding ability. If the task requires more understanding, the score is higher.
- Flexibility – Flexibility in the answer form. The more flexible, the higher the score.
- Evaluation – How easily the task can be evaluated. The easier, the higher the score.
- Application – How readily the task can be applied to the real world. The closer to real applications, the higher the score.
Scores range from 1 to 4, with 4 being the best and 1 the worst. The figure below is taken from the paper (in the source) and showcases how each task scores along the different dimensions. Cloze tests don’t seem to test machine understanding well and don’t carry over to real-world applications, because the answer form is restricted to single words or entities. Multiple choice can easily be evaluated, and building datasets for it is not very difficult. Span extraction scores moderately in all 5 dimensions, which explains why a lot of research focuses on this task; SQuAD is a popular span extraction dataset. However, its answers are constrained to subsequences of the original text, which is far from real-world applications. Finally, free answering dominates the understanding, flexibility, and application dimensions, as it is the closest to real-world use. However, the very flexibility of the answer form makes building datasets and evaluating answers extremely challenging.