Now that we have explored the space of KB-QA, we will move to neural methods for text-QA. Since text-QA generates answers by reading and understanding multiple text passages, it is closely linked to machine reading comprehension (MRC). We will explore the encoding and reasoning methods used in SOTA MRC models, and then discuss multi-turn, conversational text-QA agents.

Introduction to Machine Reading for Text-QA

The objective of MRC is to read and understand multiple text passages and answer any question about the passages. It is the core engine of text-QA agents.

Datasets for MRC
  1. Wikipedia-based. WikiReading, SQuAD, WikiHop, DRCD

  2. News-based. CNN / Daily Mail, NewsQA, RACE, ReCoRD

  3. Fictional. MCTest, CBT, NarrativeQA

  4. Science-based. ARC

  5. General. MS MARCO, TriviaQA, SearchQA, DuReader

The figure below shows how some of these datasets differ. The SQuAD dataset pairs each question with an answer span in a Wikipedia article, whereas the MS MARCO dataset uses real user queries and generates answers over a large collection of Web documents. It even includes questions that are unanswerable.

Neural MRC Models

As mentioned, in SQuAD, given a question and a passage, the model is required to locate an answer span by predicting the start and end positions of the span in the passage. A typical neural MRC model has three components:

  1. Encoding. Symbolic representation to neural space

  2. Reasoning. Identify the answer vector in the neural space

  3. Decoding. Converting the answer vector back to symbolic representation (natural language)

The figure below shows two examples of SOTA MRC models: SAN and BiDAF. We will use these two models to illustrate the three components of a neural MRC model.


Most MRC models have three layers for encoding:

  1. Lexicon embedding layer

  2. Contextual embedding layer

  3. Attention layer

The lexicon embedding layer maps all the words (and sometimes characters) to a vector space using pretrained embeddings. You can improve the word embeddings by concatenating each word embedding vector with other linguistic features such as character-level embeddings, POS tags, and named entities.
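As an illustration, the concatenation in the lexicon embedding layer can be sketched as follows. All dimensions and feature vectors here are made up for the example; a real model would take the word vector from pretrained embeddings such as GloVe and the character features from something like a char-CNN:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not from any specific model).
word_dim, char_dim, n_pos_tags = 300, 50, 4

def lexicon_embed(word_vec, char_vec, pos_id):
    """Concatenate a pretrained word vector with character-level
    features and a one-hot POS-tag feature."""
    pos_onehot = np.zeros(n_pos_tags)
    pos_onehot[pos_id] = 1.0
    return np.concatenate([word_vec, char_vec, pos_onehot])

# One token: pretend these came from GloVe and a char-CNN.
w = rng.standard_normal(word_dim)
c = rng.standard_normal(char_dim)
x = lexicon_embed(w, c, pos_id=2)
print(x.shape)  # (354,)
```

The resulting per-token vector is what gets fed into the contextual embedding layer.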

The contextual embedding layer incorporates context information into the word embeddings. This allows the same word to have different embeddings depending on the context in which it appears. This is usually done with a two-layer BiLSTM, whose forward and backward outputs are concatenated to obtain contextually aware embeddings of all the tokens in the question and passage. ELMo and BERT are the SOTA pretrained contextualised embeddings.
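A minimal sketch of the contextual embedding layer using PyTorch's `nn.LSTM`. The sizes are illustrative, and a real model would feed in the lexicon-layer embeddings rather than random tensors:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes; real models use the lexicon-layer outputs as input.
seq_len, emb_dim, hidden = 7, 32, 64

# Two-layer BiLSTM; forward and backward outputs are concatenated,
# so each token's contextual embedding has size 2 * hidden.
bilstm = nn.LSTM(emb_dim, hidden, num_layers=2,
                 bidirectional=True, batch_first=True)

tokens = torch.randn(1, seq_len, emb_dim)   # one passage (batch of 1)
ctx, _ = bilstm(tokens)
print(ctx.shape)  # torch.Size([1, 7, 128])
```

The same network (or a twin of it) is applied to both the question and the passage so their tokens live in a shared contextual space.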

The attention layer is used to compute query-aware word embeddings for each word in the passage and generates the memory M for the next component of our neural MRC models, reasoning. The attention layer consists of three steps:

  1. Compute a similarity score between the question words and each passage word. This measures which passage words are most relevant to the question words, so the model can pay more attention to them (the attention score)

  2. Normalise these attention scores through a softmax, turning them into probabilities

  3. Compute the question-aware representation of the passage as a weighted sum for each passage word, using the attention probabilities as weights
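The three steps above can be sketched in NumPy. The dot-product similarity is an illustrative stand-in: actual models (e.g. BiDAF) use learned similarity functions and also compute attention in the reverse direction:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 8, 5, 12                # embedding dim, #question words, #passage words

Q = rng.standard_normal((m, d))   # contextual question embeddings
P = rng.standard_normal((n, d))   # contextual passage embeddings

# Step 1: similarity score between each passage word and each question
# word (plain dot product here; real models learn this function).
scores = P @ Q.T                  # shape (n, m)

# Step 2: softmax over the question words -> attention probabilities.
e = np.exp(scores - scores.max(axis=1, keepdims=True))
probs = e / e.sum(axis=1, keepdims=True)

# Step 3: question-aware representation of each passage word as the
# attention-weighted sum of question embeddings.
attended = probs @ Q              # shape (n, d)

print(np.allclose(probs.sum(axis=1), 1.0))  # True
```

`attended` is then combined with the contextualised passage embeddings by the fusion function described next.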

Lastly, we form the working memory M by using a fusion function to combine the two input matrices: the question-aware representation of the passage and the contextualised representation of the passage. Different models use different fusion functions. For example, SAN’s fusion function consists of a concatenation layer, self-attention, and a BiLSTM layer.


There are two categories of MRC models in terms of reasoning: single-step and multi-step models. A single-step reasoning model matches the question and document only once to generate the final answer. The goal here is to find the answer span by predicting its start and end positions over the working memory M. There are three steps:

  1. Summarise the question vector

  2. Use a bilinear function and a softmax to obtain the probability distribution of the start index over the passage. The inputs here are the summarised question vector and the working memory M

  3. Use another bilinear function and a softmax to obtain the probability distribution of the end index over the passage. The inputs here are the summarised question vector, the probability distribution of the start index, and the working memory M
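A schematic NumPy sketch of the span-prediction steps above. The bilinear weights are random stand-ins for learned parameters, and the way the start distribution is folded into the end prediction is a simplification of what actual models do:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 8, 12                        # hidden size, passage length

M = rng.standard_normal((n, d))     # working memory (one vector per passage word)
q = rng.standard_normal(d)          # summarised question vector

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Step 2: bilinear scoring, score_i = q^T W M_i, then softmax over positions.
W_start = rng.standard_normal((d, d))
p_start = softmax(M @ (W_start @ q))        # P(start = i)

# Step 3: the end prediction also conditions on the start distribution;
# here we fold the start-weighted memory into the query vector as a
# simplified version of that conditioning.
W_end = rng.standard_normal((d, d))
q_end = q + p_start @ M
p_end = softmax(M @ (W_end @ q_end))        # P(end = j)

start, end = int(p_start.argmax()), int(p_end.argmax())
print(start, end)
```

At decoding time, the predicted span is the passage substring between the most probable start and end positions.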

Multi-step reasoning models have been shown to outperform single-step ones. You can either pre-determine a fixed number of reasoning steps or use dynamic multi-step reasoning; the latter requires reinforcement learning and has been shown to outperform the former. SAN uses a fixed number of reasoning steps: it generates a prediction at each step, and the final answer is the average of all the predictions. Throughout the reasoning steps, the model maintains a state vector and uses the previous state vector to perform the next step of reasoning.
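Schematically, fixed-step multi-step reasoning with averaged predictions can be sketched as follows. The additive state update is only a stand-in for the GRU-based update SAN actually uses, and all weights are random for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, T = 8, 12, 3                  # hidden size, passage length, #reasoning steps

M = rng.standard_normal((n, d))     # working memory
s = rng.standard_normal(d)          # initial state vector (from the question)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

W = rng.standard_normal((d, d))     # shared bilinear scorer (illustrative)

preds = []
for _ in range(T):
    p = softmax(M @ (W @ s))        # per-step prediction over passage positions
    preds.append(p)
    # Update the state from the attended memory; SAN uses a GRU cell
    # here -- this plain additive update is only a schematic stand-in.
    s = s + p @ M

# Final answer distribution: average of the per-step predictions.
p_final = np.mean(preds, axis=0)
print(np.isclose(p_final.sum(), 1.0))  # True
```

Averaging the per-step predictions acts like an ensemble over reasoning depths, which is part of why fixed-step SAN is robust to the choice of T.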

Conversational Text-QA Agents

Again, we want to reach conversational text-QA agents that can handle multi-turn conversations and use the conversation history to provide more fine-grained answers. A conversational text-QA agent is very similar to the KB-QA one, except that the soft-KB lookup module is replaced with a text-QA module. The text-QA module retrieves relevant passages from the Web using a search engine and feeds them into the MRC model to generate the final answer. The MRC model here needs to be able to refer back to the conversation history.

Datasets for conversational MRC models
  1. Conversational Question Answering (CoQA)

  2. Question Answering in Context (QuAC)

These two datasets are shown in the figure below. Essentially, the MRC model needs to generate an answer based on the passage P, the conversation history in the form of QA pairs, and a question Q. This requires two extensions to the original components:

  1. Extend the encoding module to encode passage, answer, and conversation history

  2. Extend the reasoning module to generate answers that might not overlap with the passage


