Objective and Contribution
Proposed Question Answering and Generation for Summarisation (QAGS), an automatic evaluation method designed to identify factual inconsistencies in a generated summary. QAGS applies question answering to both the source and the summary: a factually consistent summary should produce answers similar to those produced from the source. We compared QAGS with human judgements and found high correlations. Lastly, QAGS offers interpretability: the generated questions and answers indicate which parts of the summary are factually inconsistent.
The QAGS framework consists of 3 steps:
A question generation (QG) model generates questions based on the generated summary
Question answering (QA) models answer the questions using both the source document and the generated summary
Answer similarity is computed based on how similar the two sets of answers are
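The three steps above can be sketched as a single scoring function. This is a minimal illustration, not the authors' implementation: the QG model, QA model, and similarity metric are passed in as hypothetical callables.

```python
def qags_score(source, summary, generate_questions, answer, similarity):
    """QAGS-style score: average answer similarity over all generated questions.

    generate_questions, answer, and similarity are stand-ins for the QG model,
    the QA model, and the answer-similarity metric respectively.
    """
    questions = generate_questions(summary)
    if not questions:
        return 0.0
    total = 0.0
    for q in questions:
        a_source = answer(q, source)    # answer extracted from the source document
        a_summary = answer(q, summary)  # answer extracted from the generated summary
        total += similarity(a_source, a_summary)
    return total / len(questions)
```

With real models, `generate_questions` would over-sample and filter questions, and `answer` would be an extractive QA model returning text spans.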
We train a seq2seq model to generate questions conditioned on both the answer and the source article. We over-sample questions and then apply several filters to remove low-quality ones, such as removing duplicates and questions with three tokens or fewer. We also feed the questions into the QA model and remove those predicted to have no answer. At the end of this step, we have K questions generated from the summary.
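The filtering stage can be sketched as follows. This is a schematic version under stated assumptions: `is_answerable` is a hypothetical predicate wrapping the QA model's no-answer prediction, not an API from the paper.

```python
def filter_questions(questions, is_answerable, summary, k):
    """Keep at most k questions, dropping duplicates, very short questions,
    and questions the QA model predicts are unanswerable from the summary."""
    seen, kept = set(), []
    for q in questions:
        if len(q.split()) <= 3:          # three tokens or fewer: too short
            continue
        if q in seen:                    # duplicate question
            continue
        if not is_answerable(q, summary):  # QA model predicts no answer
            continue
        seen.add(q)
        kept.append(q)
        if len(kept) == k:
            break
    return kept
```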
Here, we use extractive QA models that extract answers as text spans from the source document and the summary; future work could experiment with abstractive QA models. In this step, we answer the generated questions using both the source and the summary to obtain two sets of answers.
Here, we use a simple token-level F1 score to compare the answers and measure answer similarity. In this final step, we compare each pair of answers using this similarity metric and average the similarity scores over all questions.
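Token-level F1 is the standard SQuAD-style overlap measure. A minimal sketch (omitting the punctuation and article normalisation that the official SQuAD evaluation also applies):

```python
from collections import Counter

def token_f1(pred, gold):
    """Harmonic mean of token precision and recall between two answer strings."""
    pred_toks = pred.lower().split()
    gold_toks = gold.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```

For example, "Faisal Khan" vs "Usman Khan" shares one of two tokens on each side, giving F1 = 0.5 rather than the 0.0 an exact-match comparison would give.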
We evaluate on two datasets: CNN/DM and XSUM. We measured the correlation between QAGS and human judgements of factual consistency. For each summary, we collected 3 annotations and obtained a single correctness score by taking the majority vote for each sentence and averaging the binary scores across summary sentences. We compared QAGS with other common summarisation metrics such as ROUGE, METEOR, BLEU, and BERTScore.
The table below shows the correlation between different evaluation metrics and human judgements of factual consistency. QAGS achieved the highest correlation by a substantial margin, roughly twice that of the next best metric. QAGS scored considerably lower on XSUM but still outperformed the other metrics by a wide margin, reflecting the fact that the XSUM dataset is more abstractive.
We use different models for different steps of the framework, so we explore how the quality of these models affects evaluation. For our QA model, we fine-tune different versions of BERT on SQuAD. The results are shown below: the QA model with the highest F1 score does not necessarily achieve the highest correlation with human judgements. On both CNN/DM and XSUM, the bert-base QA model achieved the highest correlation despite having the lowest F1 score.
For our QG model, we use models with increasing perplexity on the NewsQA dataset. The results are shown below. QAGS is robust to QG model quality: there is no clear trend of higher-quality QG models leading to higher correlation with human judgements.
The QAGS framework requires labelled data to train both the QG and QA models. This is feasible in data-rich domains, but in niche domains we may not have access to labelled data and are forced to train on out-of-domain data, which may degrade QAGS quality due to domain shift. We assess the impact of this domain shift by training our QG model on SQuAD, which is a collection of Wikipedia articles rather than CNN articles. The correlation scores with the SQuAD-trained QG model are 51.53 on CNN/DM and 15.28 on XSUM. These are lower than the scores obtained with the NewsQA-trained QG model, but still significantly outperform the other evaluation metrics.
Number of questions
Lastly, we explore how the number of questions affects the correlation with human judgements. The results are shown below: as the number of questions increases, correlation scores rise consistently on both evaluation datasets. We also observed that a) with only 5 questions, QAGS already achieves higher correlations than the other evaluation metrics, and b) there is only a small increase in correlation when going from 20 to 50 questions, showing diminishing marginal benefit from adding more questions.
Answer similarity metrics
There are many ways to measure the similarity between two answers. An alternative to token-level F1 is exact match (EM), which is more restrictive. With EM, we obtain correlation scores of 45.97 on CNN/DM and 18.10 on XSUM. Other answer similarity metrics remain open for exploration.
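An EM comparison can be sketched as below. This is an illustrative stand-in (the exact normalisation the authors use, e.g. casing, is not specified here); it shows why EM is more restrictive than token-level F1: any partial overlap scores zero.

```python
def exact_match(pred, gold):
    """1.0 if the two answers are identical after trimming and lowercasing, else 0.0."""
    return float(pred.strip().lower() == gold.strip().lower())
```

Under EM, an answer pair like "Mr Usman Khan" vs "Usman Khan" scores 0.0 even though token-level F1 would award partial credit.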
Re-ranking with QAGS
We explore a sentence-ranking experiment with around 400 triplets, each consisting of a source sentence from CNN/DM and two generated summary sentences, one factually consistent and one inconsistent. We used QAGS and baseline methods to measure how often each ranks the consistent sentence above the inconsistent one. The results are displayed below; QAGS outperformed all previous NLI-based methods.
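The ranking evaluation itself is simple to state in code. This is a generic sketch of the protocol, with `score` standing in for QAGS or any baseline metric:

```python
def ranking_accuracy(triplets, score):
    """Fraction of (source, consistent, inconsistent) triplets in which the
    consistent sentence receives the strictly higher score."""
    correct = sum(
        score(src, consistent) > score(src, inconsistent)
        for src, consistent, inconsistent in triplets
    )
    return correct / len(triplets)
```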
The QAGS framework provides high interpretability, as the generated questions and answers allow us to directly highlight errors in summaries. The figure below shows example questions and answers. Using them, we can detect several factual inconsistencies in the generated summary; for example, the attacker's name is Usman Khan but was changed to Faisal Khan in the summary. Our QG model generates appropriate questions, and our QA model focuses mainly on named entities and noun phrases. In the future, we could expand the answer candidates to detect other kinds of errors. The second example shows a weakness of QAGS: sometimes two answers are both correct but share few or no tokens, resulting in a false error.
To assess the quality of our generated questions and of the article and summary answers, we manually annotated 400 triplets from the XSUM summaries and labelled them by quality. We found that 8.75% of generated questions are nonsensical and 3% are well-formed but cannot be answered from the generated summary; the large majority of generated questions are therefore understandable and meaningful. A further 8.25% of questions are well-formed but cannot be answered from the source document, largely because the QG model turns nonsensical facts into questions.
A large fraction, 32.50%, of questions are answered incorrectly using the source article, indicating that our QA model is a weak point. Finally, 8% of questions are answered correctly using both the source article and the summary, but the answers share little or no token overlap and are therefore flagged as incorrect.
Conclusion and Future Work
Potential future work includes improving the question answering models and applying the metric to other types of data or tasks, such as translation and image captioning.