Objective and Contribution
We propose an extractive summarisation model guided by question-answering rewards, on the premise that informative summaries should contain the answers to important questions. The generated summaries yielded competitive results as measured by both automatic metrics and human assessors.
Our proposed model can identify and highlight phrases that are important for answering our questions as shown below.
The contributions of the paper are:
Proposed a novel framework of selecting consecutive words from source documents to generate extractive summaries. This involves new encoding mechanisms and sampling techniques
Performed empirical evaluation on information saliency by assessing summary quality with reading comprehension tasks
Our approach is broken into four components:
Representing extraction units
Constructing the extractive summary
Answering questions using the summary
A reinforcement learning framework
We experimented with words or chunks (phrases) as extraction units. We obtain text chunks from the sentence constituent parse tree, and each chunk has at most 5 words. Note that, unlike most existing work, we did not experiment with sentence-level extraction. Instead, we focused on finer-grained extraction units. We experimented with CNN and biLSTM encoders to encode these extraction units.
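The chunking step can be illustrated with a minimal sketch. It assumes a constituent parse is already available as nested Python tuples of the form `(label, child, ...)` with plain strings as leaves. The tree format, the function names `leaves` and `chunk`, and the greedy top-down splitting rule are illustrative assumptions, not the paper's procedure; only the 5-word cap comes from the text above.

```python
MAX_CHUNK_WORDS = 5  # cap stated in the text


def leaves(node):
    """Return the words under a node; a leaf is a plain string."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:  # node[0] is the constituent label
        words.extend(leaves(child))
    return words


def chunk(node, max_words=MAX_CHUNK_WORDS):
    """Split a parse tree top-down into chunks of at most max_words words.

    Assumed rule: keep a constituent whole if it is short enough,
    otherwise recurse into its children.
    """
    words = leaves(node)
    if isinstance(node, str) or len(words) <= max_words:
        return [words]
    chunks = []
    for child in node[1:]:
        chunks.extend(chunk(child, max_words))
    return chunks


# Hypothetical parse of "the young girl ate a red apple in the garden".
tree = (
    "S",
    ("NP", ("DT", "the"), ("JJ", "young"), ("NN", "girl")),
    ("VP",
     ("VBD", "ate"),
     ("NP", ("DT", "a"), ("JJ", "red"), ("NN", "apple")),
     ("PP", ("IN", "in"), ("NP", ("DT", "the"), ("NN", "garden")))),
)
chunks = chunk(tree)
for c in chunks:
    print(" ".join(c))
```

Here the 10-word sentence is too long to be one unit, so it is split into the subject noun phrase, the verb, the object noun phrase, and the prepositional phrase.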
Constructing Extractive Summary
We need to identify text segments from the source article to form our extractive summary, which can be seen as a sequence labelling problem. We use a framework in which the importance of the t-th source extraction unit is determined by its informativeness, its position in the document, and its relationship with the previously selected extraction units. Positional embeddings encode the position of each extraction unit. At each time step, we build a vector representation of the summary up to time t - 1 and use it, along with the positional embeddings and the encoded hidden states, to decide whether to include the new extraction unit. The architecture for this is a unidirectional LSTM as shown below.
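The decision loop described above can be sketched as follows. This is a simplified stand-in, not the paper's architecture: a logistic scorer replaces the LSTM, the summary representation is assumed to be the running mean of the selected unit vectors, and all names (`select_units`, `w_unit`, `w_summary`, `w_pos`) are hypothetical. It only shows how the unit encoding, the position, and the summary so far jointly drive a sampled include/skip label.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def select_units(unit_vecs, pos_vecs, w_unit, w_summary, w_pos, bias=0.0, seed=0):
    """Sample a 0/1 label per extraction unit, left to right."""
    rng = random.Random(seed)
    summary = [0.0] * len(unit_vecs[0])  # summary representation up to t - 1
    n_selected = 0
    labels, probs = [], []
    for h_t, p_t in zip(unit_vecs, pos_vecs):
        # Score combines informativeness, summary so far, and position.
        score = dot(w_unit, h_t) + dot(w_summary, summary) + dot(w_pos, p_t) + bias
        prob = 1.0 / (1.0 + math.exp(-score))
        y_t = 1 if rng.random() < prob else 0  # sample the label
        labels.append(y_t)
        probs.append(prob)
        if y_t:
            # Assumed update: running mean of the selected unit vectors.
            n_selected += 1
            summary = [s + (h - s) / n_selected for s, h in zip(summary, h_t)]
    return labels, probs


# Toy inputs: three units with 2-d encodings and 2-d positional embeddings.
units = [[0.2, 0.1], [0.9, 0.4], [0.1, 0.0]]
positions = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
labels, probs = select_units(
    units, positions, w_unit=[1.0, 1.0], w_summary=[-0.5, -0.5], w_pos=[0.3, -0.3]
)
print(labels, [round(p, 3) for p in probs])
```

Sampling the labels, rather than thresholding the probabilities, is what lets a reinforcement learning objective explore different summaries for the same article.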
Answering Questions using summaries
To create question-answer (QA) pairs, we restrict the answer token to be either a salient word or a named entity. We identify salient words and named entities in every sentence of the human abstract and replace the answer token with a blank to create a Cloze-style QA pair. Note that at least one QA pair should be extracted from each sentence of the abstract, so that our summary includes all the useful content needed to answer all the questions. Overall, we now have a set of QA pairs extracted from the human abstract, and we can train our LSTM and attention mechanism to answer these questions using the source document.
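A minimal sketch of the Cloze construction follows. It assumes the salient words and named entities have already been identified by some external tool and are passed in as a set of single tokens; the function name `make_cloze_pairs`, the blank marker, and whitespace tokenisation are illustrative assumptions.

```python
BLANK = "_____"


def make_cloze_pairs(abstract_sentences, answer_candidates):
    """Create one (question, answer) pair per candidate token in each sentence."""
    pairs = []
    for sentence in abstract_sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token in answer_candidates:
                # Replace the answer token with a blank to form the question.
                question = " ".join(tokens[:i] + [BLANK] + tokens[i + 1:])
                pairs.append((question, token))
    return pairs


# Hypothetical abstract and pre-identified answer tokens.
abstract = ["Obama visited Paris on Monday", "The summit ended early"]
candidates = {"Obama", "Paris", "summit"}
pairs = make_cloze_pairs(abstract, candidates)
for question, answer in pairs:
    print(f"{question}  ->  {answer}")
```

Every sentence of this toy abstract yields at least one pair, which matches the requirement stated above.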
A Reinforcement Learning Framework
Here, we derived a reward function that encourages our models to produce adequate, fluent, concise, and competent summaries that can perform well in our QA tasks. Our reward function has four components:
QA competency. Average log likelihood of correctly answering questions using generated summary
Adequacy. Percentage of overlap unigrams between generated and reference summary
Fluency. Encourages consecutive sequence of words to be selected
Length. Limit the summary size by setting a threshold
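The four components can be combined into a single scalar reward, sketched below. The concrete formulas are assumptions chosen to match the one-line descriptions above (for instance, fluency is modelled as a penalty on switches between selected and unselected units, and length as the deviation from a target selection ratio); the weights and all function names are hypothetical, and the paper's exact definitions may differ.

```python
import math


def qa_reward(answer_probs):
    """Average log-likelihood assigned to the correct answers."""
    return sum(math.log(p) for p in answer_probs) / len(answer_probs)


def adequacy_reward(summary_tokens, reference_tokens):
    """Fraction of reference unigram types that also occur in the summary."""
    ref = set(reference_tokens)
    return len(ref & set(summary_tokens)) / len(ref)


def fluency_reward(labels):
    """Negative count of 0/1 switches, so consecutive selections score higher."""
    return -sum(abs(a - b) for a, b in zip(labels, labels[1:]))


def length_reward(labels, target_ratio):
    """Penalise deviation of the selected fraction from a target ratio."""
    return -abs(sum(labels) / len(labels) - target_ratio)


def total_reward(answer_probs, summary_tokens, reference_tokens, labels,
                 target_ratio=0.5, w_adequacy=1.0, w_fluency=1.0, w_length=1.0):
    """Weighted sum of the four components (weights are assumptions)."""
    return (qa_reward(answer_probs)
            + w_adequacy * adequacy_reward(summary_tokens, reference_tokens)
            + w_fluency * fluency_reward(labels)
            + w_length * length_reward(labels, target_ratio))


# Toy example: six extraction units, four of them selected.
labels = [1, 1, 1, 0, 0, 1]
summary = ["obama", "visited", "paris", "monday"]
reference = ["obama", "went", "to", "paris"]
answer_probs = [0.5, 0.25]  # probability given to each correct answer
reward = total_reward(answer_probs, summary, reference, labels)
print(round(reward, 4))
```

In this toy case the selection has two switches (fluency of -2) and covers half of the reference unigrams (adequacy of 0.5), so the fluency term dominates the total.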
Our evaluation dataset is CNN / Daily Mail, where 83% of summary unigrams and 45% of summary bigrams appear in the source articles. We restricted the article length to 400 words and associated each article with at most 10 QA pairs to guide the extraction of summary segments. Our evaluation metric is ROUGE.
We compared our model with different non-neural, extractive, and abstractive models. The models include:
Hierarchical with attention neural network (word and sentence based – WE and SE)
PG network + coverage
We experimented with different variants of our method. We have a baseline variant that did not use QA pairs during training and three other variants that use different types of QA pairs, for example, where the answer token is the SUBJ/OBJ or NER. Tables 2 and 3 below show the results. Our QASumm with different QA pairs yielded competitive results among the baseline models and outperformed the QASumm variant with no QA pairs. Our model performed at a comparable level to most SOTA results but underperformed the PG network with coverage.
We believe that extracting summary chunks rather than whole sentences is key to building a concise summary, but it does make the summarisation task more challenging as the search space is larger. We also observed that the ROOT-type QA pairs have the smallest number of unique answers. Our QASumm + ROOT performed the best amongst the variants on the Daily Mail dataset and QASumm + NER performed the best on the CNN dataset. We suspect that maintaining a good number of unique answers is important to maximise performance.
In theory, an informative summary should have a high QA accuracy. We compare the summaries generated from QASumm + NoQ, the gold-standard summaries (GoldSumm), NoText (no source article), and FullText (full source article). The results are displayed below. QA with GoldSumm performed the best for all QA types, outperforming even FullText. This suggests that a highly informative summary is more useful for answering questions, as searching for answers in a concise summary can be more efficient. We found that ROOT-type QA pairs can achieve high QA accuracy with NoText input, which suggests that ROOT answers can be predicted from the question context alone. On the other hand, the NER-type QA pairs work best with FullText, which is most likely due to the source texts containing the entities needed to answer the questions. Therefore, we suggest that future work include NER-based QA pairs, as they encourage summaries to contain important information from the source.
We wanted to find out whether words or chunks are better as extraction units. We compared the performance of our LSTM and CNN encoders and found that chunks with LSTM performed the best, and that chunks with CNN outperformed both LSTM and CNN with words.
Each participant is given the document and three fill-in-the-blank questions. The answer token is chosen randomly and can be the root word, the SUBJ/OBJ word, or an NER word. We asked the participants to rate the informativeness of the summary from 1 to 5, with 5 being the most informative. We evaluated the summaries from our models and the PG network. The table below shows the average time taken to complete a single question, the overall accuracy, and the informativeness score. Excluding human performance, our QASumm with NER-type QA pairs achieved the highest accuracy and informativeness. Our best-performing model leads by a wide margin in QA accuracy despite a similar informativeness score.
Conclusion and Future Work
Our deep reinforcement learning approach uses a reward function that encourages adequate and fluent summaries to extract consecutive word sequences from the source document and form an extractive summary.