Objective and Contribution
Proposed a new summarisation task: summarising novel chapters from online study guides. This is much more challenging than news summarisation due to the length of the source documents and the higher level of paraphrasing. The contributions of the paper are as follows:
Proposed a new summarisation task of summarising novel chapters
Proposed a new metric for aligning sentences in the reference summary with sentences in the chapter, to create good-quality “ground-truth” extractive summaries for training our extractive summarisation model. This has been shown to improve over previous methods through ROUGE scores and pyramid analysis
We collected chapter/summary pairs from five different study guides.
We performed two rounds of filtering to process the data. Firstly, we removed any reference texts with more than 700 sentences, as they are too long. Secondly, we removed summaries that are too wordy (compression ratio of less than 2). The final dataset contains 8,088 chapter/summary pairs (6,288 / 938 / 862 train / validation / test). The training data statistics are shown below. Chapters are, on average, 7x longer than news articles, and chapter summaries are 8x longer than news summaries. In addition, for novels, the average word overlap between summary and chapter is 33.7%, whereas for CNN/DailyMail news it is 68.7%, showcasing the high level of paraphrasing in chapter summaries. This heavy paraphrasing is shown in the example reference summary below.
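The two filtering rounds can be sketched as a simple predicate over each chapter/summary pair. This is an illustration of the rules described above, not the authors' actual preprocessing code; the function name and thresholds-as-parameters are my own.

```python
def keep_pair(chapter_sentences, summary, max_sents=700, min_compression=2.0):
    """Return True if a chapter/summary pair survives both filters
    described above (a sketch, not the paper's exact code)."""
    # Filter 1: drop overly long chapters (more than 700 sentences).
    if len(chapter_sentences) > max_sents:
        return False
    # Filter 2: drop wordy summaries, i.e. pairs whose
    # chapter-to-summary compression ratio falls below 2.
    chapter_words = sum(len(s.split()) for s in chapter_sentences)
    summary_words = len(summary.split())
    return chapter_words / max(summary_words, 1) >= min_compression
```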
Since the ground-truth summaries are abstractive, we need to create gold extractive summaries to train our extractive summarisation model. This requires us to align sentences in the chapter with sentences in the summary, which in turn requires a metric for measuring sentence similarity. Previous work relies heavily on ROUGE scores as the similarity metric. However, ROUGE assigns equal weight to every word, whereas we believe important words should be weighted more heavily. To incorporate this, we apply a smooth inverse frequency weighting scheme to the average of ROUGE-1, 2, and L, and use the resulting metric (R-wtd) to generate extracts. We compared R-wtd with other similarity metrics: ROUGE-1, ROUGE-L, BERT, and unweighted and weighted ROUGE + METEOR (RM). We conducted both automatic evaluation of these similarity metrics, using ROUGE-L F1, and human evaluation, in which annotators judge each reference summary against its aligned sentences. The results are showcased below: R-wtd scored the highest amongst the similarity metrics.
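To make the weighting idea concrete, here is a minimal sketch of a SIF-weighted ROUGE-1 F1. It assumes the standard smooth-inverse-frequency formula a / (a + p(w)); the paper's exact weighting and its combination of ROUGE-1, 2, and L may differ, and the function names here are illustrative only.

```python
from collections import Counter

A = 1e-3  # SIF smoothing constant; a typical value, assumed here


def sif_weight(word, word_prob):
    # Smooth inverse frequency: frequent words receive low weight,
    # rare (informative) words receive weight close to 1.
    return A / (A + word_prob.get(word, 0.0))


def weighted_rouge1_f(candidate, reference, word_prob):
    """Illustrative SIF-weighted ROUGE-1 F1 between two token lists."""
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum(min(cand[w], ref[w]) * sif_weight(w, word_prob) for w in cand)
    cand_mass = sum(c * sif_weight(w, word_prob) for w, c in cand.items())
    ref_mass = sum(c * sif_weight(w, word_prob) for w, c in ref.items())
    if cand_mass == 0 or ref_mass == 0:
        return 0.0
    p, r = overlap / cand_mass, overlap / ref_mass
    return 2 * p * r / (p + r) if p + r else 0.0
```

Under this scheme, matching a rare content word moves the score far more than matching a function word like "the", which is the intended effect of weighting important words more heavily.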
Once we have established our similarity metric, we explore different alignment methods to generate our gold extractive summaries. There are two main methods from previous work:
Summary-level alignment. Selecting the best chapter sentences by comparing them against the whole summary
Sentence-level alignment. Selecting the best chapter sentence for each individual sentence in the summary
For summary-level alignment, we have two variations: selecting sentences until a word limit is reached (WL), and selecting sentences until the ROUGE score no longer increases (WS). For sentence-level alignment, we have two variations: the Gale-Shapley stable matching algorithm and a greedy algorithm. The results are displayed below and show that sentence-level stable matching performed significantly better than the other alignment methods.
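The winning sentence-level stable variant can be sketched with the classic Gale-Shapley procedure, treating summary sentences as proposers and chapter sentences as acceptors, with the similarity score as the mutual preference. This is a generic illustration of stable matching over a score matrix, not the paper's implementation; it assumes at least as many chapter sentences as summary sentences.

```python
def stable_align(scores):
    """Gale-Shapley stable matching: scores[i][j] is the similarity
    between summary sentence i and chapter sentence j.  Assumes
    len(scores[0]) >= len(scores) so every proposer can be matched.
    Returns a dict mapping summary index -> aligned chapter index."""
    n_sum, n_ch = len(scores), len(scores[0])
    # Each summary sentence ranks chapter sentences by similarity.
    prefs = [sorted(range(n_ch), key=lambda j: -scores[i][j]) for i in range(n_sum)]
    next_choice = [0] * n_sum  # next preference each proposer will try
    engaged = {}               # chapter index -> summary index
    free = list(range(n_sum))
    while free:
        i = free.pop()
        j = prefs[i][next_choice[i]]
        next_choice[i] += 1
        if j not in engaged:
            engaged[j] = i
        elif scores[engaged[j]][j] < scores[i][j]:
            # Chapter sentence j prefers the new proposer; free the old one.
            free.append(engaged[j])
            engaged[j] = i
        else:
            free.append(i)  # rejected; i will propose to its next choice
    return {i: j for j, i in engaged.items()}
```

The greedy variant would instead scan all (summary, chapter) pairs in descending score order and fix each match immediately; stable matching differs in that no matched pair would both prefer to swap partners.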
Experiments and Results
For evaluation, we have three extractive models:
Hierarchical CNN-LSTM (CB)
Seq2seq with attention (K)
We experiment with alignment methods applied at both the sentence and constituent level, since our data analysis shows that summary sentences often combine content drawn from several different chapter sentences. Our evaluation metrics are ROUGE-1, 2, L, and METEOR. Each chapter has 2–5 reference summaries, and we evaluate our generated summaries against all of them.
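Evaluating against multiple references requires an aggregation rule. The notes above do not say which one the paper uses, so the `max` below (score against each reference, keep the best match) is only an assumption; it is a common multi-reference convention, and averaging would be an equally simple alternative.

```python
def multi_ref_score(candidate, references, score_fn):
    """Score a generated summary against several reference summaries.

    Taking the best per-reference score is one common convention
    (an assumption here, not stated in the notes above)."""
    return max(score_fn(candidate, ref) for ref in references)
```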
The results above compare the performance of the three extractive models, as well as the performance difference between alignment methods. We can see that our proposed alignment method outperforms the baseline method in all three extractive models. All three models perform similarly when trained on our extractive targets, suggesting that selecting an appropriate method for generating extractive targets matters more than the choice of model. Given the unreliability of ROUGE, we also perform human evaluation and compute the pyramid score of each alignment method on our best-performing model (CB). Crowd workers are asked to identify which generated summary best conveys the content of a sampled reference summary. The results are displayed below.
Conclusion and Future Work
We have shown that a sentence-level, stable-matching alignment method with the R-wtd similarity metric performs better than previous methods of computing gold extractive summaries. However, there seems to be a contradiction between the automatic and human evaluation over whether extraction is better at the sentence or constituent level. We speculate that this is because we did not include the additional context when scoring the summaries of extracted constituents, so irrelevant context did not count against the system; in the human evaluation, we do include sentence context, and so fewer constituents are included in the generated summary.
In future work, we plan to examine how to combine constituents into fluent sentences without including irrelevant context. We would also like to explore abstractive summarisation, to examine whether language models would be effective in our domain. This could be challenging, as language models typically have a limit of 512 tokens; truncating our documents might hurt the performance of our novel chapter summarisation model.