Objective and Contribution
We propose SUPERT, an unsupervised metric for evaluating multi-document summaries that measures the semantic similarity between a summary and a pseudo reference summary. The pseudo reference is built by selecting salient sentences from the source documents, and similarity is measured using contextualised embeddings and soft token alignment. SUPERT correlates with human ratings 18–39% better than existing unsupervised metrics. We also used SUPERT as the reward for a reinforcement learning summariser, and it performed strongly in comparison to SOTA unsupervised summarisers. This showcases the effectiveness of SUPERT, and it also means that we can create reference summaries from an unlimited number of documents to increase the size of datasets.
Datasets and Evaluation Metrics
We used two multi-document summarisation datasets: TAC’08 and TAC’09. Each TAC dataset contains 45+ topics, and each topic has ten news articles, four reference summaries and 55+ machine-generated summaries. We evaluate each metric by how well it correlates with human ratings, using three correlation coefficients: Pearson’s, Spearman’s and Kendall’s. We compare the following metrics:
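JS divergence. Measures the Jensen-Shannon divergence between the word distributions of the source documents and the summary.
Cosine-ELMo. Measures the cosine similarity between contextualised (ELMo) word embeddings of the source documents and the summary.
ROUGE-1, ROUGE-2 and MoverScore. Reference-based metrics that serve as the upper bound on performance.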
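As an illustration of the JS divergence baseline, here is a minimal sketch that compares the word distributions of a source text and a summary. The whitespace tokenisation, the base-2 logarithm and the function names are assumptions made for this example, not details taken from the paper.

```python
import math
from collections import Counter


def word_distribution(text, vocab):
    """Return the relative frequency of each vocabulary word in `text`."""
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocab)
    return [counts[w] / total if total else 0.0 for w in vocab]


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(source, summary):
    """Jensen-Shannon divergence between two whitespace-tokenised texts."""
    # Assumption: lowercased whitespace tokens stand in for a real tokeniser.
    vocab = sorted(set(source.lower().split()) | set(summary.lower().split()))
    p = word_distribution(source, vocab)
    q = word_distribution(summary, vocab)
    # The mixture m is non-zero wherever p or q is, so the logs are defined.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

With a base-2 logarithm the score lies between 0 (identical word distributions) and 1 (no shared words), so a lower divergence suggests a summary whose vocabulary is closer to the source.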
SUmmarisation evaluation with Pseudo references and bERT (SUPERT)
SUPERT measures the relevance of multi-document summaries, that is, how much of the salient information in the source documents is included in the summary. We measure relevance in two steps:
Identify salient sentences from the source documents to build a pseudo reference
Measure the semantic overlap between the pseudo reference (step 1) and the generated summary
The results table below shows that all the baseline methods performed significantly below the upper bound. Surprisingly, the embedding-based methods performed worse than the lexicon-based methods. This tells us that existing single-document evaluation metrics are ineffective at evaluating multi-document summaries.
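To make the evaluation protocol concrete, the sketch below computes the three correlation coefficients between a metric's scores and human ratings. It is a pure-Python illustration; the function names are hypothetical, and Kendall's tau-a is used here because the text does not say which Kendall variant was reported.

```python
import math


def pearson(x, y):
    """Pearson's r: covariance divided by the product of the spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            score += (prod > 0) - (prod < 0)
    return score / (n * (n - 1) / 2)
```

In this setting, `x` would hold a metric's scores for the summaries of one topic and `y` the corresponding human ratings; a metric is better the closer all three coefficients are to 1.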
Measuring Similarity with Contextualised Embeddings
We extended Cosine-ELMo by exploring different text encoders, namely BERT, RoBERTa, ALBERT and SBERT, with cosine similarity. The results are displayed below. Among these encoders, SBERT with cosine similarity correlated best with human ratings, but it still performed poorly against the lexicon-based methods. Another extension we explored is using word mover’s distance (WMD) instead of cosine similarity to measure the semantic similarity between two documents. Previous work has shown that WMD yields stronger performance, and our results below support this: WMD with SBERT (M_SBERT) significantly outperformed its cosine similarity counterpart and all the lexicon-based methods. This led us to our final method for computing semantic similarity between documents, which is to use SBERT with WMD.
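The difference between the two similarity measures can be illustrated with toy token vectors. The sketch below contrasts the cosine of averaged token vectors with a relaxed word mover's distance, in which every token simply moves to its nearest counterpart. This relaxed variant is a simplification chosen to keep the example short; it is not the full optimal-transport WMD and not the authors' implementation, and the uniform token weights are an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine_of_means(doc_a, doc_b):
    """Document-level similarity: cosine of the averaged token vectors."""
    return cosine(mean_vector(doc_a), mean_vector(doc_b))


def relaxed_wmd(doc_a, doc_b):
    """Relaxed WMD with uniform token weights (lower is more similar)."""
    def one_way(src, dst):
        # Each source token moves all of its mass to its nearest target token.
        return sum(min(1.0 - cosine(s, d) for d in dst) for s in src) / len(src)

    # Take the tighter of the two one-directional lower bounds.
    return max(one_way(doc_a, doc_b), one_way(doc_b, doc_a))


# Toy 2-d "token embeddings" (hypothetical values, not real SBERT output).
doc_a = [[1.0, 0.0], [0.0, 1.0]]
doc_c = [[1.0, 1.0]]
```

Here `cosine_of_means(doc_a, doc_c)` is 1 because averaging hides the token-level mismatch, while `relaxed_wmd(doc_a, doc_c)` stays above 0. This is one intuition for why token-level alignment can separate documents that averaged embeddings cannot.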
Building Pseudo References
The previous tables show a large gap in performance between unsupervised evaluation and reference-based evaluation. This suggests that reference summaries are still needed, so we explore different methods of building pseudo references.
Firstly, we explored two simple strategies to establish baseline results: choosing N random sentences or the top N sentences. The results are displayed below. They show that randomly selected sentences perform poorly, whereas using the top 10–15 sentences as the pseudo reference outperformed both the lexicon-based methods and our M_SBERT method. This also illustrates the position bias in news articles, where the most important information tends to appear first.
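The two baseline strategies are simple enough to sketch directly. In the example below each document is a list of sentences in their original order. Sampling the random sentences per document (rather than across the whole topic) and the function names are assumptions for illustration.

```python
import random


def top_n_reference(documents, n):
    """Take the first n sentences of every document, in document order."""
    return [sentence for doc in documents for sentence in doc[:n]]


def random_n_reference(documents, n, seed=0):
    """Take n sentences sampled uniformly from each document.

    Assumption: sampling is done per document; the fixed seed only makes
    the sketch reproducible.
    """
    rng = random.Random(seed)
    reference = []
    for doc in documents:
        reference.extend(rng.sample(doc, min(n, len(doc))))
    return reference
```

For a topic with ten articles, `top_n_reference(documents, 10)` would yield a pseudo reference of up to 100 leading sentences, which is then compared with the summary using the similarity measure of choice.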
Secondly, we explored two graph-based approaches to building pseudo references: position-agnostic and position-aware graphs. For position-agnostic graphs, we extended LexRank to use SBERT (SLR) when measuring the cosine similarity between sentences. We also explored the affinity propagation clustering algorithm (SC), which clusters the sentences and selects the centre of each cluster to build the pseudo reference. This clustering algorithm does not require the number of clusters to be set in advance. For SLR and SC, we have two variations: individual graph and global graph. The individual graph builds a graph for each source document and selects the top K sentences. The global graph builds a single graph from all the sentences of all the source documents of the same topic and selects the top M sentences.
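The individual and global variants of a LexRank-style selector can be sketched as follows. This is a simplified illustration under several assumptions: sentence embeddings are passed in as plain non-zero vectors (a stand-in for SBERT output), the similarity threshold, damping factor and iteration count are arbitrary defaults, and the function names are hypothetical. Affinity propagation (SC) is not covered here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def lexrank_scores(embeddings, threshold=0.1, damping=0.85, iterations=50):
    """PageRank-style centrality on a thresholded cosine-similarity graph."""
    n = len(embeddings)
    # Edge weight = cosine similarity, dropped when below the threshold.
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                sim = cosine(embeddings[i], embeddings[j])
                if sim >= threshold:
                    weights[i][j] = sim
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            # Sentences without edges pass on no score (a simplification).
            incoming = sum(scores[i] * weights[i][j] / out_sum[i]
                           for i in range(n) if out_sum[i] > 0)
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


def select_top(sentences, embeddings, k):
    """Return the k most central sentences, kept in their original order."""
    scores = lexrank_scores(embeddings)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]


def individual_graph(docs, k):
    """One graph per document; docs is a list of (sentences, embeddings)."""
    return [s for sents, embs in docs for s in select_top(sents, embs, k)]


def global_graph(docs, m):
    """One graph over all sentences of all documents of a topic."""
    sents = [s for doc_sents, _ in docs for s in doc_sents]
    embs = [e for _, doc_embs in docs for e in doc_embs]
    return select_top(sents, embs, m)
```

The only difference between the two variants is where the graph boundary is drawn: `individual_graph` ranks sentences against others in the same document, while `global_graph` lets sentences from different documents reinforce each other.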
For position-aware graphs, we extended PacSum to use SBERT (SPS) when measuring sentence similarity and, as before, considered both individual-graph and global-graph versions. PacSum selects sentences that are semantically central, meaning they have high average similarity with succeeding sentences and low average similarity with preceding sentences. In addition, we proposed Top + Clique (TC), which selects both the top N sentences and semantically central sentences to build pseudo references. Here’s how TC works:
Label the top N sentences from each document as salient
Build a graph that connects highly similar non-top-N sentences
Identify the cliques in the graph and select the semantically central sentence from each clique as a potential salient sentence
For each potential salient sentence, compare it to the top N sentences and label it as salient if it is not highly similar to any of them
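The four TC steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation. The following are assumptions: embeddings are supplied as plain non-zero vectors, "highly similar" means cosine similarity at or above a hypothetical threshold of 0.75, cliques are maximal cliques found with Bron-Kerbosch, isolated sentences are ignored, and the "semantically central" member is the one with the highest total similarity to the rest of its clique.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def maximal_cliques(adj):
    """Bron-Kerbosch enumeration of maximal cliques; adj maps node -> set."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def top_plus_clique(docs, n, threshold=0.75):
    """docs: list of documents, each a list of (sentence, embedding) pairs."""
    # Step 1: the first n sentences of each document are salient.
    top = [pair for doc in docs for pair in doc[:n]]
    rest = [pair for doc in docs for pair in doc[n:]]

    # Step 2: connect highly similar non-top-n sentences.
    adj = {i: set() for i in range(len(rest))}
    for i in range(len(rest)):
        for j in range(i + 1, len(rest)):
            if cosine(rest[i][1], rest[j][1]) >= threshold:
                adj[i].add(j)
                adj[j].add(i)

    salient = [sentence for sentence, _ in top]
    for clique in maximal_cliques(adj):
        if len(clique) < 2:  # assumption: ignore isolated sentences
            continue
        # Step 3: pick the member most similar, in total, to the others.
        centre = max(sorted(clique), key=lambda i: sum(
            cosine(rest[i][1], rest[j][1]) for j in clique if j != i))
        # Step 4: keep it only if no top-n sentence is highly similar to it.
        if all(cosine(rest[centre][1], emb) < threshold for _, emb in top):
            if rest[centre][0] not in salient:
                salient.append(rest[centre][0])
    return salient
```

The final filter is what keeps TC from re-adding content already covered by the leading sentences: a clique of near-duplicates of a lead sentence contributes nothing, while a clique about a topic the leads do not mention contributes its central sentence.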
The table below shows the results of the position-agnostic and position-aware graphs. All the methods (except SC_G) outperformed the baseline models in table 1 above. The position-agnostic graphs underperformed the position-aware graphs, and the position-aware graphs in turn underperformed the simple strategy of selecting the top N sentences in table 3. This shows that position bias is very strong in news articles and that selecting the leading sentences remains the most effective way of identifying salient information.
Guiding Reinforcement Learning
We use our new unsupervised evaluation metric to guide the training of an RL-based multi-document summariser, Neural Temporal Difference (NTD). We considered three unsupervised reward functions: JS, REAPER and SUPERT (SP). SUPERT selects the top 10–15 sentences from each source document as the pseudo reference and uses SBERT to measure the semantic similarity between summaries and the pseudo reference. The results are shown below; NTD with SUPERT yielded the strongest results.