Objective and Contribution
We introduced CSPubSum, a summarisation dataset of roughly 10K computer science papers. We also developed multiple models for the dataset and observed that models which capture both local and global context perform best. We introduced a new summarisation feature, AbstractROUGE, which improves summarisation performance, and a metric, HighlightROUGE, which can be used to extend the dataset.
CSPubSum and CSPubSumExt
We created an extractive summarisation dataset of 10,148 computer science publications. The publications were collected from ScienceDirect and are grouped into 27 domains. Each paper in the dataset has a title, an abstract, author-written highlight statements and author-defined keywords. The highlight statements serve as the gold summary. See the figure below for an example.
We created two versions of the dataset: CSPubSum and CSPubSumExt. The summary statistics of the two datasets are shown in the figure below. CSPubSum consists of positive and negative examples for each paper. The positive examples are the highlight statements, whereas the negative examples are randomly sampled from the bottom 10% of sentences ranked by ROUGE-L. The test set consists of 150 full papers and is the set we used to evaluate summary quality. For CSPubSumExt, we used HighlightROUGE to find sentences in the full paper that are similar to the highlights. This extends the dataset to 263K training instances and 131K test instances.
HighlightROUGE and AbstractROUGE
HighlightROUGE is used to generate more training data. It takes the gold summary and the text of a research paper and finds the sentences that yield the best ROUGE-L score with respect to the highlights. We selected the top 20 sentences as positive instances and the bottom 20 as negative instances. Note that we excluded sentences from the abstract, as the abstract is already a summary.
AbstractROUGE is a new summarisation feature that measures the ROUGE-L score between a sentence and the abstract. The idea is that sentences which are good summaries of the abstract are also likely to be good summaries of the highlights.
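Both HighlightROUGE and AbstractROUGE reduce to computing ROUGE-L, the longest-common-subsequence F-score, between a candidate sentence and a reference text. A minimal sketch, assuming whitespace tokenisation and lowercasing (the paper's exact preprocessing is not specified; function names are illustrative):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, 1):
        for j, tok_b in enumerate(b, 1):
            if tok_a == tok_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

def rouge_l(candidate, reference, beta=1.0):
    """ROUGE-L F-score between two strings (simple whitespace tokenisation)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return ((1 + beta ** 2) * precision * recall) / (recall + beta ** 2 * precision)

# AbstractROUGE is then simply a sentence's ROUGE-L score against the abstract
# (made-up example text):
abstract = "we propose a supervised approach to extractive summarisation"
sentence = "our approach to extractive summarisation is supervised"
abstract_rouge = rouge_l(sentence, abstract)
```

HighlightROUGE uses the same score, but against the highlight statements, and then ranks a paper's sentences by it to pick the top and bottom 20.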
We experimented with two different sentence-encoding methods: averaged word embeddings and RNN encoding. We also selected 8 summariser features (AbstractROUGE above plus the 7 below) to help encode the local and global context of each sentence:
Location. We assign an integer location based on 7 sections of the paper: Highlight, Abstract, Introduction, Results / Discussion / Analysis, Method, Conclusion, and everything else
Numeric Count. Measure the number of numbers in a sentence. The idea is that maths-heavy sentences are unlikely to be good summaries
Title Score. Measure the overlap between the non-stopwords of each sentence and the title of the paper
Keyphrase Score. Measure how many author-defined keywords appear in the sentence. The idea is that important sentences will contain more keywords
TFIDF. TFIDF was calculated for each word and averaged over the sentence. We ignored stopwords
Document TFIDF. Same as TFIDF, except the count of a word within the sentence is the term frequency and the rest of the paper serves as the background corpus, which allows us to measure how important a word is in a sentence relative to the rest of the document
Sentence Length. The idea is that short sentences are very unlikely to be good summaries
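A few of these features are simple enough to sketch directly. The snippet below is illustrative only: the stopword list, tokenisation, and exact definitions are assumptions, not the paper's implementation.

```python
import re

# Placeholder stopword list; the paper's actual list is not specified.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "we"}

def tokens(text):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def title_score(sentence, title):
    """Fraction of the sentence's non-stopwords that also appear in the title."""
    sent = [t for t in tokens(sentence) if t not in STOPWORDS]
    if not sent:
        return 0.0
    title_set = set(tokens(title))
    return sum(t in title_set for t in sent) / len(sent)

def keyphrase_score(sentence, keyphrases):
    """Number of author-defined keyphrases that occur in the sentence."""
    lowered = sentence.lower()
    return sum(kp.lower() in lowered for kp in keyphrases)

def numeric_count(sentence):
    """Number of numeric tokens; maths-heavy sentences rarely summarise well."""
    return sum(t.isdigit() for t in tokens(sentence))
```

Each sentence's feature vector would then concatenate these with the location, TFIDF, Document TFIDF, sentence length, and AbstractROUGE values.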
Our models could take in any combination of the four possible inputs:
The sentence encoded with RNN (S)
Vector representation of the abstract (A)
The 8 features from previous section (F)
Average non-stopword word embeddings in the sentence (Word2Vec)
We experimented with 7 different models as listed below:
Single Feature Models. Models that use only one feature each (we excluded sentence length, numeric count, and section)
FNet. A single layer NN that uses all 8 features to classify each sentence
Word2Vec and Word2VecAF. Both are single-layer networks: Word2Vec takes the average sentence vector, while Word2VecAF takes the average sentence vector, the average abstract vector, and the handcrafted features
SNet. Feeds the sentence vectors into a bidirectional RNN with LSTM cells
SFNet. Processes the sentence with an LSTM and passes the output to a fully connected layer with dropout. The handcrafted features are fed as a separate input to their own fully connected layer. The outputs of the LSTM and the feature hidden layer are concatenated to produce the binary prediction
SAFNet. Extends SFNet by also encoding the abstract. This is shown in the figure below.
SAF + F and S + F Ensemblers. The ensemble methods use a weighted average of the outputs of two models: SAF + F ensembles SAFNet and FNet, and S + F ensembles SNet and FNet
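The ensembling step itself is a straightforward weighted average of the two models' positive-class probabilities, followed by ranking. A minimal sketch, with an illustrative equal weighting (the paper's exact weights are not reproduced here):

```python
def ensemble_score(p_model_a, p_model_b, weight_a=0.5):
    """Weighted average of two models' positive-class probabilities."""
    return weight_a * p_model_a + (1 - weight_a) * p_model_b

def summarise(sentences, scores_a, scores_b, k=10, weight_a=0.5):
    """Rank sentences by ensembled score and keep the top k as the summary."""
    combined = [
        (ensemble_score(a, b, weight_a), s)
        for s, a, b in zip(sentences, scores_a, scores_b)
    ]
    combined.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in combined[:k]]
```

For SAF + F, `scores_a` would come from SAFNet and `scores_b` from FNet; for S + F, from SNet and FNet respectively.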
Most relevant sections to a summary
First, we want to understand which sections contribute the most to the gold summary. To do this, we compute the ROUGE-L score of each sentence against the gold summary and average the sentence-level ROUGE-L scores by section. In addition, there are many occurrences where authors copy sentences from the main text directly into the highlight statements. The ROUGE and Copy/Paste scores are captured in the figure below. The title has the highest ROUGE score, which is expected. Surprisingly, however, the introduction has the third-lowest ROUGE score yet the second-highest Copy/Paste score. We believe this contradiction is due to the length of the section: the introduction is long, which hurts its average ROUGE score because it contains many sentences that make poor summaries, but its many sentences also offer more candidates to be used as highlights, as demonstrated by the high Copy/Paste score.
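The section analysis amounts to grouping precomputed sentence-level ROUGE-L scores by the section each sentence came from and averaging within each group. A small sketch of that aggregation (the section names and scores below are made up):

```python
from collections import defaultdict

def avg_rouge_by_section(scored_sentences):
    """scored_sentences: iterable of (section_name, rouge_l_score) pairs.

    Returns a dict mapping each section to its mean sentence-level score.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for section, score in scored_sentences:
        totals[section][0] += score
        totals[section][1] += 1
    return {sec: total / count for sec, (total, count) in totals.items()}

section_scores = avg_rouge_by_section([
    ("Introduction", 0.2),
    ("Introduction", 0.4),
    ("Conclusion", 0.6),
])
```

A long section like the introduction accumulates many low-scoring sentences, dragging its average down even when it contains several highlight-worthy ones.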
Model performance and error analysis
Figure 3 below shows the ROUGE-L scores of the different models. Our ensemble models significantly outperformed the baseline models, showcasing the effectiveness of our sentence encoding and features.
In Figure 4 below, we compare the performance of all the models developed in this paper. We found that architectures which use both sentence encoding and our handcrafted features performed best on both ROUGE score and test set accuracy. The LSTM outperformed average word embeddings, which tells us that the ordering of words in a sentence matters. Another observation is that the highest accuracy does not translate into the highest ROUGE score, although the two are strongly correlated: SAFNet achieved the highest accuracy on CSPubSumExt but underperformed the AbstractROUGE summariser on CSPubSum. We manually examined 100 sentences from CSPubSumExt that SAFNet misclassified. The primary causes of false positives were lack of context and long-range dependencies; other causes include mislabelled data and maths-heavy sentences.
The primary causes of false negatives are mislabelled data and failure to recognise entailment, observations or conclusions. Overall, high accuracy does not equate to high ROUGE scores, most likely due to overfitting to mislabelled examples in the training data.
Effect of using ROUGE-L to generate more data
Figure 5 below shows the performance difference of three selected models trained on CSPubSumExt (full data) versus CSPubSum (low data). Across all three models we see a consistent improvement when training on the full data, suggesting that extending the training data using ROUGE-L does improve summarisation performance.
Effect of AbstractROUGE metric on summariser performance
Figure 6 below shows the performance of 4 models trained with and without AbstractROUGE. We observed that AbstractROUGE does improve the performance of summarisation techniques, and that combining sentence encoding with feature engineering leads to a more stable model.
Conclusion and Future Work
Potential future work involves developing models that better capture global context and cross-sentence dependencies.