Objective and Contribution

Created two multi-sentence summarisation datasets from scientific articles: the title-abstract pairs (title-gen) and abstract-body pairs (abstract-gen) and applied a wide range of extractive and abstractive models to it. The title-gen dataset consists of 5 million biomedical papers whereas the abstract-gen dataset consists of 900K papers. The analysis show that scientific papers are suitable for data-driven summarisation.

What is data-driven summarisation?

It is a way of saying the recent SOTA results of summarisation models rely heavily on large volume of training data.


The two evaluation datasets are title-gen and abstract-gen. Title-gen was constructed using MEDLINE and abstract-gen was conducted using PubMed. The title-gen pairs the abstract to the title of the paper whereas the abstract-gen dataset pairs the full body (without tables and figures) to the abstract summary. The text processing pipeline is as follows:

  1. Tokenisation and lowercase

  2. Removal of URLs

  3. Numbers are replaced by # token

  4. Only include pairs with abstract length 150 – 370 tokens, title length 6 – 25 tokens and body length 700 – 10000 tokens

We also computed the Overlap score and Repeat score for each data pairs. The Overlap score measures the overlapping tokens between the summary (title or abstract) and the input text (abstract or full body). The Repeat score measures the average overlap of each sentence in a text with the remainder of the text. This is to measure the repetitive content that exists in the body text of a paper where the same concepts are repeated over and over again. Below are the summary statistics of both datasets.

Experimental Setup and Results

Models Comparison
  1. Extractive summarisation methods. Two unsupervised baselines here: TFIDF-emb and rwmd-rank. TFIDF-emb creates sentence representation by computing a weighted sum of its constituent word embeddings. Rwmd-rank ranks sentences by how similar the sentence is compared to all the other sentences in the document. Rwmd stands for Relaxed Word Mover’s Distance, which it’s the formula used to compute similarity and subsequently LexRank is used to rank the sentences.

  2. Abstractive summarisation methods. Three baselines here: lstm, fconv, and c2c. Lstm is the common LSTM encoder-decoder model but with an attention mechanism at the word-level. Fconv is a CNN encoder-decoder on subword-level, separating words into smaller units using byte-pair encoding (BPE). Character-level models are good at dealing with rare / out-of-vocabulary (OOV) words. C2c is a character-level encoder-decoder model. It builds character representations from the input using CNN and feed it into an LSTM encoder-decoder model.


The evaluation metrics are ROUGE scores, METEOR score, Overlap score and Repeat score. Despite the weaknesses of ROUGE scores, they are common in summarisaiton. METEOR score are used for machine translation and Overlap score can measure to what extent the models just copy text directly from input text as summary. Repeat score can measure how often the summary contains repeated phrases, which it’s a common problem in abstractive summarisation.

For title-gen results (table 2), rwmd-rank is the best extractive model, however, c2c (abstractive model) outperformed all extractive models by a large margin, including the oracle. Both c2c and fconv achieved similar results with similar high overlap scores. For abstract-gen results (table 3), lead-10 was a strong baseline and only extractive models managed to outperformed it. All extractive models achieved similar ROUGE scores with similar Repeat score. Abstractive models performed poorly based on ROUGE scores but outperformed all models in terms of METEOR score so it was difficult to draw up conclusion.

Qualitative evaluation is common and conducted on the generated summary. See below an example of the title-gen qualitative evaluation. The observations are as follow:

  1. Large variation of sentence locations selected by extractive models on title-gen, with first sentence in the abstract being the most important

  2. Many abstractive generated titles tend to be of high quality, demonstrating their ability to select important information

  3. Lstm tends to generate more novel words whereas c2c and fconv tend to copy more from input text

  4. The generated titles occasionally make mistakes by using incorrect words, being too generic and fail to capture the main point of the paper. This could all lead to factual inconsistencies

  5. For abstract-gen, it appears that introduction and conclusion sections are most relevant for generating abstract. However, important content are spread across sections and sometimes the reader focuses more about the methodology and results

  6. Output of fconv abstractive model is of bad quality where it lacks coherent and content flow. There is also the common problem of repeated sentence or phrases in the summary

Conclusion and Future Work

There was a mixed results where the models performed well in title generation but struggled with abstract generation. This can be explained by the high-level of difficulty in understanding long input and output sequences. A future work is a hybrid extractive-abstractive end-to-end approaches.



Data Scientist

Leave a Reply