Objective and Contribution

The objective is to exploit the lead bias already present in news data to pretrain summarisation models on unlabelled datasets: the model is trained to predict the lead sentences from the rest of the article. Lead bias is a common problem in news datasets, where the first few sentences of an article contain the most important information, so models trained on news datasets are biased towards selecting those sentences and ignoring sentences later in the article.


We collected 21.4M articles (June 2016 – June 2019) after filtering on the ratio of overlapping non-stop words between the top 3 sentences and the rest of the article. A high overlap ratio indicates a strong semantic connection between the lead and the rest of the article.

Evaluation is made on three benchmark news summarisation datasets:

  1. New York Times (NYT) corpus – 104K news articles

  2. XSum – 227K news articles

  3. CNN/Daily Mail – 312K news articles


Given a news article, we take the lead-3 sentences as the target summary and use the rest of the article as the news content, as shown in the figure above. This allows us to use unlabelled news datasets to train our summarisation models. This pretraining method can be applied to any dataset with structural bias, for example, academic papers with abstracts or books with tables of contents. However, the pretraining data needs careful examination and cleaning to ensure we have a good target summary for each piece of content.
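The lead-3 split described above can be sketched in a few lines of Python. This is only an illustration: the function names and the naive regex sentence splitter are assumptions made here, not the tooling used in the paper.

```python
import re


def split_sentences(text):
    """Naive splitter (an assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def make_pretraining_pair(article, k=3):
    """Return (source, target), where target is the first k sentences
    and source is the rest of the article. Returns None if nothing is
    left over to serve as the source."""
    sentences = split_sentences(article)
    if len(sentences) <= k:
        return None
    target = " ".join(sentences[:k])
    source = " ".join(sentences[k:])
    return source, target
```

Calling `make_pretraining_pair(article)` on a raw article string gives one (content, summary) training example without any human labelling, which is the point of the method.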


The abstractive summarisation model is a standard transformer encoder-decoder architecture, so we won’t go into the details of the architecture here. Pretraining with unlabelled Lead-3 (PL) followed by finetuning on the target datasets is denoted PL-FT; the same pretraining without finetuning is denoted PL-NoFT.

What’s the data cleaning process?
  1. Remove media agencies, dates and other irrelevant contents using regular expressions

  2. Only keep articles with 10 – 150 words in the lead-3 sentences and 150 – 1200 words in the rest of the article. In addition, remove any articles where the lead-3 sentences are repeated in the rest of the article. This filters out articles that are too long or too short and encourages abstractive summaries.

  3. Remove articles that have “irrelevant” lead-3 sentences. Relevance is computed as the ratio of overlapping words between the lead-3 sentences and the rest of the article. A high overlap ratio means that the lead-3 sentences are a good representative summary of the rest of the article. The threshold ratio is 0.65.
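Steps 2 and 3 above can be sketched as a single filter function. Several details here are assumptions for illustration: the tiny stop-word list, the tokenizer, the choice to measure overlap over distinct non-stop words in the lead-3, and the verbatim-substring check for repeated leads. Only the length limits and the 0.65 threshold come from the text.

```python
import re

# Tiny illustrative stop-word list (an assumption; a real pipeline would use a fuller one).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "and",
              "is", "was", "for", "by", "at", "it"}


def tokens(text):
    """Lower-case word tokens (simple regex tokenizer, an assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def overlap_ratio(lead, rest):
    """Fraction of distinct non-stop words in the lead that also occur in the rest."""
    lead_words = {w for w in tokens(lead) if w not in STOP_WORDS}
    rest_words = {w for w in tokens(rest) if w not in STOP_WORDS}
    if not lead_words:
        return 0.0
    return len(lead_words & rest_words) / len(lead_words)


def keep_article(lead, rest, threshold=0.65):
    """Apply the length limits, the repeated-lead check and the overlap threshold."""
    n_lead, n_rest = len(tokens(lead)), len(tokens(rest))
    if not (10 <= n_lead <= 150 and 150 <= n_rest <= 1200):
        return False
    if lead.strip() in rest:  # simplified check: lead-3 repeated verbatim
        return False
    return overlap_ratio(lead, rest) >= threshold
```

Here `lead` is the lead-3 text and `rest` is the remainder of the article, after the regular-expression clean-up from step 1.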

Model comparison
  • Lead-X: uses the top X sentences as summary (X = 3 for NYT and CNN/DM and X = 1 for XSum)

  • PTGen: pointer-generator network

  • DRM: uses deep reinforcement learning for summarisation

  • TConvS2S: convolutional neural network

  • BottomUp: two-step approach for summarisation

  • SEQ: uses reconstruction and topic loss

  • GPT-2: pretrained language model


The evaluation metrics are the standard ROUGE scores (ROUGE-1, ROUGE-2, and ROUGE-L). The results for all three evaluation datasets are shown in the figures below:

  • The PL-FT model outperformed all baseline models on both the NYT and XSum datasets. On CNN/Daily Mail, it outperformed all except BottomUp

  • PL-NoFT outperformed all the unsupervised models on CNN/Daily Mail by a significant margin. It also performed well on XSum. PL-NoFT is the same model across all three datasets, showcasing its generalisation ability
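For readers unfamiliar with ROUGE, the idea behind ROUGE-1 and ROUGE-2 is n-gram overlap between the generated and the reference summary. The sketch below is a simplified version written for this post, assuming whitespace tokenisation and no stemming; it is not the official ROUGE toolkit and will not reproduce published scores exactly. ROUGE-L, which is based on the longest common subsequence, is left out.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Simplified ROUGE-N: clipped n-gram overlap between candidate and reference."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # counts clipped to the smaller side
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}
```

For example, `rouge_n("the cat sat", "the cat sat down", n=1)` gives a precision of 1.0 and a recall of 0.75, since all three candidate words appear in the four-word reference.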


The summaries generated by both PL-NoFT and PL-FT contain more novel unigrams than the reference summaries. For other n-grams, PL-NoFT has a novelty ratio similar to that of the references, whereas PL-FT has a relatively low novelty ratio after finetuning.
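A novelty ratio of this kind can be computed as the share of n-grams in a summary that never appear in the source article. The sketch below assumes whitespace tokenisation and counts distinct n-grams; the paper may count or tokenise differently, so treat it as an illustration of the idea only.

```python
def ngram_set(tokens, n):
    """Set of distinct n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def novelty_ratio(summary, article, n=1):
    """Fraction of distinct summary n-grams that do not occur in the article."""
    summ = ngram_set(summary.lower().split(), n)
    art = ngram_set(article.lower().split(), n)
    if not summ:
        return 0.0
    return len(summ - art) / len(summ)
```

A higher ratio means the summary uses more wording that is not copied from the article, which is one rough indicator of how abstractive it is.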

Human evaluation

Human evaluation was performed on the summaries generated by the PL models and the pointer-generator network. The scoring system and results are shown below. The results show that both PL-NoFT and PL-FT outperformed the pointer-generator network, which showcases the power of both the pretraining and the finetuning strategy.

Conclusion and Future Work

The paper uses the lead bias that exists in news data to obtain target summaries and pretrain summarisation models. The pretrained model without finetuning achieves SOTA results across different news summarisation datasets, and performance improves further with finetuning. Overall, this pretraining method can be applied to any dataset with structural bias.


