Objective and Contribution

We introduce the TLDR generation task and SCITLDR, a new extreme summarisation dataset that researchers can use to train models to generate TLDRs for scientific papers. We also introduce an annotation protocol for creating additional ground-truth summaries from peer review comments, which lets us scale the dataset and, for the first time, links multiple summaries to a single source document. Lastly, we propose a multi-task training strategy based on TLDR and title generation to adapt the pre-trained language model BART. This strategy is shown to outperform extractive and abstractive baselines.

Introduction to TLDR Generation Task

The TLDR generation task aims to generate TLDRs that leave out background or methodology details and focus on key aspects such as the contributions of the paper. This requires the model to have background knowledge as well as the ability to understand domain-specific language. The figure below shows an example of the TLDR task, along with the categories of information that appear in TLDRs.


SCITLDR has 3935 TLDRs for computer science papers. It includes TLDRs written by the original authors of the papers and TLDRs derived from peer reviews. The key difference is that the peer-review TLDRs are written from reviewer comments rather than from the original research paper. This method assumes readers have enough background knowledge to follow the general research area, so our TLDRs can leave out common concepts. In addition, the reviewer comments are written by experts in the field, so they are high-quality summaries. The figure below shows an example of the annotation process.

One unique feature of SCITLDR is that each paper in the test set is mapped to multiple ground-truth TLDRs, one written by the original author and the rest written from peer reviews. This a) allows us to better evaluate generated summaries, as there are now multiple ground-truth summaries to compute ROUGE scores against, and b) captures the variation in summaries that comes from the reader's perspective, since we have both the author's and the readers' TLDRs.

Dataset Analysis

First, SCITLDR is a much smaller dataset, with only 3.2K papers, because data collection and annotation were done manually. Second, SCITLDR has an extremely high compression ratio compared to other datasets. The average document length is 5009 and it is compressed into an average summary length of 19, which makes the summarisation very challenging. Table 3 shows these summary statistics. SCITLDR has at least two ground-truth TLDRs for each paper in the test set, so we investigate the ROUGE score difference between different ground-truth TLDRs. There is a low ROUGE-1 overlap (27.40) between author-generated TLDRs and PR-generated TLDRs. Author-generated TLDRs have a ROUGE-1 of 34.1 with the title of the paper, whereas PR-generated TLDRs have a ROUGE-1 of only 24.7. This shows the importance of multiple ground-truth TLDRs in summarisation, as one source document can have multiple relevant summaries.
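As a quick check of the compression ratio implied by the averages above, the arithmetic can be sketched as follows. The variable names are illustrative, and the two lengths are simply the figures quoted in the text.

```python
# Average lengths as quoted in the text above.
avg_document_length = 5009
avg_summary_length = 19

# Compression ratio = how many times longer the source is than its summary.
compression_ratio = avg_document_length / avg_summary_length
print(f"compression ratio: {compression_ratio:.1f}")
```

On these numbers the source document is, on average, a little over 260 times longer than its TLDR.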

Experimental Setup and Results

Model Training

We fine-tuned a BART model to generate TLDRs. However, there are a few limitations. The first is the size of our training data: SCITLDR is a small dataset for training neural networks. This led us to collect an additional 20K paper-title pairs from arXiv and to upsample SCITLDR to match the new volume. We collect titles because a title often contains important information about the paper, and we believe that if we also train the model to perform title generation, it will learn how to select important information from the paper. With the new data, we are ready to train our model. First, we train a BART-large model on XSUM, an extreme summarisation dataset in the general news domain. Then, we fine-tune this BART model on our SCITLDR and title datasets.
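The upsampling and mixing step described above could look roughly like the sketch below. This is only an illustration under assumptions: the function name, the `(source, target)` pair format, and the `"tldr"` / `"title"` task labels are hypothetical and are not taken from the original work.

```python
import random

def build_multitask_training_set(tldr_pairs, title_pairs, seed=0):
    """Mix TLDR and title examples, upsampling the smaller TLDR set.

    Both arguments are lists of (source_text, target_text) pairs.
    The task labels "tldr" and "title" are hypothetical markers.
    """
    if not tldr_pairs:
        raise ValueError("tldr_pairs must not be empty")
    rng = random.Random(seed)

    # Upsample with replacement until the TLDR set matches the title set.
    upsampled = list(tldr_pairs)
    while len(upsampled) < len(title_pairs):
        upsampled.append(rng.choice(tldr_pairs))

    # Tag each example with its task so one model can be trained on both.
    mixed = [("tldr", src, tgt) for src, tgt in upsampled]
    mixed += [("title", src, tgt) for src, tgt in title_pairs]
    rng.shuffle(mixed)
    return mixed

# Toy usage: 3 TLDR pairs are upsampled to match 10 title pairs.
tldr_pairs = [("paper 1", "tldr 1"), ("paper 2", "tldr 2"), ("paper 3", "tldr 3")]
title_pairs = [(f"arxiv paper {i}", f"title {i}") for i in range(10)]
mixed = build_multitask_training_set(tldr_pairs, title_pairs)
```

Sampling with replacement keeps every original TLDR example in the mix while balancing the two tasks, so the much larger title set does not dominate training.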

The second limitation is that BART restricts the input length, so we run BART under two settings: BART_abstract (SCITLDR_Abst) and BART_abstract_intro_conclusion (SCITLDR_AIC). These are the two inputs used to generate the title or TLDR. Existing work has shown that the most important information in a research paper is in the abstract, introduction, and conclusion.
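The two input settings can be pictured with a small sketch. Everything here is an assumption made for illustration: the `paper` dictionary keys, the `build_input` helper, and the use of whitespace-separated words as a stand-in for the model's real subword tokens. The default limit of 1024 mirrors BART's maximum input positions, but a real pipeline would truncate with the model's own tokenizer.

```python
# Hypothetical section keys for each input setting.
SETTINGS = {
    "abstract": ["abstract"],
    "aic": ["abstract", "introduction", "conclusion"],
}

def build_input(paper, setting, max_tokens=1024):
    """Concatenate the sections for a setting and truncate to max_tokens.

    Whitespace splitting is a rough approximation of tokenisation.
    """
    text = " ".join(paper.get(name, "") for name in SETTINGS[setting])
    tokens = text.split()
    return " ".join(tokens[:max_tokens])

paper = {
    "abstract": "We study extreme summarisation.",
    "introduction": "Reading papers takes time.",
    "methods": "Details omitted from the model input.",
    "conclusion": "Short summaries help readers.",
}
abstract_only = build_input(paper, "abstract")
aic_input = build_input(paper, "aic")
```

Sections outside the chosen setting, such as the methods section in this toy paper, never reach the model.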

Models Comparison
  1. Extractive models. PACSUM (unsupervised extension of TextRank) and BERTSUMEXT (supervised)

  2. Abstractive models. Different variations of BART

We used the ROUGE metric for evaluation. We compute the ROUGE score against each ground-truth TLDR and select the maximum.
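The maximum-over-references scoring can be sketched as below. The `rouge1_f1` function is a simplified unigram-overlap F1 written for illustration (lowercasing and whitespace splitting, no stemming); it is not the official ROUGE implementation, and real evaluations would use an established ROUGE package.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Simplified ROUGE-1 F1: clipped unigram overlap between two strings."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped matching unigram count
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def max_rouge1(candidate, references):
    """Score a candidate against every reference and keep the best score."""
    return max(rouge1_f1(candidate, ref) for ref in references)

generated = "a new dataset for extreme summarisation of papers"
references = [
    "we introduce a new dataset for extreme summarisation of scientific papers",
    "the paper proposes a multitask training strategy",
]
best = max_rouge1(generated, references)
```

Taking the maximum means a generated TLDR is not penalised for matching one valid summary closely while differing from another.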


The extractive oracle provides an upper bound on extractive performance. In Table 6, we can see a continuous increase in ROUGE scores as the input space increases. Specifically, there is an improvement of 5 ROUGE points when the introduction and conclusion are included as input, which shows their importance in generating a useful summary. Although there is a further ROUGE improvement from AIC to full text, it is small, suggesting that the other sections of the paper add less value than the abstract, introduction, and conclusion.
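To make the idea of an extractive oracle concrete, here is a minimal sketch that picks the single input sentence overlapping most with a reference TLDR. It rests on assumptions: the oracle here selects exactly one sentence, and `unigram_f1` is a simplified stand-in for ROUGE-1, so this is not necessarily how the oracle in the original experiments was built.

```python
from collections import Counter

def unigram_f1(candidate, reference):
    """Simplified unigram-overlap F1, standing in for ROUGE-1."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def oracle_sentence(sentences, reference):
    """Return (sentence, score) for the best-scoring sentence.

    `sentences` must be a non-empty list of strings.
    """
    scored = [(s, unigram_f1(s, reference)) for s in sentences]
    return max(scored, key=lambda pair: pair[1])

reference = "a new summarisation dataset"
abstract_sentences = ["we study cats"]
full_sentences = abstract_sentences + ["we propose a new summarisation dataset"]

abstract_best = oracle_sentence(abstract_sentences, reference)
full_best = oracle_sentence(full_sentences, reference)
```

Because the oracle chooses from whatever sentences it is given, widening the input can only keep or raise its score, which matches the trend described above.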

In Table 5, we can see that BART fine-tuned on the original SCITLDR is enough to outperform the other extractive and abstractive baselines. Pretraining BART on XSUM gives a further improvement; however, this improvement only applies to SCITLDR_AIC. Our multitask learning strategy outperforms all the baseline models and improves further on BART + XSUM. This shows the value of training the model for both title and TLDR generation. The figure below shows a qualitative example of summaries generated by different models.

Conclusion and Future Work

Future work could make use of information from the whole paper to capture more context. In addition, we could explicitly model the background knowledge of the reader, creating TLDRs tailored to who the reader is. Lastly, we could apply our annotation process to other datasets and convert any peer review comments into TLDR summaries.


