Objective and Contribution
This is the first paper to perform abstractive summarisation of long-form documents (research papers). The architecture consists of a hierarchical encoder that captures the discourse structure of a research paper and a discourse-aware attentive decoder that generates the summary.
The contributions of this paper are:
Proposed an abstractive model for summarising research papers
Introduced two large-scale datasets of long structured research papers obtained from arXiv and PubMed
Discourse-aware Summarisation Model
Our encoder is a hierarchical RNN that captures the document's discourse structure. The encoder first encodes each discourse section by feeding its words into a section-level RNN. We then take the final hidden states of all section RNNs and feed them into another RNN that encodes the whole document.
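The two-level encoding can be sketched as follows. This is a minimal illustration with a plain tanh RNN and toy dimensions; the paper's actual encoder is a trained bidirectional LSTM, and all weights and sizes here are made up.

```python
import numpy as np

def rnn_encode(inputs, W_x, W_h):
    """Run a simple tanh RNN over a sequence; return the final hidden state."""
    h = np.zeros(W_h.shape[0])
    for x in inputs:
        h = np.tanh(W_x @ x + W_h @ h)
    return h

rng = np.random.default_rng(0)
d = 8  # toy embedding / hidden size

# Toy document: 3 discourse sections, each a list of word embeddings.
sections = [[rng.standard_normal(d) for _ in range(5)] for _ in range(3)]

# Word-level RNN encodes each section independently into a section state...
W_x, W_h = rng.standard_normal((d, d)) * 0.1, rng.standard_normal((d, d)) * 0.1
section_states = [rnn_encode(words, W_x, W_h) for words in sections]

# ...then a section-level RNN encodes the section states into a document state.
U_x, U_h = rng.standard_normal((d, d)) * 0.1, rng.standard_normal((d, d)) * 0.1
doc_state = rnn_encode(section_states, U_x, U_h)
```

The key property is that word-level parameters are shared across sections, while the second RNN only ever sees one state per section, which keeps the top level short even for very long documents.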
At each decoding step, our decoder attends over the words of the document and also over the relevant discourse sections. The discourse-level attention is used to modify the word-level attention function. The decoder then uses its state and the resulting context vector to predict the next word of the summary.
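A minimal sketch of this two-level attention, assuming dot-product scoring (the paper uses learned additive attention) and random stand-in states: section-level weights rescale each word's score before the word-level softmax, so words in relevant sections receive more attention mass.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
d = 8
dec_state = rng.standard_normal(d)  # decoder hidden state at this step

# Hypothetical encoder outputs: one state per section, word states per section.
section_states = [rng.standard_normal(d) for _ in range(3)]
word_states = [[rng.standard_normal(d) for _ in range(4)] for _ in range(3)]

# Section-level attention: how relevant each discourse section is right now.
sec_weights = softmax(np.array([s @ dec_state for s in section_states]))

# Word-level scores, rescaled by the weight of the word's section.
scores, flat = [], []
for j, words in enumerate(word_states):
    for w in words:
        scores.append(sec_weights[j] * (w @ dec_state))
        flat.append(w)
word_weights = softmax(np.array(scores))

# Context vector: attention-weighted sum of word states.
context = sum(a * w for a, w in zip(word_weights, flat))
```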
We add an additional binary variable to the decoder that determines whether it should generate a word from the vocabulary or copy a word from the source. The copy probability is learned and optimised during training.
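In expectation, this switch mixes two distributions: the generation distribution over the vocabulary and the attention distribution over source positions. A sketch of that mixing for a single decoding step, with a made-up vocabulary, fixed copy probability, and random attention weights (in the real model the copy probability comes from a learned layer over the decoder state and context):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(2)
vocab = ["the", "model", "attention", "summary"]

# Hypothetical decoder outputs at one step.
vocab_dist = softmax(rng.standard_normal(len(vocab)))  # generation distribution
attn = softmax(rng.standard_normal(3))                 # attention over 3 source words
src_ids = [1, 2, 2]                                    # source words as vocab indices

# In the paper this scalar is learned; a fixed value stands in here.
p_copy = 0.4

final = (1 - p_copy) * vocab_dist
for a, i in zip(attn, src_ids):
    final[i] += p_copy * a  # route copy-probability mass to source words
```

Because both component distributions sum to one, the mixture is itself a valid distribution, and source words that appear multiple times accumulate copy mass from each occurrence.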
We track attention coverage to avoid generating repeated words or phrases. The coverage vector accumulates information about the attended document discourse sections and is incorporated into the attention function.
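The coverage idea can be sketched as follows: accumulate the attention weights over decoding steps and subtract the accumulated coverage from the attention scores, so positions that have already been attended to are down-weighted. This simplified version subtracts coverage directly; in the paper the coverage vector enters the learned attention function as an extra input.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(3)
n_words, steps = 5, 4
coverage = np.zeros(n_words)  # cumulative attention received by each position

for _ in range(steps):
    base_scores = rng.standard_normal(n_words)  # stand-in attention scores
    # Penalising already-covered positions discourages re-attending to them,
    # which is what suppresses repeated phrases in the generated summary.
    attn = softmax(base_scores - coverage)
    coverage += attn
```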
ArXiv and PubMed
We introduce two datasets of research papers: arXiv and PubMed. During data collection, we removed documents that were too long or too short, or that lacked an abstract or discourse structure. We also removed figures and tables and normalised math formulas and citation markers with special tokens. The paper's abstract serves as the ground-truth summary. The dataset statistics are displayed below, with average document length being 3,000–5,000 words.
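A minimal sketch of this preprocessing step. The token names, regex patterns, and length bounds below are illustrative assumptions, not the paper's exact pipeline:

```python
import re

def preprocess(doc: str, min_words: int = 50, max_words: int = 10_000):
    """Normalise math and citations; drop out-of-range documents (returns None)."""
    doc = re.sub(r"\$[^$]+\$", "@xmath", doc)           # inline math -> token
    doc = re.sub(r"\[\d+(,\s*\d+)*\]", "@xcite", doc)   # [12] or [3, 7] -> token
    n_words = len(doc.split())
    if not (min_words <= n_words <= max_words):
        return None  # too short or too long: excluded from the dataset
    return doc

sample = "We extend $f(x)=x^2$ as shown in [12]. " + "word " * 60
cleaned = preprocess(sample)
```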
Experiments and Results
We use ROUGE as the evaluation metric and compare our method with the following benchmark models:
LexRank, SumBasic, LSA (extractive)
Attention seq2seq, PG network (abstractive)
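For reference, the metric itself is based on n-gram overlap between the generated and reference summaries. A simplified ROUGE-1 F1 (unigram overlap only, no stemming or stopword handling, unlike the official toolkit) can be sketched as:

```python
from collections import Counter

def rouge_1_f(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate and a reference summary."""
    c, r = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((c & r).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

score = rouge_1_f("the model generates a summary", "the model writes a summary")
# 4 of 5 unigrams match in both directions, so precision = recall = F1 = 0.8
```

The paper also reports ROUGE-2 and ROUGE-L, which extend this idea to bigrams and longest common subsequences respectively.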
The two tables below show the results on the arXiv and PubMed datasets. Our discourse-aware model outperforms all the baseline models, both extractive and abstractive.
We also performed qualitative evaluation and observed that our model generates summaries that not only capture the problem introduction, as other SOTA benchmark models do, but also capture the methodology and impact of the paper.