Objective and Contribution

Proposed an unsupervised method that uses word embeddings and domain knowledge to extract the context of a citation from the referenced paper, achieving SOTA results. Citation contextualisation can enrich citation texts and improve downstream applications such as scientific summarisation.

What is citation text?

Citation texts are the passages in a citing paper that highlight contributions of the referenced paper. Existing work has shown that citation texts are useful for many downstream tasks such as search and summarisation. The problem with citation texts is that they often lack context, such as the methodology, assumptions, or conditions behind the reported results.

Contextualising Citations

Our approach extends language models by incorporating word embeddings and domain ontologies. The language model ranks candidate documents (reference spans) according to the query (the citation text). We used embeddings to score documents that are semantically similar to the query, and domain ontologies to capture information that might not be included in the embeddings. The domain ontologies can extend the embedding-based language model in two ways:

  1. Retrofitting. We used MeSH and the Protein Ontology (PO) to modify word embeddings and pull synonymous words closer together

  2. Interpolating. We directly modified the language model to incorporate the domain knowledge
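The retrofitting step above can be sketched as an iterative update that pulls each word vector toward its ontology neighbours, in the spirit of Faruqui et al.'s retrofitting. This is a minimal illustration, not the paper's exact implementation; the `alpha`/`beta` weights and the toy vocabulary are assumptions.

```python
import numpy as np

def retrofit(embeddings, ontology_neighbors, iterations=10, alpha=1.0, beta=1.0):
    """Pull word vectors toward their ontology neighbours (retrofitting sketch).

    embeddings: dict word -> np.ndarray (the original vectors, kept fixed)
    ontology_neighbors: dict word -> list of synonymous/related words
    """
    new_vecs = {w: v.copy() for w, v in embeddings.items()}
    for _ in range(iterations):
        for word, neighbors in ontology_neighbors.items():
            neighbors = [n for n in neighbors if n in new_vecs]
            if word not in new_vecs or not neighbors:
                continue
            # weighted average of the original vector and the neighbour vectors
            num = alpha * embeddings[word] + beta * sum(new_vecs[n] for n in neighbors)
            new_vecs[word] = num / (alpha + beta * len(neighbors))
    return new_vecs
```

After a few iterations, synonymous words linked in the ontology end up with noticeably higher cosine similarity than they had in the original space.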


We used the TAC 2014 Biomedical Summarisation dataset, which consists of 220 scientific biomedical journal articles and 313 citation texts (with their relevant contexts annotated). Our baseline models are:

  1. BM25

  2. Vector Space Model

  3. Dual Embedding Space Model

  4. Language Modelling with LDA smoothing
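In contrast to these baselines, the embedding-based language model at the core of our approach scores each reference span by mixing exact term matches with "translation" mass from semantically similar span terms. The sketch below is a hypothetical simplification; the mixing weight `mu` and the cosine-based translation probability are assumptions, not the paper's exact estimator.

```python
import numpy as np
from collections import Counter

def embedding_lm_score(query_tokens, doc_tokens, embeddings, mu=0.5):
    """Score a candidate reference span (doc) against a citation text (query)
    with an embedding-smoothed language model (illustrative sketch)."""
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)

    def sim(a, b):
        va, vb = embeddings[a], embeddings[b]
        return max(0.0, float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb))))

    score = 0.0
    for q in query_tokens:
        # probability from exact matches in the span
        p_direct = doc_counts.get(q, 0) / doc_len
        # probability mass "translated" from semantically similar span terms
        p_trans = sum(sim(q, w) * c / doc_len for w, c in doc_counts.items()
                      if q in embeddings and w in embeddings)
        p = mu * p_direct + (1 - mu) * p_trans
        score += np.log(p + 1e-12)
    return score
```

Spans containing terms close to the citation text in embedding space score higher than spans with unrelated vocabulary, even when there is no exact lexical overlap.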

We compare the performance of training with Wikipedia embeddings versus biomedical embeddings. In addition, we compare biomedical embeddings with retrofitted versus interpolated domain knowledge. For the intrinsic evaluation of the quality of extracted citation contexts, we assess the retrieved contexts for each citation from multiple aspects:

  1. Character offset overlaps between retrieved contexts and human annotations in terms of precision, recall, and F-score

  2. nDCG scores, where any partial overlap with the gold standard is treated as a correct context

  3. ROUGE-N scores to measure content similarity between retrieved contexts and gold standard

  4. Character precision at K
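The character-offset metrics in item 1 can be computed directly from span offsets. A minimal sketch, assuming spans are half-open `(start, end)` character intervals:

```python
def char_overlap_prf(retrieved_spans, gold_spans):
    """Character-offset precision, recall, and F-score between retrieved
    contexts and gold-standard annotations.

    Spans are (start, end) half-open character offsets into the paper.
    """
    retrieved, gold = set(), set()
    for s, e in retrieved_spans:
        retrieved.update(range(s, e))
    for s, e in gold_spans:
        gold.update(range(s, e))
    overlap = len(retrieved & gold)
    p = overlap / len(retrieved) if retrieved else 0.0
    r = overlap / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

Character precision at K (item 4) follows the same idea, restricted to the first K retrieved characters.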


Intrinsic Evaluation

The results are displayed in Table 1 below. As shown, our model with biomedical embeddings and domain-knowledge interpolation achieves the best performance on most of the evaluation metrics, indicating that our models are effective across many different aspects. We found that general embeddings do not provide much advantage over the best-performing baseline. However, with domain-specific embeddings we observed a 10% c-F improvement over the best-performing baseline. This is expected, as the biomedical embeddings better capture word relations in a biomedical context, as illustrated in Table 2, which shows the top related words for "expression" under the two types of embeddings. The last two rows of Table 1 show the benefit of including domain ontologies; we found that interpolation yielded a stronger improvement than retrofitting.

Our models also correlate well with human annotations. As shown in Table 3 below, when human precision is high (upper quartiles), our system performs better on the c-F metric and with higher confidence (lower standard deviation).

Extrinsic Evaluation

Citation-based summarisation can effectively capture the different contributions and aspects of a research paper by using its citation texts. Here, we compare summarisation quality without contextualisation against our proposed contextualisation approaches, using ROUGE scores as the evaluation metric and the following baseline summarisers:

  1. LexRank

  2. LSA-based

  3. SumBasic

  4. KL-Divergence
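As an illustration of one of these baselines, a greedy KL-divergence summariser picks sentences whose addition keeps the summary's unigram distribution closest to the full input's distribution. This is a minimal sketch with whitespace tokenisation and an assumed smoothing constant, not the evaluated implementation:

```python
import math
from collections import Counter

def kl_summarize(sentences, max_sents=2):
    """Greedily select sentences minimising KL(input || summary)."""
    doc_counts = Counter(w for s in sentences for w in s.split())
    doc_total = sum(doc_counts.values())
    doc_p = {w: c / doc_total for w, c in doc_counts.items()}

    def kl(summary_words):
        counts = Counter(summary_words)
        total = sum(counts.values())
        # smoothed KL divergence from the summary distribution to the input's
        return sum(p * math.log(p / ((counts.get(w, 0) + 0.1) /
                                     (total + 0.1 * len(doc_p))))
                   for w, p in doc_p.items())

    chosen, remaining = [], list(sentences)
    while remaining and len(chosen) < max_sents:
        best = min(remaining,
                   key=lambda s: kl([w for c in chosen + [s] for w in c.split()]))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

In our setting, the input to such a summariser is either the bare citation texts ("No context") or the citation texts augmented with our extracted contexts.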

The results are displayed below. The "No context" row shows the performance of all summarisation approaches on the citations alone, without any context. All other rows apply the same summarisation algorithms to citation texts augmented with our contextualisation methods. Our best-performing model, biomedical embeddings with domain-knowledge interpolation, significantly improves the quality of summaries in terms of ROUGE scores. This shows that the intrinsic quality of citation contextualisation has a direct impact on the end quality of the generated summaries.
