Objective and Contribution

Released SCIBERT, a pretrained language model trained on multiple scientific corpora to perform different downstream scientific NLP tasks. These tasks include sequence tagging, sentence classification, dependency parsing, and many more. SCIBERT achieved new SOTA results on several of these downstream tasks. We also performed extensive experimentation on the performance of fine-tuning vs task-specific architectures, the effect of frozen embeddings, and the effect of an in-domain vocabulary.

Methodology

How is SCIBERT different from BERT?
  1. Scientific vocabulary (SCIVOCAB)

  2. Pretraining on scientific corpora

SCIBERT is based on the BERT architecture: everything is the same as BERT except that it is pretrained on scientific corpora. BERT uses WordPiece to tokenise the input text and build the model's vocabulary (BASEVOCAB), which contains the most frequent words and subword units. We use the SentencePiece library to construct a new WordPiece vocabulary (SCIVOCAB) from scientific corpora. There is only a 42% token overlap between BASEVOCAB and SCIVOCAB, showing how different the frequently used terms in scientific text are and motivating a separate, in-domain vocabulary.
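As a rough illustration of how such a subword vocabulary can be built, here is a minimal sketch using the SentencePiece library. The corpus file name, vocabulary size, and subword algorithm are illustrative assumptions, not the exact recipe used to produce SCIVOCAB.

```python
# Minimal sketch: building a domain-specific subword vocabulary with SentencePiece.
# File names, vocabulary size, and the BPE model type are assumptions for illustration.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="scientific_corpus.txt",   # one sentence per line (hypothetical file)
    model_prefix="scivocab",         # writes scivocab.model and scivocab.vocab
    vocab_size=31000,                # roughly the size of BERT's BASEVOCAB
    model_type="bpe",                # subword units, comparable to WordPiece
    character_coverage=1.0,
)

sp = spm.SentencePieceProcessor(model_file="scivocab.model")
print(sp.encode("The patients received 50mg of cyclophosphamide daily.", out_type=str))
```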

SCIBERT is trained on 1.14M papers from Semantic Scholar. The full text of the papers is used, including the abstracts. The papers have an average length of 154 sentences, and sentences are split using ScispaCy.
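For reference, sentence splitting with ScispaCy looks roughly like the sketch below. The en_core_sci_sm model is one of the publicly available ScispaCy pipelines; treating it as the model used for SCIBERT's preprocessing is an assumption here.

```python
# Minimal sketch: splitting scientific text into sentences with ScispaCy.
# Requires: pip install scispacy spacy, plus the en_core_sci_sm model package.
import spacy

nlp = spacy.load("en_core_sci_sm")  # small ScispaCy pipeline (assumed choice)
doc = nlp("We pretrain on full-text papers. Abstracts are included as well.")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```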

Experimental Setup

What are the downstream NLP tasks?
  1. Named Entity Recognition (NER)

  2. PICO Extraction (PICO)

  3. Text Classification (CLS)

  4. Relation Extraction (REL)

  5. Dependency Parsing (DEP)

PICO extraction is a sequence labelling task that extracts spans within the text that describe the Participants, Interventions, Comparisons, and Outcomes in a clinical trial paper.

Model comparison
  1. Two BERT-Base models: the original BERT with BASEVOCAB, in cased and uncased versions

  2. Four SCIBERT models: cased and uncased versions, each with BASEVOCAB and SCIVOCAB

Cased models are used for NER and uncased models are used for all the other tasks.

Fine-tuning BERT

We follow the standard approach for fine-tuning BERT on the various downstream tasks. For CLS and REL, we feed the final BERT vector for the [CLS] token into a linear layer. For sequence labelling (NER and PICO), we feed the final BERT vector of each token into a linear layer. For dependency parsing, we use a model with dependency tag and arc embeddings and biaffine matrix attention over the BERT vectors.
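As an illustration of the CLS/REL setup, here is a minimal PyTorch sketch in which the final [CLS] vector is passed through a single linear layer. The Hugging Face checkpoint name and the number of labels are assumptions for illustration; for NER and PICO, the same linear layer would instead be applied to every token's final vector.

```python
# Minimal sketch: fine-tuning setup where the final [CLS] vector feeds a linear layer.
# Checkpoint name and label count are illustrative assumptions.
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "allenai/scibert_scivocab_uncased"  # publicly released SCIBERT checkpoint

class ClsHead(nn.Module):
    def __init__(self, num_labels: int = 3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(MODEL_NAME)
        self.linear = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_vec = out.last_hidden_state[:, 0]  # final BERT vector for the [CLS] token
        return self.linear(cls_vec)            # classification logits

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
batch = tokenizer(["The drug reduced tumour growth."], return_tensors="pt", padding=True)
logits = ClsHead()(batch["input_ids"], batch["attention_mask"])
```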

Frozen BERT embeddings

We also explore using BERT as a source of pretrained contextualised word embeddings, feeding them into simple task-specific models to see how it performs on these NLP tasks. For text classification, the model is a 2-layer BiLSTM followed by a multi-layer perceptron. For sequence labelling, it is a 2-layer BiLSTM followed by a conditional random field (CRF). For dependency parsing, it is the same biaffine model as above, with a 2-layer BiLSTM added.
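A minimal sketch of the frozen-embedding text classification variant is shown below: BERT's weights are frozen, and its contextualised embeddings feed a 2-layer BiLSTM followed by an MLP. The hidden sizes and the pooling of the BiLSTM's final states into a sentence representation are assumptions for illustration.

```python
# Minimal sketch: frozen BERT embeddings + 2-layer BiLSTM + MLP for text classification.
# Hidden sizes and the pooling choice are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel

class FrozenBertBiLSTM(nn.Module):
    def __init__(self, model_name="allenai/scibert_scivocab_uncased",
                 hidden=256, num_labels=3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        for p in self.encoder.parameters():
            p.requires_grad = False            # keep BERT frozen
        self.bilstm = nn.LSTM(self.encoder.config.hidden_size, hidden,
                              num_layers=2, bidirectional=True, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, num_labels))

    def forward(self, input_ids, attention_mask):
        with torch.no_grad():                  # embeddings only, no gradients into BERT
            emb = self.encoder(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        _, (h_n, _) = self.bilstm(emb)
        # concatenate the last layer's forward and backward final hidden states
        sentence_vec = torch.cat([h_n[-2], h_n[-1]], dim=-1)
        return self.mlp(sentence_vec)
```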

Results

The results are split into three sections: the biomedical domain, the computer science domain, and multiple domains. At a high level, SCIBERT outperforms BERT-Base on scientific text and achieves new SOTA results on many of the downstream tasks.

For the biomedical domain, SCIBERT outperformed BERT-Base on all seven biomedical datasets, achieving SOTA results on four of them and underperforming the existing SOTA on the other three. We also compared SCIBERT directly against BIOBERT, which is trained on a larger biomedical corpus: SCIBERT outperformed BIOBERT on two datasets and performed competitively on the other two.

For the computer science and multi-domain datasets, SCIBERT outperformed BERT-Base and achieved SOTA results on all five datasets.

The results also show a strong advantage for fine-tuning BERT over task-specific architectures built on top of frozen embeddings. Fine-tuned BERT consistently outperformed the frozen-embedding models, and fine-tuned SCIBERT outperformed frozen SCIBERT on all but two datasets. We also assessed the importance of an in-domain vocabulary and observed a 0.60 increase in F1 when using SCIVOCAB. The size of this improvement suggests that while an in-domain vocabulary is useful, it is not the key driver; the key driver is pretraining on scientific text.

Conclusion and Future Work

On top of achieving SOTA results on several downstream tasks, SCIBERT also scored competitively against BIOBERT on biomedical tasks. In future work, we would release a larger version of SCIBERT (matching BERT-Large) and experiment with papers from different domains, with the goal of training a single model that works well across multiple scientific domains.
