Objective and Contribution

We explore the settings, in terms of training-set size and the linguistic properties of the task, for which contextualised embeddings give large improvements in performance relative to classic pretrained embeddings and random word embeddings. We found that on tasks with plenty of labelled data and simple language, both baseline embeddings can match the performance of contextualised embeddings. We also found that contextualised embeddings give large gains on data with complex structure, ambiguous word usage, and words unseen in training.


We selected named entity recognition (NER), sentiment analysis, and natural language understanding (NLU) tasks to compare the performance of contextualised embeddings against GloVe and random embeddings. We chose these three tasks because they cover lexically diverse word-level, sentence-level, and sentence-pair classification.

Impact of Training Data Volume

Training data volume has a large impact on the relative performance of contextual versus non-contextual embeddings. In the figure below, we show that as the size of the training set increases, the performance of non-contextual embeddings improves at a faster rate. Applying the full datasets across tasks, we show that non-contextual embeddings often 1) perform within 10% absolute accuracy of contextual embeddings, and 2) match the performance of contextual embeddings trained on 1x – 16x less data. This tells us that for certain tasks, the sample efficiency of contextualised embeddings can outweigh the cost of labelling more data.
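The data-efficiency comparison above can be sketched as a small helper that, given two learning curves (training-set size versus accuracy), estimates how many times more data the non-contextual model needs to match the contextual model at a given size. All curve values and names below are hypothetical, for illustration only.

```python
def data_multiplier(sizes, acc_contextual, acc_noncontextual, at_size):
    """Return the multiple of `at_size` at which the non-contextual curve
    first reaches the contextual model's accuracy at `at_size`."""
    target = acc_contextual[sizes.index(at_size)]
    for size, acc in zip(sizes, acc_noncontextual):
        if acc >= target:
            return size / at_size
    return float("inf")  # never catches up within the measured range

# Hypothetical learning curves: sizes in number of labelled examples.
sizes = [1_000, 4_000, 16_000, 64_000]
bert_acc = [0.80, 0.85, 0.88, 0.90]
glove_acc = [0.70, 0.80, 0.86, 0.89]

# With these made-up curves, GloVe needs 4x the data to match BERT at 1k examples.
print(data_multiplier(sizes, bert_acc, glove_acc, 1_000))
```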

Study of Linguistic Properties

We explored three linguistic properties for each NLP task that allow us to distinguish simple language from more unstructured and diverse text:

  1. Complexity of sentence structure. How interdependent are different words in the sentence?

  2. Ambiguity in word usage. How many different labels does each word have in the training data?

  3. Prevalence of unseen words. How likely are we to encounter unseen words during inference time?

Complexity of sentence structure

We suspect that non-contextual embeddings would struggle with complex language, and so we define the metrics as follows:

  1. NER. The complexity is measured by the number of tokens spanned by an entity; the more tokens an entity spans, the more complex it is.

  2. Sentiment analysis. The complexity is measured by the average distance between pairs of dependent tokens in a sentence’s dependency tree; the longer the distance, the more complex it is.
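The two complexity metrics above can be sketched as follows, assuming pre-processed inputs: entity spans as `(start, end)` token offsets, and a dependency parse given as one head index per token (with the root pointing at itself). The function names and example values are our own, not from the paper.

```python
def ner_complexity(entity_spans):
    """Average entity length in tokens; longer entities = more complex."""
    return sum(end - start for start, end in entity_spans) / len(entity_spans)

def dep_distance_complexity(heads):
    """Average token-to-head distance in the dependency tree.

    `heads[i]` is the index of token i's head; the root has heads[i] == i
    and is excluded from the average.
    """
    dists = [abs(i - h) for i, h in enumerate(heads) if i != h]
    return sum(dists) / len(dists)

# Hypothetical 5-token sentence whose head indices are [2, 2, 2, 2, 3].
print(dep_distance_complexity([2, 2, 2, 2, 3]))  # (2 + 1 + 1 + 1) / 4 = 1.25
```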

Ambiguity in word usage

We suspect that non-contextual embeddings would perform poorly in disambiguating words and as such we define the metrics as follows:

  1. NER. The ambiguity is measured by the number of different labels a token appears with in the training set.

  2. Sentiment analysis. The ambiguity is measured by the entropy of a coin flip whose probability is the word’s average probability of appearing in a positive sentence.
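A minimal sketch of both ambiguity metrics, under our own assumptions about the input format (NER training data as `(token, label)` pairs; sentiment probabilities already averaged per word). The Bernoulli entropy is maximal (1 bit) when a word is positive half the time, i.e. maximally ambiguous.

```python
import math
from collections import defaultdict

def ner_ambiguity(token_label_pairs):
    """Map each token to the number of distinct labels it takes in training."""
    labels = defaultdict(set)
    for token, label in token_label_pairs:
        labels[token].add(label)
    return {tok: len(labs) for tok, labs in labels.items()}

def bernoulli_entropy(p):
    """Entropy (in bits) of a coin flip with P(positive) = p."""
    if p in (0.0, 1.0):
        return 0.0  # the word is unambiguous
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Hypothetical example: "Washington" appears as both a person and a location.
print(ner_ambiguity([("Washington", "PER"), ("Washington", "LOC"), ("won", "O")]))
print(bernoulli_entropy(0.5))  # 1.0, maximally ambiguous
```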

Prevalence of unseen words

We suspect that contextual embeddings perform better on unseen words, and so we define the metrics as follows:

  1. NER. Measured as the inverse of the number of times a word is seen in the training set.

  2. Sentiment analysis. Measured as the proportion of words in the sentence that were never seen in training.
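Both unseen-word metrics reduce to simple frequency lookups against the training data. A sketch, with our own treatment of fully unseen NER words (inverse frequency goes to infinity, i.e. maximally "unseen"):

```python
from collections import Counter

def ner_unseen_score(word, train_counts):
    """Inverse training frequency; a never-seen word scores infinity."""
    count = train_counts.get(word, 0)
    return 1 / count if count else float("inf")

def unseen_proportion(sentence_tokens, train_vocab):
    """Fraction of tokens in the sentence never observed in training."""
    unseen = sum(1 for tok in sentence_tokens if tok not in train_vocab)
    return unseen / len(sentence_tokens)

# Hypothetical tiny training corpus.
train_counts = Counter(["the", "the", "movie", "was", "good"])
print(ner_unseen_score("the", train_counts))  # 0.5
print(unseen_proportion(["the", "film", "was", "excellent"], set(train_counts)))  # 0.5
```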

Empirical validation of metrics

For both tables below, we show, for each metric defined above, that the accuracy gap between contextualised and non-contextualised embeddings is larger on inputs where the metric is large. We measured this by splitting each validation set into two halves at the median metric value and measuring the performance gap on both the lower and higher halves. With random embeddings, in 19 out of 21 cases the accuracy gap between BERT and random embeddings is larger on the half where the metric values are above the median. We observed similar results with GloVe embeddings: in 11 out of 14 cases, the gap between GloVe and BERT errors is larger on the half above the median.
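The median-split procedure can be sketched as below, assuming each validation example carries its metric value plus a correct/incorrect flag for each model (the representation and example values are ours, for illustration):

```python
import statistics

def median_split_gap(examples):
    """Each example is (metric_value, contextual_correct, baseline_correct),
    with correctness as 0/1. Returns the accuracy gap (contextual minus
    baseline) on the low and high halves, split at the median metric value."""
    median = statistics.median(m for m, _, _ in examples)
    low = [(c, b) for m, c, b in examples if m <= median]
    high = [(c, b) for m, c, b in examples if m > median]

    def gap(half):
        n = len(half)
        return sum(c for c, _ in half) / n - sum(b for _, b in half) / n

    return gap(low), gap(high)

# Hypothetical validation set: the baseline only errs on high-metric inputs,
# so the gap is 0.0 on the low half and 1.0 on the high half.
print(median_split_gap([(0.1, 1, 1), (0.2, 1, 1), (0.8, 1, 0), (0.9, 1, 0)]))
```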


