In this post, I will briefly highlight the different word embeddings used in NLP and their limitations as well as the current state-of-the-art (SOTA) contextualised embeddings.


In traditional NLP, words are regarded as discrete symbols and represented using one-hot vectors, where the dimension of the vector is equal to the size of the vocabulary. Sentences can then be represented with Bag-of-Words (BoW): a single vector in which each non-zero entry tells us that a known word is present in the sentence. The problem with BoW is that the sequential nature of the sentence is discarded, as the method only cares about whether known words are present. This means that BoW gives the same representation to sentences that contain the same words in a different order. In addition, the one-hot vectors of different words are orthogonal, so they can’t capture the similarity between words.
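To make both limitations concrete, here is a minimal sketch in plain Python. The helper names (`build_vocab`, `one_hot`, `bag_of_words`, `dot`) and the two example sentences are my own, chosen purely for illustration, and tokenisation is assumed to be simple whitespace splitting.

```python
def build_vocab(sentences):
    """Map each distinct word to an index, in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def one_hot(word, vocab):
    """One-hot vector for a word: all zeros except a 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec


def bag_of_words(sentence, vocab):
    """Binary BoW vector for a sentence: 1 wherever a known word is present."""
    vec = [0] * len(vocab)
    for word in sentence.lower().split():
        if word in vocab:
            vec[vocab[word]] = 1
    return vec


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


sentences = ["the dog bit the man", "the man bit the dog"]
vocab = build_vocab(sentences)

# Word order is lost: both sentences get the same BoW vector.
same_bow = bag_of_words(sentences[0], vocab) == bag_of_words(sentences[1], vocab)

# One-hot vectors of different words are orthogonal: their dot product is 0.
dog_man_similarity = dot(one_hot("dog", vocab), one_hot("man", vocab))

print(same_bow, dog_man_similarity)  # True 0
```

The two sentences mean opposite things yet are indistinguishable under BoW, and "dog" is exactly as dissimilar to "man" as it would be to any other word in the vocabulary.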

Distributional Word Embeddings

To solve the issues of BoW, word embeddings were introduced. Three popular types of word embeddings are Word2Vec (Mikolov et al., 2013), GloVe (Pennington, Socher, and Manning, 2014) and FastText (Bojanowski et al., 2017). Word2Vec is one of the most popular frameworks for learning to map words to vectors from a large corpus of text.

Using Word2Vec, the embedding for a particular word is built from the context words that surround it. Word2Vec has two variants: Skip-gram, which predicts the context words from the centre word, and CBOW (Continuous Bag-of-Words), which predicts the centre word from its context. With this framework, semantically similar words are mapped to similar vectors, given that they are most likely to be surrounded by similar context words. However, Word2Vec trains on one local context window at a time, so its training time increases as the corpus grows and it never makes direct use of corpus-wide co-occurrence statistics.
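The difference between the two variants is easiest to see in the training examples each one extracts from a sentence. The sketch below only generates those examples; it does not train a model. The function names and the `window` parameter are my own illustrative choices, not part of any Word2Vec library.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: one (centre, context) pair per context word within the window."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW: one (context_words, centre) example per position in the sentence."""
    examples = []
    for i, centre in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, centre))
    return examples


tokens = "the quick brown fox jumps".split()

# Skip-gram: predict each neighbour from the centre word.
print(skipgram_pairs(tokens, window=1)[:3])
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]

# CBOW: predict the centre word from all of its neighbours at once.
print(cbow_examples(tokens, window=1)[2])
# (['quick', 'fox'], 'brown')
```

Words that tend to appear with the same neighbours end up in similar training examples, which is what pushes their vectors close together during training.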

GloVe addresses these issues by combining the two main families of word embedding methods, global matrix factorisation and local context windows: it is trained on word-word co-occurrence counts aggregated over the whole corpus rather than on individual windows, which helps it scale to large corpora. However, both types of word embeddings suffer from the out-of-vocabulary (OOV) problem: they have no vector representation for words that they never saw during the training phase. To tackle the OOV problem, FastText (Bojanowski et al., 2017) was introduced, in which the representation of a word is constructed from sub-word information, namely its character n-grams.
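Here is a rough sketch of the sub-word idea behind FastText. Words are wrapped in the boundary markers `<` and `>` and split into character n-grams of length 3 to 6, as described in the FastText paper. Everything else is an assumption for illustration: the function names are mine, the n-gram vectors are a made-up toy table, and the real library additionally hashes n-grams into a fixed number of buckets, which I leave out here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of `word`, wrapped in boundary markers '<' and '>'."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def subword_vector(word, ngram_vectors, dim):
    """Sum the vectors of the word's known n-grams; unknown n-grams are skipped."""
    total = [0.0] * dim
    for gram in char_ngrams(word):
        vec = ngram_vectors.get(gram)
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
    return total


print(char_ngrams("where", min_n=3, max_n=3))
# ['<wh', 'whe', 'her', 'ere', 're>']

# Hypothetical toy n-gram vectors, as if learned from words such as "where".
toy_ngram_vectors = {"<wh": [1.0, 0.0], "whe": [0.0, 1.0]}

# "whence" was never seen as a whole word, but it shares n-grams with "where",
# so it still gets a non-zero vector.
print(subword_vector("whence", toy_ngram_vectors, dim=2))
# [1.0, 1.0]
```

Because an unseen word almost always shares some n-grams with words from the training corpus, it can still be given a representation, which is how the OOV problem is avoided.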

Contextualised Word Embeddings

All the word embeddings described above are known as distributional (or fixed) word embeddings. This means that each word has only one vector representation, formed during the training phase, and this same vector is used for the word in every context. However, as we know, the same word can have different meanings in different contexts: “bank” means one thing in “river bank” and another in “bank account”. Therefore, it makes more sense for the vector representation of a word to differ depending on the context in which it appears. To address this, contextualised word embeddings such as ELMo (Peters et al., 2018) were introduced, whereby the embedding of a word can change depending on the context it appears in.
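The contrast between a fixed lookup and a context-dependent embedding can be shown with a deliberately simplified toy. To be clear, this is not how ELMo works: ELMo computes its embeddings with a bidirectional LSTM language model, whereas the function below just averages a word's vector with its neighbours'. The vectors, the function names and the averaging rule are all made up for illustration.

```python
# Hypothetical 2-d static vectors, one per word.
static_vectors = {
    "river": [1.0, 0.0],
    "money": [0.0, 1.0],
    "bank": [0.5, 0.5],
}


def static_embedding(tokens, index):
    """Fixed embedding: a plain table lookup that ignores the context."""
    return static_vectors[tokens[index]]


def toy_contextual_embedding(tokens, index, window=1):
    """Toy stand-in for a contextual model: average the word with its neighbours."""
    lo = max(0, index - window)
    hi = min(len(tokens), index + window + 1)
    vectors = [static_vectors[t] for t in tokens[lo:hi]]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


sentence_a = ["river", "bank"]
sentence_b = ["money", "bank"]

# The fixed embedding of "bank" is identical in both sentences.
print(static_embedding(sentence_a, 1), static_embedding(sentence_b, 1))
# [0.5, 0.5] [0.5, 0.5]

# The toy contextual embedding of "bank" shifts towards its neighbour.
print(toy_contextual_embedding(sentence_a, 1))  # [0.75, 0.25]
print(toy_contextual_embedding(sentence_b, 1))  # [0.25, 0.75]
```

The point is only that the output for “bank” becomes a function of the whole sentence rather than of the word alone; a real contextual model learns that function from data instead of using a fixed average.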



