It’s always good to revisit the concepts that form the foundation of today’s NLP development. Today, I restudied Jay Alammar’s blog posts on NLP and transfer learning with a focus on BERT. This is also a good way to prepare for NLP interviews, because if you are using BERT at work, you are likely to be asked how BERT works!
What are the two steps of how BERT is developed?
The two steps are:
Semi-supervised training on a large corpus. This involves the pre-training objectives of masked language modelling and next-sentence prediction
Fine-tuning the pre-trained model on a specific task using a labelled dataset
How was BERT developed?
BERT builds on several strong NLP ideas: semi-supervised sequence learning (MLM and next-sentence prediction), ELMo (contextualised embeddings), ULMFiT (transfer learning with LSTMs), and lastly, the Transformer.
What is BERT?
BERT is essentially a stack of Transformer encoders (there is no decoder stack). There are two model sizes: BERT BASE, which has 12 encoder layers and is comparable in size to OpenAI GPT, and BERT LARGE, which has 24 encoder layers. In addition, both BERT sizes have larger feed-forward networks and more attention heads than the original Transformer architecture.
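The two published sizes can be summarised side by side (hidden size, head count, and parameter counts below are from the BERT paper; the dict layout itself is just for illustration):

```python
# Side-by-side of the two published BERT sizes (figures from the BERT paper)
bert_sizes = {
    "BERT-Base":  {"encoder_layers": 12, "hidden_size": 768,
                   "attention_heads": 12, "parameters": "~110M"},
    "BERT-Large": {"encoder_layers": 24, "hidden_size": 1024,
                   "attention_heads": 16, "parameters": "~340M"},
}

for name, cfg in bert_sizes.items():
    print(name, cfg)
```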
BERT has two pre-training objectives: MLM and next-sentence prediction. ELMo’s language model is bidirectional, whereas the OpenAI Transformer decoder is unidirectional. The issue with using Transformer encoders bidirectionally is that bidirectional conditioning would let each word indirectly see itself. BERT solves this by masking tokens and training a masked language model (MLM), which is why BERT can be a Transformer encoder stack! In addition, the next-sentence prediction task helps BERT better understand the relationship between pairs of sentences.
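The MLM corruption step can be sketched in a few lines. This is a toy illustration of the masking rule described in the BERT paper (roughly 15% of positions are selected; of those, 80% become [MASK], 10% are replaced by a random token, and 10% are left unchanged), not the actual implementation:

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Toy sketch of BERT's MLM corruption: pick ~15% of positions; of those,
    80% become [MASK], 10% become a random token, 10% stay unchanged.
    The model is trained to predict the original token at each picked position."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok                      # prediction target
            roll = rng.random()
            if roll < 0.8:
                corrupted[i] = "[MASK]"           # 80%: mask token
            elif roll < 0.9:
                corrupted[i] = rng.choice(vocab)  # 10%: random replacement
            # else 10%: leave the token unchanged
    return corrupted, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
corrupted, targets = mask_tokens(tokens, vocab=tokens)
print(corrupted, targets)
```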
BERT can also be used to create contextualised word embeddings!
What’s special about ELMo and how does it work?
For a machine to understand text, you first need to encode that text into a numerical representation. There are many ways to do that. The simplest is Bag-of-Words. Word2Vec, GloVe, and FastText are distributional word embeddings that encode words as vectors capturing semantic and syntactic relationships between words. For example, “London” and “UK”, or “had” and “has”.
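The Bag-of-Words baseline is easy to show concretely: each sentence becomes a vector of word counts over a shared vocabulary, discarding word order and context entirely (the example sentences below are my own):

```python
from collections import Counter

# Minimal bag-of-words sketch: a sentence becomes a vector of word counts
# over a shared vocabulary (word order and context are thrown away).
sentences = ["london is in the uk", "the uk includes london"]
vocab = sorted({w for s in sentences for w in s.split()})

def bow_vector(sentence, vocab):
    counts = Counter(sentence.split())
    return [counts[w] for w in vocab]

for s in sentences:
    print(s, "->", bow_vector(s, vocab))
```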
These distributional word embeddings are static, meaning each word has a fixed vector regardless of context. As we know, a word’s meaning changes depending on the surrounding context. ELMo produces contextualised word embeddings, where each word’s embedding changes depending on that context. ELMo takes in the entire sentence and uses a bidirectional LSTM to encode each word into a vector. ELMo was trained to predict the next word in a sequence of words (language modelling).
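The static-vs-contextual distinction can be demonstrated with a deliberately crude stand-in (this is not ELMo; it just averages a word’s static vector with its neighbours’ vectors, so the same word gets different representations in different sentences):

```python
import numpy as np

# Toy illustration of "contextual" vs "static" embeddings (not ELMo itself):
# give each word a random static vector, then define a word's contextual
# vector as the mean of its own vector and its immediate neighbours'.
rng = np.random.default_rng(0)
static = {w: rng.normal(size=4)
          for w in ["river", "bank", "money", "the", "deposit"]}

def contextual(sentence, i):
    window = sentence[max(0, i - 1): i + 2]   # word plus immediate neighbours
    return np.mean([static[w] for w in window], axis=0)

# Same word "bank", two different contexts -> two different vectors
v1 = contextual(["river", "bank"], 1)
v2 = contextual(["money", "bank"], 1)
print(np.allclose(v1, v2))  # False
```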
What is OpenAI Transformer (GPT)?
The OpenAI Transformer pre-trains a Transformer decoder for language modelling. The decoder is a good choice because it is built to mask future tokens, which is key for language modelling. There is no encoder. We can pre-train the OpenAI Transformer by feeding it a large corpus of text. To fine-tune for downstream tasks, simply take the output of the OpenAI Transformer and feed it into a feed-forward network with a softmax. Note that different tasks require different input transformations.
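The “mask future tokens” mechanism is the decoder’s causal attention mask. A minimal numpy sketch, assuming all raw attention scores are zero just to show the mask’s effect: positions above the diagonal are set to negative infinity, so they receive zero weight after the softmax and each position attends only to itself and earlier positions:

```python
import numpy as np

# Causal (look-ahead) mask for a 5-token sequence: position i may attend
# only to positions <= i; future positions get -inf and vanish after softmax.
seq_len = 5
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
scores = np.zeros((seq_len, seq_len)) + mask        # pretend raw scores are 0
weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)  # row softmax
print(weights.round(2))   # row i is uniform over the first i+1 positions
```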
What’s the process of using BERT for sentiment classification?
Use DistilBERT (or pre-trained BERT) to generate sentence embeddings. This involves using DistilBertTokenizer to tokenise the text (WordPiece) and add the special tokens [CLS] and [SEP]. The tokeniser then converts each token into its respective id from the lookup table.
Collect the first output hidden state ([CLS]) as that’s all we need for classification
Train / Test split the dataset for our sentiment classifier (logistic regression in the tutorial)
Train the classifier using sklearn
Evaluate the classifier on test set
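The sklearn side of these steps can be sketched end to end. Note the caveat: in the real tutorial, the 768-dimensional vectors are the [CLS] hidden states from DistilBERT; here I fabricate separable vectors so the example runs without downloading a model, and only steps 3-5 (split, train, evaluate) are shown faithfully:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in for steps 1-2: fabricate 768-d "CLS embeddings" with a small
# class-dependent shift. In the real pipeline these come from DistilBERT.
rng = np.random.default_rng(0)
n, dim = 200, 768
labels = rng.integers(0, 2, size=n)               # 0 = negative, 1 = positive
cls_embeddings = rng.normal(size=(n, dim)) + labels[:, None] * 0.5

# Steps 3-5: train/test split, fit logistic regression, evaluate.
X_train, X_test, y_train, y_test = train_test_split(
    cls_embeddings, labels, test_size=0.25, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```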