This is a multi-part blog series on the Transformer architecture, on which many state-of-the-art models, such as BERT, are built. In this post, I will give a brief overview of the different components of the Transformer architecture and expand on each of these components in subsequent posts.
The Transformer was introduced by Vaswani et al. (2017) and is based solely on attention mechanisms and feed-forward neural networks. It achieved state-of-the-art performance on machine translation tasks and requires significantly less time to train because it is more parallelisable than recurrent and convolutional neural networks.
The overall Transformer architecture is shown below. A Transformer consists of two main components: an encoding component and a decoding component. The encoding component is essentially a stack of encoders, and the decoding component is a stack of decoders. All the encoders share the same structure but have their own weights, and each consists of two sub-layers: a multi-head self-attention mechanism (Multi-Head Attention) and a position-wise feed-forward neural network (Feed Forward). The position-wise feed-forward neural network is applied to each input position independently and can therefore be executed in parallel. A residual connection followed by layer normalisation is applied around each of the two sub-layers (Add & Norm).
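To make the "Add & Norm" wrapping concrete, here is a minimal numpy sketch of a single encoder layer. It is a simplification, not the full architecture: attention is single-head with the query/key/value projections omitted, and all weight matrices (`w1`, `w2`) are random placeholders I introduce purely for illustration. The point is the structure: each sub-layer's output is added back to its input (residual connection) and then layer-normalised.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalise each position's feature vector to zero mean, unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def self_attention(x):
    # Simplified single-head scaled dot-product self-attention; a real
    # layer would first project x into separate queries, keys, and values.
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def feed_forward(x, w1, w2):
    # Position-wise FFN: the same two-layer network is applied to every
    # row (position) independently, so positions can run in parallel.
    return np.maximum(0, x @ w1) @ w2

def encoder_layer(x, w1, w2):
    # Each sub-layer is wrapped in a residual connection followed by
    # layer normalisation ("Add & Norm").
    x = layer_norm(x + self_attention(x))
    x = layer_norm(x + feed_forward(x, w1, w2))
    return x

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 16        # toy sizes, not the paper's
x = rng.standard_normal((seq_len, d_model))
w1 = rng.standard_normal((d_model, d_ff))
w2 = rng.standard_normal((d_ff, d_model))
out = encoder_layer(x, w1, w2)
print(out.shape)  # output keeps the input's (seq_len, d_model) shape
```

Note that the layer's output has the same shape as its input, which is what allows encoders to be stacked on top of one another.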