Since we are not using RNNs or CNNs, for the Transformer to account for the order of the sequence, we add vectors (known as positional encodings) to each of the input embeddings. These vectors help the model determine the relative or absolute position of each word, enabling it to distinguish the same word at different positions. In Vaswani et al. (2017), positional encodings were computed using sine and cosine functions of different frequencies.
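The sinusoidal scheme from Vaswani et al. (2017) can be sketched as follows; this is a minimal NumPy version, with `max_len` and `d_model` chosen only for illustration:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings (Vaswani et al., 2017).

    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    positions = np.arange(max_len)[:, None]       # shape (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]      # even dimensions
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even indices: sine
    pe[:, 1::2] = np.cos(angles)                  # odd indices: cosine
    return pe

pe = positional_encoding(max_len=50, d_model=16)
# The encodings are simply added to the input embeddings:
# x = token_embeddings + pe[:seq_len]
```

Because each dimension oscillates at a different frequency, every position receives a distinct pattern, and relative offsets correspond to linear transformations of the encoding.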

Although the decoder has a very similar structure to the encoder, the two differ in two ways. First, the decoder's multi-head self-attention layer is masked to ensure that the decoder can only attend to earlier positions when generating output sequences, thereby preserving the auto-regressive property. Second, the decoder has a third sub-layer, which performs multi-head attention over the output of the encoder stack, similar to the typical attention mechanisms in seq2seq models described in the previous section. In other words, this layer behaves like a normal multi-head attention layer except that it takes its query matrix Q from the output of the previous sub-layer (masked multi-head attention) and its key and value matrices K and V from the output of the encoder. This allows the decoder to attend over all positions in the input sequence and generate its own context values to be fed into the feed-forward neural network.
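Both decoder attention sub-layers can be illustrated with a single-head, single-example sketch. This is an assumed minimal implementation (one head, no learned projections, toy dimensions), not the full multi-head layer:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention with an optional causal mask."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if causal:
        # Mask out future positions so each position can attend
        # only to itself and earlier positions (auto-regressive).
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -1e9, scores)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))      # decoder states (tgt_len=5, d=8)
enc = rng.standard_normal((7, 8))    # encoder output (src_len=7, d=8)

# Sub-layer 1: masked self-attention over the decoder's own positions.
self_out = attention(x, x, x, causal=True)

# Sub-layer 2: encoder-decoder attention — Q from the previous
# sub-layer, K and V from the encoder output.
cross_out = attention(self_out, enc, enc)
```

Note that in the cross-attention call the queries and the keys/values may have different sequence lengths, since the queries come from the target side while the keys and values come from the source side.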

The output of the decoder stack is a vector, which we feed into a linear layer followed by a softmax layer. The linear layer is a typical affine layer that transforms the vector into a logits vector, where each cell corresponds to the score of a unique word. This means the logits vector has the same size as our vocabulary. These scores are then converted into probabilities by the softmax layer, and the cell (word) with the highest probability is the output for this decoding step.
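The final projection step can be sketched in a few lines. The weight matrix `W`, bias `b`, and the tiny vocabulary size here are hypothetical placeholders for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
vocab_size, d_model = 10, 8                      # toy sizes
W = rng.standard_normal((d_model, vocab_size))   # hypothetical learned weights
b = np.zeros(vocab_size)

h = rng.standard_normal(d_model)     # decoder output for the current step
logits = h @ W + b                   # one score per vocabulary word
probs = softmax(logits)              # scores -> probabilities
predicted_id = int(np.argmax(probs)) # word with the highest probability
```

In practice the argmax is greedy decoding; alternatives such as beam search keep several candidate sequences instead of committing to the single best word at each step.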