The self-attention mechanism, also known as scaled dot-product attention, takes three inputs: query, key and value vectors. These vectors are created by multiplying either the word embeddings or the output of the previous encoder in the stack by three weight matrices that are learned during training. Once the vectors have been computed, we need the attention scores, which tell the model which parts of the sequence to focus on when encoding a word. The score is computed by taking the dot product of the query vector with each of the key vectors in the sequence. Each score is then divided by the square root of the dimension of the key vectors, and the softmax function is applied to obtain the weights on the value vectors. Finally, each value vector is multiplied by its corresponding weight and the results are summed to produce the output of the self-attention layer at this position. In practice, we deal with multiple queries at once: stacking them together forms the matrix Q, and given matrices K and V, which hold the sets of keys and values, we can compute the context vector for every query vector as follows:
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V
\]
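The computation above can be sketched directly in NumPy. This is a minimal illustration, not a full implementation: the function name, the toy shapes and the random inputs are assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax: subtract the row-wise max before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # dot products, scaled by sqrt(d_k)
    weights = softmax(scores, axis=-1)  # one distribution over keys per query
    return weights @ V                  # weighted sum of the value vectors

# toy example (assumed sizes): 3 queries, 4 keys/values, d_k = d_v = 8
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 8): one context vector per query
```

Note that each row of the softmax output sums to 1, so every output row is a convex combination of the value vectors.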
where dk is the dimension of the key vectors. This concludes the process of a single attention function. The transformer network uses multi-head self-attention, which performs multiple single attention functions (multiple heads), each with a different set of query, key and value weight matrices. This is illustrated by the following formula:
\[
\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)
\]
The purpose of multi-head self-attention is to let the model attend to information from multiple representation subspaces at different positions. The per-head weight matrices are initialised randomly and project the inputs into different representation subspaces. We can then perform the attention functions over these different subspaces (queries, keys and values) in parallel. A transformer with 8 attention heads, as outlined in Vaswani et al. (2017), therefore produces 8 different output matrices. Since the feed-forward neural network expects a single matrix, we condense the head outputs into one matrix by concatenating them and multiplying the result by another weight matrix, also learned during training. This is illustrated as follows:

\[
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^O
\]
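The multi-head procedure, including the final concatenation and output projection, can be sketched as follows in NumPy. The sizes (d_model = 16, 8 heads, a 5-token sequence) and all variable names are illustrative assumptions; real implementations typically fuse the per-head projections into single larger matrices for efficiency.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k), axis=-1) @ V

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    # W_Q, W_K, W_V: per-head projection matrices, each (d_model, d_k)
    # W_O: output projection, (h * d_v, d_model), learned during training
    heads = [attention(X @ wq, X @ wk, X @ wv)
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    # concatenate the h head outputs, then project back to d_model
    return np.concatenate(heads, axis=-1) @ W_O

# assumed toy sizes: d_model = 16, h = 8 heads, d_k = d_v = d_model // h = 2
rng = np.random.default_rng(0)
d_model, h = 16, 8
d_k = d_model // h
X = rng.standard_normal((5, d_model))  # a 5-token input sequence
W_Q = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_K = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_V = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_O = rng.standard_normal((h * d_k, d_model))
out = multi_head_attention(X, W_Q, W_K, W_V, W_O)
print(out.shape)  # (5, 16): same shape as the input, one matrix for the FFN
```

Setting d_k = d_model / h keeps the total computation comparable to a single head with full dimensionality, which is the choice made in Vaswani et al. (2017).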