How do Transformers capture the sequential nature of sentences?
Self-attention on its own is order-agnostic, so a positional encoding is added to each input word embedding to tell the transformer where the word sits in the sequence, and hence its position relative to the other words.
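As a minimal sketch, here is one common choice of positional encoding, the fixed sinusoidal scheme. The notes above do not say which scheme is meant, so treat this as an assumption; learned position embeddings are another option. The function name, the toy sizes, and the random embeddings are illustrative only, and the code assumes an even d_model.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of sinusoidal encodings (d_model must be even)."""
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # even dimension indices, (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions use cosine
    return pe

# Toy word embeddings for a 5-token sentence (random values, for illustration only).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5, 8))

# The encoding is simply added to the embeddings before the first layer.
inputs = embeddings + sinusoidal_positional_encoding(5, 8)
```

Because every position gets a different vector, two occurrences of the same word at different positions no longer look identical to the attention layers.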
What is a residual block?
A residual block is one where the input to a layer is passed to the next layer and is also added, through a skip connection, to the output a few layers later. This lets information (and gradients) flow between early and late layers in fewer steps by skipping the layers in between.
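A rough sketch of the idea, assuming a block made of two small dense layers. The layer shapes, the ReLU, and the function names are hypothetical choices for illustration, not something the notes specify.

```python
import numpy as np

def dense_relu(x, w, b):
    """Hypothetical inner layer: a linear map followed by ReLU."""
    return np.maximum(0.0, x @ w + b)

def residual_block(x, w1, b1, w2, b2):
    """Run two layers, then add the block's input back to the result."""
    h = dense_relu(x, w1, b1)
    h = h @ w2 + b2
    return x + h  # skip connection: x bypasses both layers

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                  # 4 tokens, 8 features each
w1 = rng.normal(size=(8, 8)) * 0.1
w2 = rng.normal(size=(8, 8)) * 0.1
b1 = np.zeros(8)
b2 = np.zeros(8)

y = residual_block(x, w1, b1, w2, b2)        # same shape as x
```

If the inner layers contribute nothing (for example, all weights are zero), the block simply returns its input, which is why stacking many such blocks does not block the flow of information.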
What’s the purpose of masking in a transformer?
The attention layer uses masking for two reasons:
We mask the PAD tokens so that the network does not pay attention to the padding.
We mask future tokens in the decoder because the transformer isn’t recurrent: it processes all positions in parallel, so without a mask each position could see the “future” tokens it is supposed to predict.
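The two masks can be sketched as boolean arrays that are applied to the attention scores before the softmax. This is a minimal illustration under assumptions: PAD_ID = 0, the helper names, and the use of a large negative constant for masked scores are all choices made here, not details given in the notes.

```python
import numpy as np

PAD_ID = 0  # hypothetical id of the PAD token

def padding_mask(token_ids):
    """True where the key position is a real token. Shape: (batch, 1, seq_len)."""
    return (token_ids != PAD_ID)[:, None, :]

def causal_mask(seq_len):
    """True where query i may look at key j, i.e. j <= i. Shape: (seq_len, seq_len)."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_softmax(scores, mask):
    """Softmax over the last axis, giving zero weight to masked-out positions."""
    scores = np.where(mask, scores, -1e9)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

token_ids = np.array([[5, 7, 9, 0, 0]])            # one sentence, last two tokens are PAD
mask = padding_mask(token_ids) & causal_mask(5)    # broadcasts to (1, 5, 5)

scores = np.zeros((1, 5, 5))                       # dummy attention scores
weights = masked_softmax(scores, mask)
```

In the resulting weights, the columns for the two PAD positions are zero, and so is everything above the diagonal, so no position attends to padding or to a later token.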
How does the transformer use multi-head attention?
Multi-head attention works with three different matrices: Queries, Keys, and Values. The transformer uses multi-head attention in three different ways:
The encoder-decoder attention layer. Here, the queries come from the previous decoder layer and the keys and values come from the output of the encoder (the context), allowing the decoder at each time step to attend to all the input tokens.
The encoder layer. Here, all the queries, keys, and values come from the previous layer in the encoder, allowing the current encoder layer to attend to all positions in the previous layer.
The decoder layer. Here, the multi-head attention mechanism behaves similarly to the encoder layer and uses masking to prevent the attention mechanism from attending to “future” tokens.
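The three uses differ only in where the queries, keys, and values come from and in whether a mask is applied. The sketch below makes that concrete with a bare-bones multi-head attention in NumPy. It is an illustration under assumptions: the function and variable names, the toy sizes, and the random weights are hypothetical, and residual connections, layer normalization, and feed-forward sublayers are left out.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(query_src, kv_src, params, num_heads, mask=None):
    """query_src: (n_q, d_model) supplies the queries; kv_src: (n_k, d_model) supplies keys and values."""
    w_q, w_k, w_v, w_o = params
    n_q, d_model = query_src.shape
    n_k = kv_src.shape[0]
    d_head = d_model // num_heads

    def split_heads(x, n):
        # (n, d_model) -> (num_heads, n, d_head)
        return x.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(query_src @ w_q, n_q)
    k = split_heads(kv_src @ w_k, n_k)
    v = split_heads(kv_src @ w_v, n_k)

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (num_heads, n_q, n_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)              # hide disallowed positions
    heads = softmax(scores) @ v                            # (num_heads, n_q, d_head)
    merged = heads.transpose(1, 0, 2).reshape(n_q, d_model)
    return merged @ w_o

rng = np.random.default_rng(0)
d_model, num_heads = 8, 2

def init_params():
    """Four random projection matrices: W_q, W_k, W_v, W_o."""
    return [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4)]

enc_x = rng.normal(size=(6, d_model))   # 6 source tokens
dec_x = rng.normal(size=(4, d_model))   # 4 target tokens
causal = np.tril(np.ones((4, 4), dtype=bool))

# Encoder layer: queries, keys, and values all come from the encoder side.
enc_out = multi_head_attention(enc_x, enc_x, init_params(), num_heads)

# Decoder layer: same idea on the decoder side, plus a mask over future tokens.
dec_self = multi_head_attention(dec_x, dec_x, init_params(), num_heads, mask=causal)

# Encoder-decoder layer: queries from the decoder, keys and values from the encoder output.
cross = multi_head_attention(dec_self, enc_out, init_params(), num_heads)
```

Note that the cross-attention output has one row per decoder position even though it attends over six encoder positions: the number of queries sets the output length, while the keys and values set what can be attended to.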