What’s so special about DIET?

DIET stands for Dual Intent and Entity Transformer. It is a multi-task transformer architecture that performs intent classification and entity recognition together. It is made up of multiple components that can be swapped in and out, which gives us the flexibility to experiment. For example, we could try different pretrained word embeddings such as BERT or GloVe.

Many pre-trained language models are heavy: they need a lot of compute power and their inference time is long. So despite their strong performance, they aren’t designed for conversational AI applications. DIET is different as it:

  • Is a modular architecture that allows software developers to have more flexibility in experimentation

  • Matches pre-trained language models in terms of accuracy

  • Outperforms the current SOTA and is 6X faster to train

What’s the DIET architecture?

First of all, what training data do we need to train the DIET model? The architecture requires each example in the dataset to have the input text, its intent label(s) and its entity label(s).
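As a rough illustration of those three fields, a single training example could be represented like this. The class names, the field names and the intent label play_game are assumptions made for this sketch; they are not RASA's actual training data format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EntitySpan:
    start: int  # character offset where the entity begins (inclusive)
    end: int    # character offset where the entity ends (exclusive)
    label: str  # entity type, e.g. "game_name"


@dataclass
class TrainingExample:
    text: str                     # the raw input sentence
    intent: str                   # the ground-truth intent label
    entities: List[EntitySpan] = field(default_factory=list)


# Hypothetical example: one sentence, one intent, one entity span.
example = TrainingExample(
    text="play ping pong",
    intent="play_game",
    entities=[EntitySpan(start=5, end=14, label="game_name")],
)

span = example.entities[0]
print(example.text[span.start:span.end], "->", span.label)
```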

The architecture has many components, and the total loss it optimises (minimises) is made up of three losses:

  1. Entity loss
  2. Intent loss
  3. Mask loss
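A minimal sketch of how the three losses could be combined, assuming the total is a plain unweighted sum. The function and the boolean flags are hypothetical names for this illustration, not part of the RASA library.

```python
def total_loss(entity_loss: float, intent_loss: float, mask_loss: float,
               use_entity: bool = True, use_intent: bool = True,
               use_mask: bool = True) -> float:
    """Sum the enabled loss terms (assumption: unweighted sum)."""
    terms = []
    if use_entity:
        terms.append(entity_loss)
    if use_intent:
        terms.append(intent_loss)
    if use_mask:
        terms.append(mask_loss)
    return sum(terms)


# All three tasks are trained jointly.
print(total_loss(0.5, 0.25, 0.25))

# Only intent classification: the other two terms are ignored.
print(total_loss(0.5, 0.25, 0.25, use_entity=False, use_mask=False))
```

Because every term flows into the same total, a set of weights that only minimises one of the losses still leaves the total high.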

Entity Loss

How do the input sentences link to the entity loss during training? The following layers connect the input sentences to the entity loss:

  1. Individual token pathway
  2. Transformer layer
  3. Conditional Random Field (CRF)

The individual token pathway is broken into two sub-pathways:

  1. Pretrained embeddings. This can be BERT or GloVe. Here you can experiment with different embeddings. The output is the numerical representation of the token

  2. Sparse features + Feed Forward Neural Network (FFNN). One-hot encodings of character-level n-grams are used as features and passed to a feed-forward layer

The outputs of the two sub-pathways are merged and fed into another FFNN. The output of this FFNN is a 256-dimensional vector.
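To make the sparse-feature sub-pathway more concrete, here is a small sketch of character-level n-gram features. The n-gram range (1 to 3), the vocabulary building and the function names are assumptions for illustration; RASA's own featurizers are configured differently.

```python
from typing import Dict, List


def char_ngrams(token: str, n_min: int = 1, n_max: int = 3) -> List[str]:
    """Return all character n-grams of the token for n in [n_min, n_max]."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(token) - n + 1):
            grams.append(token[i:i + n])
    return grams


def build_vocab(tokens: List[str]) -> Dict[str, int]:
    """Assign an index to every n-gram seen in the given tokens."""
    vocab: Dict[str, int] = {}
    for token in tokens:
        for gram in char_ngrams(token):
            vocab.setdefault(gram, len(vocab))
    return vocab


def multi_hot(token: str, vocab: Dict[str, int]) -> List[int]:
    """Sparse vector with a 1 for every known n-gram that occurs in the token."""
    vec = [0] * len(vocab)
    for gram in char_ngrams(token):
        if gram in vocab:
            vec[vocab[gram]] = 1
    return vec


tokens = ["play", "ping", "pong"]
vocab = build_vocab(tokens)
features = [multi_hot(t, vocab) for t in tokens]
print(len(vocab), features[1])
```

Each vector is mostly zeros, which is why these are called sparse features. In the architecture they would then go through a feed-forward layer before being merged with the pretrained embedding.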

The outputs of the individual token pathway are fed into a 2-layer transformer. Subsequently, the outputs of the transformer are fed into the conditional random field (CRF) layer. Inside the CRF, an FFNN takes the transformer output for each token and classifies which entity that token belongs to. For example, the word “ping” has the entity game_name.

Between the FFNNs of neighbouring tokens, we have a transition matrix. The idea behind the transition matrix is to capture the situation where, if a token is an entity, its neighbouring tokens have a high probability of being an entity too. For each token, we have the ground-truth entity label, and this is used during training to learn the weights of both the FFNN and the transition matrix.
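The effect of the transition matrix can be sketched with Viterbi decoding, a standard way to pick the best tag sequence in a linear-chain CRF. The scores below are made up for illustration, and the sketch only covers decoding; the CRF training loss is not shown.

```python
from typing import Dict, List


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the highest-scoring tag sequence.

    emissions[t][tag]    : per-token score for each tag (the FFNN output)
    transitions[a][b]    : score for moving from tag a to tag b
    """
    tags = list(emissions[0].keys())
    # best[t][tag] = best score of any sequence that ends in `tag` at step t
    best = [dict(emissions[0])]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[t - 1][p] + transitions[p][cur])
            scores[cur] = best[t - 1][prev] + transitions[prev][cur] + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))


# Made-up scores for the tokens "play", "ping", "pong".
emissions = [
    {"O": 2.0, "game_name": 0.0},
    {"O": 0.6, "game_name": 0.4},  # "ping" alone looks slightly more like "O"
    {"O": 0.0, "game_name": 2.0},
]
transitions = {
    "O": {"O": 0.5, "game_name": 0.0},
    "game_name": {"O": -1.0, "game_name": 1.0},  # entities tend to continue
}

greedy = [max(e, key=e.get) for e in emissions]
print(greedy)
print(viterbi(emissions, transitions))
```

Picking the best tag per token independently labels “ping” as O, while the transition scores pull it into the same entity as its neighbour “pong”.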

Intent Loss

There is a special class token (__CLS__) in the DIET architecture figure above. The idea behind this special class token is that it summarises the entire input sentence, giving a numerical representation of the whole sentence. This special class token follows the same pathway as the individual tokens; however, the outputs of the pretrained embeddings and sparse features are slightly different:

  1. The output of the pretrained embeddings is now a sentence embedding. This is computed differently depending on which pretrained embeddings are used.

  2. The sparse features for the special class token are the sum of all the separate sparse features of individual tokens.
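The second point, summing the per-token sparse features, can be sketched in a few lines. The function name and the toy vectors are assumptions for illustration.

```python
from typing import List


def cls_sparse_features(token_features: List[List[int]]) -> List[int]:
    """Element-wise sum of the sparse feature vectors of all tokens."""
    return [sum(column) for column in zip(*token_features)]


# Toy sparse vectors for three tokens over a 4-feature vocabulary.
token_features = [
    [1, 0, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
]
print(cls_sparse_features(token_features))
```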

Since the class token summarises the entire input sentence, it should be able to predict the intent. The special class token goes through the individual token pathway, the transformer layer, and then into an embedding layer. Concurrently, the ground-truth intent of the input sentence goes through its own embedding layer. The similarity between the outputs of the two embedding layers is computed, and the intent loss is derived from it.
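A minimal sketch of a similarity-based intent loss, assuming dot product as the similarity function and a softmax cross-entropy over all candidate intents. This is a simplification for illustration and the embeddings are made up; the exact loss used in DIET may differ in its details.

```python
import math
from typing import Dict, List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def intent_loss(cls_embedding: List[float],
                intent_embeddings: Dict[str, List[float]],
                true_intent: str) -> float:
    """Cross-entropy over dot-product similarities (assumed formulation)."""
    sims = {name: dot(cls_embedding, emb) for name, emb in intent_embeddings.items()}
    # Log-sum-exp with the maximum subtracted for numerical stability.
    max_sim = max(sims.values())
    log_norm = max_sim + math.log(sum(math.exp(s - max_sim) for s in sims.values()))
    return log_norm - sims[true_intent]


# Made-up 2-dimensional embeddings.
cls_embedding = [1.0, 0.0]
intent_embeddings = {"play_game": [2.0, 0.0], "greet": [0.0, 2.0]}

loss_correct = intent_loss(cls_embedding, intent_embeddings, "play_game")
loss_wrong = intent_loss(cls_embedding, intent_embeddings, "greet")
print(loss_correct < loss_wrong)
```

The loss is low when the class token embedding is most similar to the embedding of the ground-truth intent, and high otherwise.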

Mask Loss

The mask token is added to the architecture so that the model can also be trained as a language model. A language model predicts the most suitable token given the surrounding input tokens. During training, the model randomly masks some words, and the objective is to predict the original word that was masked. The diagram below shows how this works. The mask token is passed through the transformer and into an embedding layer. Concurrently, the masked token (the word pong in the figure) goes through the individual token pathway and also into an embedding layer. A similarity function is computed between these two embeddings. One of the objectives of the model is to minimise the mask loss. The lower the mask loss, the better the model is at predicting the masked token.
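The random masking step can be sketched as follows. The mask symbol, the masking probability of 0.15 and the function name are assumptions for this illustration, not values taken from the RASA implementation.

```python
import random
from typing import Dict, List, Tuple

MASK = "__MASK__"  # assumed placeholder symbol


def mask_tokens(tokens: List[str], mask_prob: float,
                rng: random.Random) -> Tuple[List[str], Dict[int, str]]:
    """Replace each token with MASK with probability mask_prob.

    Returns the masked sequence and a mapping from masked positions to the
    original tokens, which are the prediction targets for the mask loss.
    """
    masked = list(tokens)
    targets: Dict[int, str] = {}
    for i, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked[i] = MASK
            targets[i] = token
    return masked, targets


tokens = ["play", "ping", "pong"]
masked, targets = mask_tokens(tokens, mask_prob=0.15, rng=random.Random(0))
print(masked, targets)
```

During training, the model sees the masked sequence and is scored on how well it recovers each target token.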

FFNN Characteristics

Two special notes about the FFNNs in the architecture. Firstly, they are NOT fully connected. The FFNNs have a dropout rate of around 80% from the beginning, which makes them more lightweight. Secondly, the FFNNs share weights: all the FFNNs that come after the sparse features share one set of weights (W1), and all the FFNNs that come after merging the outputs of the two sub-pathways share another set of weights (W2).
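Both ideas can be sketched together: a layer whose connections are pruned once at creation, and a single instance of that layer reused for every token. The class, the ReLU activation and the random initialisation are assumptions for illustration and do not mirror RASA's code.

```python
import random
from typing import List


class SparseSharedLayer:
    """Feed-forward layer with a fixed sparse connection mask (illustrative)."""

    def __init__(self, n_in: int, n_out: int, keep_prob: float, rng: random.Random):
        # Decide once, at creation time, which connections exist at all.
        self.mask = [[1 if rng.random() < keep_prob else 0 for _ in range(n_in)]
                     for _ in range(n_out)]
        # Dropped connections get a weight of zero and stay that way here.
        self.weights = [[rng.uniform(-1.0, 1.0) * m for m in row]
                        for row in self.mask]

    def __call__(self, x: List[float]) -> List[float]:
        # Linear map followed by ReLU (activation choice is an assumption).
        return [max(0.0, sum(w * v for w, v in zip(row, x)))
                for row in self.weights]


rng = random.Random(0)
# keep_prob=0.2 corresponds to dropping around 80% of the connections.
shared = SparseSharedLayer(n_in=8, n_out=4, keep_prob=0.2, rng=rng)

# The same layer object, and so the same weights, is applied to every token.
token_vectors = [[1.0] * 8, [0.5] * 8, [1.0] * 8]
outputs = [shared(vec) for vec in token_vectors]
print(outputs[0] == outputs[2])
```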


Why do they use a mask token and train a language model again when they are already using pre-trained models? To allow the model to adapt to the domain of the dataset. Especially in a chatbot or social media context, there are more misspelled words, slang and command-style text, so training a language model again allows the model to capture this domain-specific language.

The architecture is designed to allow the model to learn a more general representation of our input sentences. During training, all the weights have to be optimised against three different losses (entity, intent and mask), so the model can’t just learn a representation that heavily minimises one of them. In addition, the architecture is designed in such a way that you can switch multiple components on or off. It is designed to handle both intent and entity classification, but if we just want the model to do intent classification, we can “switch off” the entity and mask losses and focus on optimising the intent loss during training. I thoroughly enjoyed learning about RASA’s DIET model, and the next step is to experiment with the RASA library.


