Natural Language Understanding (NLU) and Dialogue State Tracking (DST) are two important components that can heavily affect the performance of a task-oriented dialogue system.

Natural Language Understanding

NLU is usually the initial preprocessing step in the dialogue system, so its performance heavily affects the performance of every module downstream. NLU tackles three tasks:

  1. Domain classification

  2. Intent classification

  3. Slot tagging
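To make the three outputs concrete, here is a hypothetical example for a single utterance. The utterance, labels, and label sets are all illustrative; real systems use dataset-specific schemas, and the slot tags below follow the common BIO convention.

```python
# Hypothetical NLU output for one utterance (illustrative labels only).
utterance = "find me a cheap italian restaurant".split()

nlu_output = {
    "domain": "restaurant",       # task 1: domain classification
    "intent": "find_restaurant",  # task 2: intent classification
    # task 3: slot tagging, one BIO tag per token
    "slots": ["O", "O", "O", "B-price", "B-cuisine", "O"],
}

# Slot tagging produces exactly one label per input token.
assert len(nlu_output["slots"]) == len(utterance)
```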

The first two tasks are classification tasks, so the common evaluation metrics are accuracy and F1 score. Recent work has focused on neural models that perform multi-class classification on these two tasks, and results are strong. The most challenging task of the three is slot tagging. It is often treated as sequence labelling, where the classifier predicts a semantic class label for each token in the utterance. One common architecture for the NLU tasks is the biLSTM. The concatenation of the two hidden states (forward and backward) is fed into another layer to compute the output sequence. For domain and intent classification, the output is predicted once at the end of the input sequence, whereas for slot tagging an output is predicted at every token.
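The bidirectional architecture can be sketched in a few lines of numpy. This is a simplified illustration, not a trained model: it uses plain RNN cells in place of LSTM cells for brevity, all weights are random placeholders, and the dimensions are made up. It does show the key structural points: forward and backward hidden states are concatenated, slot labels are predicted per token, and the intent is predicted from the state at the end of the sequence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (all hypothetical): vocabulary of 10 token ids,
# 8-dim embeddings, 6-dim hidden state, 4 slot labels, 3 intents.
V, E, H, N_SLOTS, N_INTENTS = 10, 8, 6, 4, 3
emb = rng.normal(size=(V, E))
W_f, U_f = rng.normal(size=(E, H)), rng.normal(size=(H, H))  # forward RNN
W_b, U_b = rng.normal(size=(E, H)), rng.normal(size=(H, H))  # backward RNN
W_slot = rng.normal(size=(2 * H, N_SLOTS))     # per-token slot head
W_intent = rng.normal(size=(2 * H, N_INTENTS)) # utterance-level intent head

def birnn_nlu(token_ids):
    x = emb[token_ids]                         # (T, E) token embeddings
    T = len(token_ids)
    h_f, h_b = np.zeros((T, H)), np.zeros((T, H))
    h = np.zeros(H)
    for t in range(T):                         # forward direction
        h = np.tanh(x[t] @ W_f + h @ U_f)
        h_f[t] = h
    h = np.zeros(H)
    for t in reversed(range(T)):               # backward direction
        h = np.tanh(x[t] @ W_b + h @ U_b)
        h_b[t] = h
    states = np.concatenate([h_f, h_b], axis=1)  # (T, 2H) concatenated
    slot_logits = states @ W_slot                # one prediction per token
    intent_logits = states[-1] @ W_intent        # predicted at sequence end
    return slot_logits, intent_logits

slot_logits, intent_logits = birnn_nlu([1, 4, 7, 2])
```

With a 4-token input, `slot_logits` has one row of slot scores per token, while `intent_logits` is a single vector for the whole utterance.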

There are many extensions built on top of the biLSTM. Some architectures use information from previous utterances, treating the conversation history as one long sequence of words, since the current utterance alone is often ambiguous. Other architectures use memory networks to learn which parts of the utterance to attend to for slot tagging.

The three NLU tasks are often solved separately, but there are benefits to building a single architecture that performs multi-task learning across multiple domains. Another interesting direction is zero-shot learning, where slots from different domains are represented in a shared semantic space using embeddings.

Dialogue State Tracking

The dialogue state is important because it tells the system what information the user is looking for at the current turn and what action the system should take next. The dialogue state consists of:

  1. Goal constraint for every informable slot

  2. The subset of requested slots that the user asked the system for

  3. The current dialogue search method, which encodes how the user is trying to interact with the dialogue system
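The three components above can be illustrated with a toy state for a restaurant dialogue. The slot names, values, and search-method label here are hypothetical; real datasets define their own schemas.

```python
# Hypothetical dialogue state after the user says:
#   "I want a cheap italian place; what is the phone number?"
dialogue_state = {
    "goal_constraints": {        # one constraint per informable slot
        "price_range": "cheap",
        "cuisine": "italian",
        "area": None,            # not yet specified by the user
    },
    "requested_slots": {"phone_number"},  # slots the user asked the system for
    "search_method": "by_constraints",    # how the user is interacting
}
```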

In the past, DST models were hand-crafted by experts or learned from data, but here again, strong recent results from neural methods have led to many neural applications. One recent DST model is the Neural Belief Tracker. The model takes in three inputs:

  1. System utterance

  2. User utterance

  3. Candidate pairs

The first two inputs are mapped to internal vector representations using pretrained word embeddings. The third input is any slot-value pair tracked by the DST, and it is likewise mapped to an internal vector representation. The three embeddings then interact through two components: context modelling and semantic decoding. The goal is to capture context from the flow of the conversation (using all three inputs) and to detect whether the user has explicitly expressed an intent that matches the input slot-value pair (using the last two inputs). Finally, the outputs of context modelling and semantic decoding are fed into a softmax layer for the final prediction. This process is repeated for all candidate slot-value pairs.
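The per-candidate scoring loop can be sketched as follows. This is a rough structural sketch, not the actual Neural Belief Tracker: the `embed` function is a deterministic stand-in for pretrained embeddings, the weight shapes and combination functions are simplified placeholders, and the binary softmax decides whether the user expressed the given slot-value pair.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 16  # embedding dimension (placeholder)

def embed(text):
    # Stand-in for pretrained embeddings: a deterministic vector per
    # string. A real model would compose learned word vectors.
    seed = sum(ord(c) for c in text)
    return np.random.default_rng(seed).normal(size=D)

W_ctx = rng.normal(size=(3 * D, D))  # context modelling (all three inputs)
W_sem = rng.normal(size=(2 * D, D))  # semantic decoding (user utt + candidate)
w_out = rng.normal(size=(2 * D, 2))  # binary decision head

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def score_candidate(system_utt, user_utt, slot, value):
    sys_v, usr_v = embed(system_utt), embed(user_utt)
    cand_v = embed(slot + "=" + value)
    # Context modelling: all three inputs interact.
    ctx = np.tanh(np.concatenate([sys_v, usr_v, cand_v]) @ W_ctx)
    # Semantic decoding: does the user utterance match the candidate?
    sem = np.tanh(np.concatenate([usr_v, cand_v]) @ W_sem)
    probs = softmax(np.concatenate([ctx, sem]) @ w_out)
    return probs[1]  # probability the user expressed this slot-value pair

# The process is repeated for every candidate pair tracked by the DST:
candidates = [("price_range", "cheap"), ("price_range", "expensive")]
scores = {v: score_candidate("what price range?", "something cheap please", s, v)
          for s, v in candidates}
```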

Dialogue State Tracking Challenge (DSTC)

This is a series of challenges that provide common test sets and evaluation metrics for DST. The DSTC datasets cover both human-computer and human-human conversations, span multiple domains, and include multiple languages.

Ryan

Data Scientist