We have now covered both KB-QA and task-completion dialogue systems. The last type of chatbot needed to complete the picture is the social bot, or conversation model. Social bots are responsible for the friendliness and fluency of a conversation, and thus for much of the user experience. These bots are usually trained entirely from data using an end-to-end (E2E) sequence-to-sequence (seq2seq) architecture. End-to-end training has been successful for social bots because (a) they often do not require external API calls to a knowledge base (unlike task-completion bots), and (b) seq2seq models scale easily to large open-domain datasets, which cover the many kinds of conversation users want to have. Recent work on social bots has increasingly focused on improving the recommendations they make to users.
End-to-End Conversation Models
One of the earliest works on training an E2E conversation model from data used statistical machine translation (SMT) techniques. The idea is that, just as machine translation maps a source sentence to a target sentence, response generation can be framed as "translating" the previous turn into the current response. A different line of E2E conversation models uses information-retrieval methods rather than machine translation. Both approaches are limited in their ability to generate contextually appropriate responses, because they are trained on isolated (query, response) pairs. This motivated the use of RNN architectures, specifically the LSTM, which can capture longer context and use it to generate a better current response.
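To make the seq2seq framing concrete, here is a minimal numpy sketch of the encoding half of such a model: an LSTM reads the embedded tokens of the previous turn and compresses them into a hidden state that a decoder LSTM would then condition on. All sizes, weights, and inputs are toy assumptions for illustration, not any particular published model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step; gates are computed jointly from [x; h_prev]."""
    z = W @ np.concatenate([x, h_prev]) + b
    H = h_prev.shape[0]
    i = sigmoid(z[:H])          # input gate
    f = sigmoid(z[H:2*H])       # forget gate
    o = sigmoid(z[2*H:3*H])     # output gate
    g = np.tanh(z[3*H:])        # candidate cell update
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
E, H = 8, 16                                   # toy embedding / hidden sizes
W = rng.normal(scale=0.1, size=(4 * H, E + H))  # all four gates stacked
b = np.zeros(4 * H)

# Encode the previous turn (5 embedded tokens, random stand-ins here).
h, c = np.zeros(H), np.zeros(H)
for tok in rng.normal(size=(5, E)):
    h, c = lstm_step(tok, h, c, W, b)
# `h` now summarizes the previous turn; a decoder LSTM initialized
# from (h, c) would generate the current response token by token.
```

In a trained model the same step function runs in the decoder, with each predicted word fed back in as the next input.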
The Neural Landscape
As mentioned, it all began with the LSTM, and many LSTM extensions have since been introduced. While the LSTM has shown the ability to capture long textual contexts, it still suffers from the long-range dependency problem: dialogue histories can be very long, and the model fails to capture longer-term context. Hierarchical models were introduced to address this limitation. Below is a figure of the popular hierarchical recurrent encoder-decoder (HRED) model, originally proposed for query suggestion and subsequently applied to response generation. HRED consists of a two-level hierarchy: one at the word level and one at the dialogue-turn level. This mirrors the structure of a conversation, which is a sequence of turns, each of which is a sequence of tokens. The hidden state of the current dialogue turn is conditioned on the hidden state of the previous turn, allowing information to flow over a longer context.
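The two-level hierarchy can be sketched in a few lines. For brevity this uses plain tanh RNNs at both levels rather than the GRU/LSTM cells of the actual HRED paper; the sizes and random inputs are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
E, H = 8, 16  # toy embedding / hidden sizes (assumptions)

def rnn_encode(xs, W, U, b):
    """Plain tanh RNN encoder; returns the final hidden state."""
    h = np.zeros(U.shape[0])
    for x in xs:
        h = np.tanh(W @ x + U @ h + b)
    return h

# Word-level parameters (shared across turns) and turn-level parameters.
Ww, Uw, bw = (rng.normal(scale=0.1, size=(H, E)),
              rng.normal(scale=0.1, size=(H, H)), np.zeros(H))
Wt, Ut, bt = (rng.normal(scale=0.1, size=(H, H)),
              rng.normal(scale=0.1, size=(H, H)), np.zeros(H))

# A toy dialogue: 3 turns of 4, 6, and 3 embedded tokens.
dialogue = [rng.normal(size=(n, E)) for n in (4, 6, 3)]

# Level 1: encode each turn's tokens into a fixed-size turn vector.
turn_vecs = [rnn_encode(turn, Ww, Uw, bw) for turn in dialogue]
# Level 2: run a turn-level RNN over the turn vectors, so the state
# for the current turn is conditioned on all previous turns.
context = rnn_encode(turn_vecs, Wt, Ut, bt)
# `context` would condition the decoder that generates the next turn.
```

The key point is that the turn-level RNN takes one step per turn rather than one step per word, so information travels over the dialogue history in far fewer steps.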
Encoding the entire source sequence into a single fixed-length vector is a severe limitation for long source sequences, since it requires the model to remember details from early time steps throughout decoding. Attention-based models alleviate this by learning a probability distribution over source positions, letting the model search for and "pay attention" to different parts of the source sentence when predicting each target word. Attention has been very effective across many NLP tasks, but is somewhat less effective in E2E conversation models. We believe this is because in dialogue, parts of the source sequence may not map to anything in the target, and vice versa.
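The mechanism itself is compact: score each encoder state against the current decoder state, normalize the scores into a probability distribution, and take the weighted sum. Below is a minimal dot-product attention sketch with toy random states standing in for a trained model.

```python
import numpy as np

def attention(query, keys, values):
    """Dot-product attention: a distribution over source positions,
    then a weighted sum of their representations."""
    scores = keys @ query                  # one score per source position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # softmax -> probability distribution
    return weights @ values, weights

rng = np.random.default_rng(0)
H, T = 16, 5                               # hidden size, source length (toy)
enc_states = rng.normal(size=(T, H))       # encoder states for the source turn
dec_state = rng.normal(size=H)             # current decoder state

ctx, weights = attention(dec_state, enc_states, enc_states)
# `ctx` is fed into next-word prediction alongside `dec_state`;
# `weights` shows which source positions the model attended to.
```

Because the weights are recomputed at every decoding step, the model can attend to different source words for each target word instead of relying on one fixed summary vector.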
There are many variants of the seq2seq model that allow it to copy words between the conversational context and the response. This ability matters because responses often reuse words from the source sequence, and copying is also useful for handling rare words the model has not seen during training. This is where pointer-network-style models come in: when predicting each target word, the model either generates from a fixed-size vocabulary or copies from the source sequence using the attention mechanism.
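The generate-or-copy decision can be sketched as a mixture of two distributions: a softmax over the fixed vocabulary and the attention weights scattered onto the source tokens' ids. The mixing weight, logits, and token ids below are toy assumptions; in a real model the mixing weight is predicted from the decoder state at every step.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
V, T = 10, 4                        # toy vocabulary size and source length
src_ids = np.array([2, 7, 7, 10])   # source token ids; id 10 is out-of-vocabulary
ext_V = V + 1                       # vocabulary extended with the OOV source word

p_gen = 0.7                                  # mixing weight (normally predicted)
gen_dist = softmax(rng.normal(size=V))       # generate from the fixed vocabulary
copy_weights = softmax(rng.normal(size=T))   # attention over source positions

final = np.zeros(ext_V)
final[:V] += p_gen * gen_dist                # generation path
# Copy path: scatter attention mass onto each source token's id;
# np.add.at correctly accumulates when an id repeats (7 appears twice).
np.add.at(final, src_ids, (1 - p_gen) * copy_weights)
# `final` is a valid distribution, and the OOV word (id 10) can now
# receive probability even though it is outside the fixed vocabulary.
```

This is why copy mechanisms handle rare words: any token present in the source can be emitted via the copy path regardless of whether it exists in the generation vocabulary.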