In this last part of the task-oriented dialogue series, we will cover the Natural Language Generation (NLG) component and end-to-end (E2E) training.

Natural Language Generation

NLG is important as it is usually the final step that converts the agent's selected response into natural language to be sent back to the user. The many approaches investigated in this area fall into three categories: rule-based, corpus-based, and neural methods. Rule-based methods require domain experts to design a set of rules for selecting proper candidates to generate sentences. This is very costly and difficult to adapt to new domains and/or different user groups.

This has led to corpus-based and neural methods. Corpus-based methods aim to train and optimise a generation module from corpora using supervised learning, while neural methods use deep learning for language generation. One example of a neural model is the Semantically Controlled LSTM (SC-LSTM), sketched below. It has two components: a common LSTM cell and a sentence planning cell that controls semantics during language generation. On top of the traditional LSTM gates, the SC-LSTM has a reading gate, which is used to compute a sequence of dialogue act vectors. This sequence controls what information has to be retained for future steps, which in turn guides utterance generation to ensure that it covers the intended meaning.
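
To make the mechanism concrete, here is a minimal sketch of an SC-LSTM cell in PyTorch. The gate layout follows the description above, but the layer names, dimensions, and the scalar alpha that weights the reading gate are illustrative assumptions, not the original implementation.

```python
import torch
import torch.nn as nn

class SCLSTMCell(nn.Module):
    """Minimal sketch of a Semantically Controlled LSTM cell.

    On top of the standard input/forget/output gates, a reading gate r_t
    decides how much of the dialogue-act vector d_t is "consumed" at each
    step, and the remaining d_t is injected into the memory cell so the
    generated utterance covers the intended meaning.
    """

    def __init__(self, input_size, hidden_size, da_size, alpha=0.5):
        super().__init__()
        # Standard LSTM gates (input, forget, output, candidate) in one matrix.
        self.gates = nn.Linear(input_size + hidden_size, 4 * hidden_size)
        # Reading gate computed from the current word and previous hidden state.
        self.read_gate_w = nn.Linear(input_size, da_size)
        self.read_gate_h = nn.Linear(hidden_size, da_size, bias=False)
        # Projects the (partially consumed) dialogue-act vector into the cell.
        self.da_to_cell = nn.Linear(da_size, hidden_size, bias=False)
        self.alpha = alpha  # how strongly the hidden state influences reading

    def forward(self, x_t, h_prev, c_prev, d_prev):
        combined = torch.cat([x_t, h_prev], dim=-1)
        i, f, o, g = self.gates(combined).chunk(4, dim=-1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)

        # Reading gate: how much of each dialogue-act slot is still needed.
        r = torch.sigmoid(self.read_gate_w(x_t) + self.alpha * self.read_gate_h(h_prev))
        d_t = r * d_prev  # gradually "consume" the dialogue act

        # Cell update with the extra semantic term driven by d_t.
        c_t = f * c_prev + i * g + torch.tanh(self.da_to_cell(d_t))
        h_t = o * torch.tanh(c_t)
        return h_t, c_t, d_t
```

At each step the dialogue-act vector d_t shrinks as slots get mentioned, so by the end of the utterance the model has "spent" the semantics it was asked to express.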

There are a few improvements that can be made to the SC-LSTM. One possible way is to train a second SC-LSTM on the reversed input sentences (similar to a bidirectional LSTM) and use the outputs of the two SC-LSTMs to rerank the generated utterances, as in the sketch below.
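
A minimal sketch of that reranking step, assuming each SC-LSTM assigns a log-probability score to every candidate utterance; the candidate strings and scores here are made-up placeholder values.

```python
import torch

def rerank(candidates, forward_scores, backward_scores):
    # forward_scores / backward_scores: (n_candidates,) log-probabilities
    # from the forward SC-LSTM and the SC-LSTM trained on reversed sentences.
    combined = forward_scores + backward_scores
    order = torch.argsort(combined, descending=True)
    return [candidates[i] for i in order.tolist()]

candidates = ["there is a cheap italian restaurant in the centre",
              "there is a restaurant"]
forward_scores = torch.tensor([-4.2, -3.9])   # example values only
backward_scores = torch.tensor([-3.8, -6.1])
print(rerank(candidates, forward_scores, backward_scores))
```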

End-to-end Training

Components in dialogue systems are usually optimised individually. However, this can lead to more complex systems that are harder to evaluate, as improvements in individual components do not always translate into improvements in the whole dialogue system. With neural models, we can jointly optimise multiple components and do end-to-end training of the whole dialogue system. There are two general approaches to building an end-to-end system:

  1. Supervised learning

  2. Reinforcement learning

Supervised Learning

There are many variants of supervised learning neural methods. One approach treats the dialogue system as a mapping problem between dialogue histories and system responses. It utilises word embeddings and memory networks and outperformed baseline models on various dialogue tasks. A similar model is Mem2Seq, which uses pointer networks to incorporate information from knowledge bases.
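
As a rough illustration of the mapping view, below is a minimal single-hop memory network that scores candidate system responses against the dialogue history. The bag-of-words encoder, single hop, and tensor shapes are simplifying assumptions, not the published models.

```python
import torch
import torch.nn as nn

class DialogueMemoryNetwork(nn.Module):
    """Sketch: map a dialogue history plus current user utterance to the
    best-scoring candidate system response via attention over past turns."""

    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        # Bag-of-words utterance encoder (mean of word embeddings).
        self.embed = nn.EmbeddingBag(vocab_size, embed_dim)

    def forward(self, history, query, candidates):
        # history: (n_turns, max_len), query: (1, max_len),
        # candidates: (n_candidates, max_len), all integer token ids.
        memory = self.embed(history)                   # (n_turns, embed_dim)
        q = self.embed(query)                          # (1, embed_dim)
        # Attention over past turns given the current utterance.
        attn = torch.softmax(memory @ q.t(), dim=0)    # (n_turns, 1)
        context = (attn * memory).sum(0, keepdim=True) + q
        # Score each candidate system response against the context.
        cand = self.embed(candidates)                  # (n_candidates, embed_dim)
        scores = cand @ context.t()                    # (n_candidates, 1)
        return scores.squeeze(-1)                      # argmax gives the response
```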

Another variant is the end-to-end key-value retrieval network, which uses an attention mechanism to learn to retrieve relevant information from knowledge bases.
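
A minimal sketch of that idea: the decoder state attends over key embeddings of knowledge-base entries, and the attended values are fed back into generation. The projection layers, shapes, and names are illustrative assumptions rather than the original implementation.

```python
import torch
import torch.nn as nn

class KeyValueKBAttention(nn.Module):
    """Sketch of key-value attention over a knowledge base."""

    def __init__(self, hidden_size, key_size, value_size):
        super().__init__()
        self.query_proj = nn.Linear(hidden_size, key_size)
        self.value_proj = nn.Linear(value_size, hidden_size)

    def forward(self, decoder_state, kb_keys, kb_values):
        # decoder_state: (batch, hidden), kb_keys: (batch, n_entries, key_size),
        # kb_values: (batch, n_entries, value_size)
        query = self.query_proj(decoder_state).unsqueeze(1)        # (batch, 1, key)
        scores = (query * kb_keys).sum(-1)                         # (batch, n_entries)
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)      # (batch, n_entries, 1)
        kb_context = (weights * self.value_proj(kb_values)).sum(1) # (batch, hidden)
        # The context is typically combined with the decoder state to bias
        # generation towards KB entries relevant to the current turn.
        return kb_context, weights.squeeze(-1)
```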

Reinforcement Learning

Although supervised learning approaches produce good results, they do require a large volume of training data, which can be very costly to obtain for dialogue systems. This has led to reinforcement learning techniques being explored to train end-to-end systems. There are two LSTM-based reinforcement learning variants: DQN and Hybrid Code Networks (HCN). The DQN variant is able to jointly optimise the policy, language understanding, and state tracking by learning to compress the user utterance into an internal dialogue state. HCN utilises an LSTM to track the state and jointly optimises the state tracker and the policy. The model can also incorporate business rules and prior knowledge.
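
Below is a minimal sketch of an HCN-style turn: an LSTM updates the dialogue state from turn features, and a masked softmax over action templates encodes the business rules. The feature and action dimensions, and the mask convention, are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HybridCodeNetwork(nn.Module):
    """Sketch of an HCN-style policy: LSTM state tracker plus a masked
    softmax over hand-written action templates."""

    def __init__(self, feature_size, hidden_size, num_actions):
        super().__init__()
        self.state_tracker = nn.LSTMCell(feature_size, hidden_size)
        self.policy = nn.Linear(hidden_size, num_actions)

    def forward(self, turn_features, state, action_mask):
        # turn_features: (batch, feature_size) - utterance features, entity flags, etc.
        # action_mask:   (batch, num_actions), 1 where an action is allowed
        #                by the business rules, 0 where it is forbidden.
        h, c = self.state_tracker(turn_features, state)
        logits = self.policy(h)
        logits = logits.masked_fill(action_mask == 0, float("-inf"))
        action_probs = torch.softmax(logits, dim=-1)
        return action_probs, (h, c)
```

The action mask is where prior knowledge enters: forbidden actions are simply removed from the softmax, so the learned policy never has to rediscover hard constraints.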

In addition, it is possible to combine supervised learning and reinforcement learning in one end-to-end system. You can first use supervised learning on human-to-human dialogues to pretrain the policy, then use an imitation learning algorithm to fine-tune it. The last step is to use reinforcement learning to continue policy learning with user feedback.

Other Aspects

Most of the techniques discussed so far focus on slot-filling problems. However, there are many dialogue tasks beyond slot-filling, such as information-seeking and navigation dialogues. An interesting problem to investigate is multi-modality, where we develop a better context vector by using different types of input data. Another aspect is mixed-initiative dialogues, where the conversation is less unidirectional (agent helping user) and more multi-directional, for example negotiation dialogues or dialogues with multiple users.

In all the work we have discussed so far, a dialogue system is optimised using absolute judgements in the form of training labels or reward functions. These are often expensive to obtain as they require expert labels and real-life simulations. A potential alternative is to use weaker learning signals in the form of preferential input. Rather than absolute signals, we can have preferential input that indicates which of two dialogues is better. Such preferential feedback is easier and cheaper to obtain.
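
One simple way to exploit such preferences is to fit a reward model so that the preferred dialogue of each annotated pair scores higher. The sketch below uses a Bradley-Terry-style logistic loss; the assumption that each dialogue can be featurised into a fixed-size vector is mine, made only for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Scores a dialogue (represented as a fixed-size feature vector)."""

    def __init__(self, feature_size):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(feature_size, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, dialogue_features):
        return self.scorer(dialogue_features).squeeze(-1)

def preference_loss(model, preferred, rejected):
    # P(preferred > rejected) = sigmoid(score_pref - score_rej);
    # maximise its log-likelihood over the annotated pairs.
    return -F.logsigmoid(model(preferred) - model(rejected)).mean()
```

The learned scores can then stand in for a hand-crafted reward function when continuing policy training with reinforcement learning.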
