Today’s post covers the main challenges in dialogue systems.
One common challenge in conversational AI is that generated responses are often short, bland, and deflective. This problem is believed to stem from a training objective that is asymmetrical between the target response and the dialogue history: the model is trained to maximise the probability of the response given the history, so it learns to prefer responses that are highly probable regardless of the context. For example, “I don’t know” is a commonly generated response because it is a plausible reply to almost any question.
Maximum Mutual Information (MMI)
One possible solution is to make the objective more symmetrical between the target response and the dialogue history, so that the model is no longer biased towards bland and deflective responses. This is the idea behind maximum mutual information (MMI). Optimising MMI at training time is very challenging, so it has instead been applied at inference time. The MMI objective trades off the probability of the source given the target against the probability of the target given the source, which rewards responses that are both appropriate and not bland. However, because MMI is only used at inference time, the approach first generates a list of the N best responses and then rescores them with MMI. Those N best responses are themselves often bland and deflective, which limits how much MMI can help.
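To make the rescoring step concrete, here is a minimal sketch of MMI-style reranking over an N-best list. It assumes the bidirectional form of the score, (1 - lam) * log p(T|S) + lam * log p(S|T). The function name `mmi_rerank`, the weight `lam`, and all of the probabilities are made up for illustration; in practice both probabilities would come from trained forward and backward models.

```python
import math

def mmi_rerank(candidates, lam=0.5):
    """Rerank N-best candidates with a bidirectional MMI-style score.

    Each candidate is (text, log p(T|S), log p(S|T)).
    score = (1 - lam) * log p(T|S) + lam * log p(S|T)
    """
    scored = [
        (text, (1.0 - lam) * fwd + lam * bwd)
        for text, fwd, bwd in candidates
    ]
    # Highest score first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical N-best list for the source "What did you do today?".
# The bland replies are likely given the source, but the source is
# unlikely given them, because they could follow almost any question.
n_best = [
    ("I don't know.", math.log(0.30), math.log(0.001)),
    ("I went hiking with my sister.", math.log(0.05), math.log(0.20)),
    ("Not much.", math.log(0.20), math.log(0.005)),
]

forward_only = mmi_rerank(n_best, lam=0.0)  # plain likelihood ranking
reranked = mmi_rerank(n_best, lam=0.5)      # MMI-style ranking

print(forward_only[0][0])
print(reranked[0][0])
```

With `lam=0.0` the bland reply ranks first, while with `lam=0.5` the specific reply does. Note that the reranker can only choose among the candidates it is given, which is exactly the limitation described above.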
Another option is adversarial training with GANs, which pits a Generator and a Discriminator against each other. The objective of each is to make the other less effective. The Generator is trained to generate responses, whereas the Discriminator is trained to identify whether a given response was produced by a human or by the Generator. If the Generator produces many bland and deflective responses that are unlike human responses, the Discriminator can distinguish them easily. Since the Generator’s goal is to fool the Discriminator, training slowly steers it away from responses that are easy to spot, so the generated responses move closer to human responses over time.
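The two opposing objectives can be written down in a few lines. This is only a toy sketch: it assumes the Discriminator outputs a probability that a response is human-written, uses standard binary cross-entropy for the Discriminator, and treats that same probability as the Generator's reward. The function names and the example probabilities are hypothetical, and no actual model is trained here.

```python
import math

def discriminator_loss(p_human_on_real, p_human_on_fake):
    """Binary cross-entropy for one human and one generated response.

    Both arguments are the Discriminator's estimated probability that
    the response was written by a human.
    """
    return -(math.log(p_human_on_real) + math.log(1.0 - p_human_on_fake))

def generator_reward(p_human_on_fake):
    """Reward for the Generator: how strongly the Discriminator was fooled."""
    return p_human_on_fake

# Hypothetical Discriminator outputs for two generated responses.
bland = generator_reward(0.05)     # "I don't know." is easy to spot
specific = generator_reward(0.60)  # a specific reply is harder to spot

print(bland, specific)
print(discriminator_loss(0.9, 0.05), discriminator_loss(0.9, 0.60))
```

The bland response earns the Generator a low reward and costs the Discriminator very little, whereas the specific response does the opposite. That tension is what pushes the Generator away from bland output.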
The models discussed so far are generation-based. An alternative is a retrieval-based model, in which a pool of responses is constructed in advance from pre-existing human responses. Although this can yield good results, it makes the model less flexible, because the set of possible responses in the retrieval system is fixed. Even so, retrieval remains a popular choice for commercial systems, provided you have a strong pool of responses.
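As a rough illustration of the retrieval idea, the sketch below picks a response from a fixed pool by comparing the user's message with past contexts. It assumes a simple bag-of-words cosine similarity as the matching function, which is a stand-in for whatever learned matcher a real system would use. The pool contents and the names `POOL` and `retrieve` are invented for this example.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical pool of (past context, human response) pairs.
POOL = [
    ("what time do you open", "We open at 9am every day."),
    ("how do i reset my password", "Use the 'Forgot password' link on the login page."),
    ("where is my order", "You can track your order from your account page."),
]

def retrieve(query):
    """Return the pooled response whose context best matches the query."""
    q = bag_of_words(query)
    _, best_response = max(POOL, key=lambda pair: cosine(q, bag_of_words(pair[0])))
    return best_response

print(retrieve("How can I reset my password?"))
```

Every answer is guaranteed to be something a human actually wrote, but the system can never say anything that is not already in `POOL`, which is the flexibility trade-off mentioned above.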
It is a common problem that seq2seq methods produce inconsistent and contradictory responses. This is believed to be caused by the training data itself: conversational data contains one-to-many instances, where one question is mapped to many different answers. For example, “How old are you?” is answered differently by different speakers. To counter this, Li et al. proposed a persona-based response generation system that utilises speaker embeddings on top of word embeddings. The speaker embeddings behave similarly to word embeddings, except that they capture a latent space of speakers in which speakers with similar speaking styles or interests are mapped close together. The architecture of the system is shown below. The system behaves like a normal LSTM, with the addition of the speaker embedding as an input to generate a more personalised response.
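In code, the core of the speaker-embedding idea might look something like the sketch below. It is not the authors' implementation: it assumes the speaker vector is simply concatenated with the word vector to form the input of each decoder step, it uses random vectors in place of learned ones, and the vocabulary, speaker names, and dimensions are all made up. The LSTM itself is left out.

```python
import random

random.seed(0)
WORD_DIM, SPEAKER_DIM = 4, 3

def random_vector(dim):
    """Stand-in for a learned embedding vector."""
    return [random.uniform(-0.1, 0.1) for _ in range(dim)]

# Hypothetical lookup tables; in a real model both would be learned jointly.
word_embeddings = {w: random_vector(WORD_DIM) for w in ["i", "am", "twenty", "forty"]}
speaker_embeddings = {s: random_vector(SPEAKER_DIM) for s in ["alice", "bob"]}

def decoder_input(word, speaker):
    """Input for one decoder step: word embedding followed by speaker embedding."""
    return word_embeddings[word] + speaker_embeddings[speaker]

alice_step = decoder_input("am", "alice")
bob_step = decoder_input("am", "bob")

print(len(alice_step))                    # WORD_DIM + SPEAKER_DIM
print(alice_step[:WORD_DIM] == bob_step[:WORD_DIM])
```

The word part of the input is identical for both speakers, and only the speaker part differs. That extra signal is what lets the decoder continue “I am ...” differently for each speaker.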
You can also add personalised information to a retrieval-based system, where the system is essentially a binary classifier over a large database of responses. The classifier is trained with the traditional loss function plus an additional term for the probability of the response given the author, so that it selects responses that are more relevant to a particular author.
Word or phrase repetition is a common problem in generation tasks. Dialogue generation is a much more challenging task in this respect, as a given word or phrase in the source might be mapped to zero or multiple words in the target. Although the attention model has been successful in alleviating repetition errors in machine translation, it often fails to do so for dialogue generation. One adjustment is to add a self-attention mechanism to the decoder, which helps mitigate word repetition and improves the generation of longer, more coherent responses.
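For intuition, here is a stripped-down sketch of decoder self-attention, in which each generated position attends to itself and to everything generated before it. It is a simplification under several assumptions: there are no learned query, key, or value projections, there is a single head, and the decoder states are hand-picked toy vectors. The function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(states):
    """Scaled dot-product self-attention with a causal mask.

    Position t only sees positions 0..t, as in a decoder.
    Returns the context vectors and the attention weights per position.
    """
    dim = len(states[0])
    outputs, all_weights = [], []
    for t, query in enumerate(states):
        visible = states[: t + 1]
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in visible
        ]
        weights = softmax(scores)
        context = [
            sum(w * v[i] for w, v in zip(weights, visible))
            for i in range(dim)
        ]
        outputs.append(context)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical decoder states for three generated tokens; the third
# state is identical to the first, as if the decoder were about to repeat itself.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = causal_self_attention(states)

print(weights[2])
```

At the third position, most of the attention weight falls on the first and third states, which are the same. This is the kind of signal that gives the decoder direct access to what it has already produced.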
Most of the challenges above are still open and require further investigation. However, a much bigger challenge is response appropriateness. Most E2E systems can produce good responses but struggle to generate names or facts that connect to the real world, due to a lack of grounding. For example, given “What is the weather forecast for tomorrow?”, the system has no problem generating “sunny” or “rainy”, but how appropriate are those responses for your area? Overall, responses often look correct, but their semantic content is frequently inappropriate.