In today’s post, we will finish covering social bots by reviewing grounded conversation models, common datasets, evaluation metrics, and open benchmarks.

Grounded Conversation Models

As briefly mentioned in the previous post, response appropriateness is one of the bigger challenges in conversation models because they lack grounding in the real world. Several methods have been proposed to ground these systems, such as using the style or characteristics of the speaker, textual knowledge sources, the user’s or agent’s visual environment, and the user’s emotion. All of these methods add extra input about the user or the environment to the context vector. The figure below shows an example of a grounded model.

The main difference between a grounded model and a standard seq2seq model is the additional facts encoder. Its role is to infuse relevant factual information based on the conversation history: facts are retrieved from a large collection of world facts using keywords selected from the conversation history. This approach has two main benefits for grounding conversation models. First, it takes the user’s environment as additional input, which allows the system to generate different responses to the same question as the environment changes. Second, it is far more sample efficient: for a traditional seq2seq model to produce the same responses as a grounded one, every relevant entity would have to appear in the training data, which is unrealistic. Grounded conversation models don’t have that problem and can draw on facts that are never mentioned in the training data.
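The retrieval step can be sketched in a few lines. This is a minimal illustration, assuming a toy in-memory fact store and crude keyword-overlap scoring; the names and scoring scheme are illustrative, not the architecture of any specific paper.

```python
# Minimal sketch of keyword-based fact retrieval for a grounded model.
# The fact store, keyword extraction, and scoring are illustrative toys.

STOPWORDS = {"the", "a", "an", "is", "in", "at", "of", "and", "what", "how"}

FACT_STORE = [
    "The Eiffel Tower is 330 metres tall.",
    "The Louvre is the world's largest art museum.",
    "Mount Fuji is the highest mountain in Japan.",
]

def extract_keywords(history):
    """Pick content words from the conversation history."""
    words = " ".join(history).lower().replace("?", "").split()
    return {w for w in words if w not in STOPWORDS}

def retrieve_facts(history, k=1):
    """Score each fact by keyword overlap with the history, return top-k."""
    keywords = extract_keywords(history)
    scored = [(len(keywords & set(f.lower().rstrip(".").split())), f)
              for f in FACT_STORE]
    scored.sort(key=lambda p: p[0], reverse=True)
    return [f for score, f in scored[:k] if score > 0]

history = ["How tall is the Eiffel Tower?"]
print(retrieve_facts(history))  # the Eiffel Tower fact scores highest
```

In a real system the retrieved facts would be encoded and fed to the decoder alongside the conversation context, rather than printed.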

Beyond Supervised Learning

Supervised learning faces two main challenges. First, human-human conversation differs substantially from human-computer conversation, which makes it difficult to acquire relevant training data and subsequently to optimise conversation models towards an objective. Second, supervised learning tends to optimise for short-term rewards (which is why its responses are often bland) and fails to promote long-term user engagement. Reinforcement learning has been explored to address these limitations; the challenge there is defining a reliable reward function, since chitchat often has no explicitly specified user goal. With reinforcement learning, models are trained against a user simulator that mimics human behaviour, as shown below. In this figure, the user simulator is represented by a seq2seq model. The objective is to train an agent that maximises the expected total reward over the dialogues.

In order to create an effective reward function, Li et al. (2016) used a combination of three reward functions:

  1. Ease of answering reward. Penalises agents when turns are likely to lead to dull responses such as “I don’t know”

  2. Information flow reward. Ensures agents don’t have consecutive turns that are similar to each other

  3. Meaningfulness reward. Counters the previous two rewards, which encourage a constant flow of new information and topics, by ensuring that consecutive turns in a dialogue remain related to each other
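The three rewards above can be combined into a single weighted score. The sketch below is a toy illustration only: in Li et al. (2016) each component is scored with trained seq2seq models, whereas here each is replaced by a crude word-overlap proxy, and the weights are made up for illustration.

```python
# Toy sketch of a three-part dialogue reward in the spirit of Li et al. (2016).
# Each component is a crude word-overlap proxy, not the paper's actual scoring.

DULL_RESPONSES = ["i don't know", "i have no idea"]

def overlap(a, b):
    """Jaccard word overlap between two utterances."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def ease_of_answering(response):
    # Penalise responses close to a known dull response.
    return -max(overlap(response, d) for d in DULL_RESPONSES)

def information_flow(prev_agent_turn, response):
    # Penalise repeating the agent's own previous turn.
    return -overlap(prev_agent_turn, response)

def meaningfulness(user_turn, response):
    # Reward staying related to the user's last turn.
    return overlap(user_turn, response)

def total_reward(user_turn, prev_agent_turn, response,
                 weights=(0.25, 0.25, 0.5)):
    w1, w2, w3 = weights  # illustrative weights, not from the paper
    return (w1 * ease_of_answering(response)
            + w2 * information_flow(prev_agent_turn, response)
            + w3 * meaningfulness(user_turn, response))

# An on-topic reply scores higher than a dull one.
on_topic = total_reward("tell me about jazz", "i like music",
                        "jazz started in new orleans")
dull = total_reward("tell me about jazz", "i like music", "i don't know")
```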


Common Datasets

Although there is a large volume of social media data, it isn’t redistributable due to strict legal policies on the different social media platforms. Most papers tend to create their own conversational data for training and testing, and evaluate their systems against baseline systems on those datasets. Some standardisation efforts have been made on datasets, which include:

  1. Twitter. Dataset of tweets; the Twitter API provides metadata that enables the construction of conversation histories

  2. Reddit. Dataset of Reddit dialogue turns that can be organised by topic, and responses are not limited in length

  3. OpenSubtitles. Dataset of movie and TV subtitles

  4. Ubuntu. Dataset of Ubuntu technical support dialogues, making it more goal-oriented than the others

  5. Persona-Chat. Dataset of conversations in which speakers exhibit distinct personas
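For the Twitter case, the reply metadata is what makes conversation histories recoverable. The sketch below shows the idea with simplified tweet records; the field names are stand-ins for the fields the Twitter API exposes, not its exact schema.

```python
# Sketch: rebuilding a conversation history from reply metadata.
# The records are simplified stand-ins for API fields (id, reply target, text).

tweets = {
    1: {"in_reply_to": None, "text": "Anyone tried the new espresso bar?"},
    2: {"in_reply_to": 1, "text": "Yes, the flat white is great."},
    3: {"in_reply_to": 2, "text": "Agreed, going back tomorrow."},
}

def conversation_history(tweet_id, tweets):
    """Walk the reply chain back to the root and return texts in order."""
    chain = []
    current = tweet_id
    while current is not None:
        chain.append(tweets[current]["text"])
        current = tweets[current]["in_reply_to"]
    return list(reversed(chain))

print(conversation_history(3, tweets))
```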

Evaluation Metrics

Evaluation is an ongoing research area in E2E dialogue, and it is extremely important in guiding research progress. As mentioned previously, human evaluation is valuable but expensive, so researchers should strike a balance between human evaluation and automatic metrics. The common evaluation metrics for E2E dialogue are borrowed from other language generation tasks:

  1. BLEU

  2. ROUGE

  3. METEOR. Improves on BLEU by capturing synonyms and paraphrases between the system output and the human reference

  4. deltaBLEU. Extension of BLEU that incorporates numerical ratings associated with conversation responses
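To make the BLEU family concrete, the core quantity is modified n-gram precision. The toy below computes a single clipped n-gram precision; real BLEU combines precisions for n = 1..4 with a brevity penalty, which is omitted here.

```python
# Minimal sketch of clipped (modified) n-gram precision, the core of BLEU.
# Real BLEU combines n = 1..4 precisions with a brevity penalty.

from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams found in the reference, with counts
    clipped so repeating a word cannot inflate the score."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0

print(clipped_precision("the cat sat on the mat", "the cat is on the mat"))
```

The clipping is what stops a degenerate candidate like "the the the" from scoring perfectly against any reference containing "the".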

However, previous work has shown that these automatic metrics are unreliable and inaccurate for evaluating E2E systems, so much existing work focuses on developing more reliable metrics.

Open Benchmarks

A few open benchmarks have been developed for E2E conversational AI to encourage research progress:

  1. Dialog System Technology Challenges (DSTC). Launched in 2017; requires systems to be fully data-driven, using Twitter data. A few of the tasks focus on grounded conversation scenarios

  2. ConvAI Competition. A NIPS competition aimed at training and evaluating models for non-goal-oriented dialogue systems

  3. NTCIR STC. Focuses on short-form conversations

  4. Alexa Prize. Amazon’s open competition for building social bots. Allows systems to be tested with real Alexa users


