Today’s post covers methods for evaluating the performance of task-oriented dialogue systems. We will compare their trade-offs and discuss how to choose suitable evaluation metrics.

Evaluation Metrics

There are multiple components in a dialogue system, as mentioned in previous posts. Each component has its own evaluation metrics, such as accuracy, F1 score, and BLEU score. However, combining these metrics to evaluate the dialogue system as a whole is very challenging. In reinforcement learning, the reward function is usually a weighted linear combination of individual metrics.
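
For instance, with component metrics m_1, …, m_k (e.g. success, cost, and quality signals) and tunable weights w_1, …, w_k, such a reward takes the form R = w_1·m_1 + w_2·m_2 + … + w_k·m_k, where the weights are hyperparameters chosen empirically.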

There are many metrics that we could include in the reward function. They fall into three classes (a sketch combining them follows this list):

  1. Task completion success. This can be captured by the task success rate: the proportion of dialogues that successfully complete the task. Under this scheme, the reward for each turn is 0 until the last turn, where it is +1 for success or -1 for failure.

  2. Dialogue cost. This could be the number of turns in the whole conversation; ideally, we want to solve the user’s problem in as few turns as possible. A simple way to incorporate this into the reward function is -1 for every turn.

  3. Other aspects of dialogue quality. These could include coherency, diversity, and personal style.
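
As a concrete illustration, here is a minimal Python sketch of a per-turn reward combining the three classes (the weights and the quality signal are hypothetical and would need tuning):

```python
def turn_reward(is_final_turn: bool, task_success: bool,
                quality_score: float = 0.0,
                w_success: float = 1.0, w_cost: float = 1.0,
                w_quality: float = 0.0) -> float:
    """Hypothetical weighted combination of the three metric classes."""
    reward = -1.0 * w_cost               # dialogue cost: penalise every turn
    reward += w_quality * quality_score  # optional quality signal (e.g. coherency)
    if is_final_turn:
        # task completion: +1 for success, -1 for failure, scaled by its weight
        reward += w_success if task_success else -w_success
    return reward
```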

Simulation-based Evaluation

Training the dialogue system with reinforcement learning means the system needs to interact with users to learn, which can be very expensive. As a result, a lot of research has focused on developing realistic user simulators. There are many ways to categorise user simulators: deterministic vs stochastic, content-based vs collaboration-based, static vs non-static, etc. We will consider the following two dimensions:

  1. Granularity. The simulator can operate at the dialogue-act level (intention) or at the utterance level; see the example after this list.

  2. Methodology. The simulator can be implemented with a rule-based or a model-based approach, the latter learning from real conversation data.
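
To make the granularity distinction concrete, the same user turn could be represented at either level (the schema below is purely illustrative, not a standard):

```python
# Utterance level: the raw natural-language turn
utterance = "Book two tickets for Dune at the nearest cinema, please."

# Dialogue-act level: the structured intention behind the turn
dialogue_act = {
    "intent": "request_booking",
    "inform_slots": {"movie": "Dune", "num_tickets": 2},
    "request_slots": ["cinema", "start_time"],
}
```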

Agenda-based Simulation

In this simulation, the conversation begins with a randomly generated user goal that is unknown to the dialogue manager. This user goal has two components. First, it imposes constraints on the dialogue through inform slots: for example, booking a particular film for a given number of people. Second, it has a few request slots that the user aims to fill throughout the conversation: for example, which cinema and what start time. These slots are domain-specific. The key to agenda-based simulation is how the user agenda is maintained at each turn of the conversation. A user agenda is a stack data structure in which each entry corresponds to a pending intention the user aims to achieve, processed in first-in-last-out order. A minimal sketch of the goal and the agenda follows.
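
Here is a minimal Python sketch, assuming a hypothetical movie-booking domain (the slot names and the list-as-stack representation are illustrative):

```python
# Hypothetical user goal for a movie-booking domain.
user_goal = {
    "inform_slots": {"movie": "Dune", "num_people": 2},     # constraints the user imposes
    "request_slots": {"cinema": None, "start_time": None},  # information the user wants
}

# The agenda is a stack of pending user intentions (dialogue acts).
# Requests are pushed first, so the informs pushed last are popped first.
agenda = []
for slot in user_goal["request_slots"]:
    agenda.append(("request", slot))
for slot, value in user_goal["inform_slots"].items():
    agenda.append(("inform", slot, value))

# At each turn the simulator pops the top entry to form the next user act,
# and pushes new entries in response to the system's latest action.
next_user_act = agenda.pop()  # ("inform", "num_people", 2)
```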

Model-based Simulation

This simulation starts in the same way as the agenda-based simulation but is entirely based on data. At each turn, the simulator takes in all the context collected so far in the conversation, feeds it into an LSTM, and outputs the next user action. The collected context includes the most recent machine action, the constraint and request status, and any inconsistency between the machine’s information and the user goal.
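
A minimal PyTorch sketch of this idea, assuming each turn’s context has already been encoded into a fixed-size feature vector (all dimensions and names here are illustrative):

```python
import torch
import torch.nn as nn

class LSTMUserSimulator(nn.Module):
    """Predicts the next user action from the running conversation context."""

    def __init__(self, context_dim: int, hidden_dim: int, num_actions: int):
        super().__init__()
        self.lstm = nn.LSTM(context_dim, hidden_dim, batch_first=True)
        self.action_head = nn.Linear(hidden_dim, num_actions)

    def forward(self, context_seq: torch.Tensor) -> torch.Tensor:
        # context_seq: (batch, turns, context_dim), one feature vector per turn,
        # encoding the latest machine action, the constraint/request status, and
        # any inconsistency between the machine's information and the user goal.
        output, _ = self.lstm(context_seq)
        last_hidden = output[:, -1, :]         # hidden state after the latest turn
        return self.action_head(last_hidden)   # logits over possible user actions
```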

Human-based Evaluation

Since user simulation is still far from matching real human users, it is good practice to also perform human-based evaluation. There are two types of human users:

  1. Human subjects. Recruited through crowd-sourcing platforms to test-use the dialogue system, while all the usual metrics are measured.

  2. Actual users. This group is similar to human subjects, except they are real users with real goals (tasks). This is more reliable but carries a higher risk of negative user experience. You can launch your dialogue system in production so that users can use it and report back any bugs or failures.

Although this is the best approach to evaluating a dialogue system, it has a few limitations:

  1. It is expensive and time-consuming to recruit enough human subjects to detect significant differences in metrics, often leading to inconclusive results.

  2. It is impractical to train an RL agent that learns from live interaction with users.

Other Evaluation Techniques (Self-play Concept)

Most recent work on evaluation techniques has focused on the self-play technique from reinforcement learning. Self-play is usually applied to two-player games where both players are controlled by the same agent. With this method, a large number of iterations can be performed at relatively low cost, allowing the agent to learn a good policy. The concept has to be adapted for dialogue systems, however, since the two parties play different roles. Strong results have been reported on negotiation and task-oriented dialogues with this approach.
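
Sketched as a single self-play episode (the agent and environment interfaces here are hypothetical; real implementations differ in how the two policies are represented and updated):

```python
def self_play_episode(system_agent, user_agent, env, max_turns=20):
    """One self-play dialogue between two agents with different roles."""
    state = env.reset()  # samples a user goal, hidden from the system agent
    trajectory = []
    for _ in range(max_turns):
        system_act = system_agent.act(state)          # system speaks
        user_act = user_agent.act(state, system_act)  # simulated user replies
        state, reward, done = env.step(system_act, user_act)
        trajectory.append((state, system_act, user_act, reward))
        if done:  # task completed or abandoned
            break
    return trajectory  # used to update one or both policies
```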

Ryan
