26.5 Evaluating Dialogue Systems
How can we evaluate dialogue systems?
User’s satisfaction rating – ask users to rate the dialogue; very costly if you run this every time you change or add a feature
Task completion success – measure the correctness of the end solution. This could be the slot error rate, the percentage of slots that were filled with the wrong value. We could also compute slot precision, recall, and F-score. The slot error rate is the most important evaluation metric
User’s perception of task completion – sometimes a better predictor of user satisfaction than actual task completion
Task error rate – what fraction of tasks did the system perform incorrectly?
Efficiency – measures how efficiently the system helps users. There are many proxy measurements for this, such as total elapsed time and the number of dialogue turns
Quality – measures other aspects of the interaction. For example, how often did the ASR fail to recognise the user’s utterance?
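The slot-filling metrics above can be sketched in a few lines. This is an illustrative example, not a standard implementation: the slot names and values are hypothetical, and slot error rate is defined here simply as the fraction of reference slots not filled correctly (definitions vary in the literature).

```python
def slot_metrics(predicted: dict, reference: dict) -> dict:
    """Compare predicted slot-value pairs against a reference frame."""
    # A slot counts as correct only if it is filled with the right value
    correct = sum(1 for slot, value in predicted.items()
                  if reference.get(slot) == value)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # One possible definition: fraction of reference slots filled wrongly
    slot_error_rate = 1.0 - recall
    return {"precision": precision, "recall": recall,
            "f1": f1, "slot_error_rate": slot_error_rate}

# Hypothetical travel-booking frame: two of three slots filled correctly
reference = {"origin": "London", "destination": "Paris", "date": "2024-05-01"}
predicted = {"origin": "London", "destination": "Berlin", "date": "2024-05-01"}
print(slot_metrics(predicted, reference))
```

Here the system got the destination wrong, so recall is 2/3 and the slot error rate is 1/3.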
26.6 Dialogue System Design
What is a voice user interface?
It’s the design of dialogue strategies, prompts, and error messages. It generally follows user-centered design principles.
What are the user-centered design principles?
Study the user and task – domain study of the potential users and tasks
Build simulations and prototypes – a crucial tool here is the “Wizard-of-Oz” system, where users interact with what they believe is a software agent but which is in fact operated by a human. This can be used to test the architecture, user interface, and user experience before implementation
Iteratively test the design on users
What are the ethical issues in dialogue system design?
An important ethical issue is bias. A well-known example is Microsoft’s Tay chatbot: within 16 hours of going live, it began posting racist comments, personal attacks, and conspiracy theories.
Another important ethical issue is privacy. Chatbots trained on human-human or human-machine conversation data must ensure that personally identifiable information in the data is anonymised.
Lastly, chatbots raise gender-equality issues; for example, many voice assistants default to female personas, which can reinforce stereotypes.
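To make the anonymisation requirement concrete, here is a minimal sketch of scrubbing personally identifiable information (PII) from conversation logs before training. The regex patterns are simplistic assumptions for illustration only; real systems use trained named-entity recognizers and cover many more PII categories (names, addresses, IDs, and so on).

```python
import re

# Illustrative PII patterns (assumed for this sketch, not exhaustive).
# Order matters: more specific patterns (CARD) run before broader ones
# (PHONE), since a 16-digit card number would otherwise match as a phone.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "CARD": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
    "PHONE": re.compile(r"(?<!\w)\+?\d[\d\s-]{7,}\d"),
}

def anonymize(utterance: str) -> str:
    """Replace matched PII spans with typed placeholder tokens."""
    for label, pattern in PII_PATTERNS.items():
        utterance = pattern.sub(f"[{label}]", utterance)
    return utterance

print(anonymize("Email me at jane.doe@example.com or call +44 20 7946 0958"))
# → Email me at [EMAIL] or call [PHONE]
```

Replacing spans with typed placeholders like `[EMAIL]` rather than deleting them keeps the utterance structure usable for training while removing the identifying content.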