Introduction to NLP Discovery Series

NLP Discovery is a series where I highlight and summarise key new developments in NLP. Through this series, I aim to stay up to date with the latest developments across different NLP areas. This, in my opinion, goes hand in hand with my Learn NLP with Me series: on one hand, I am learning the basic theoretical foundations of NLP concepts (bottom-up, I guess…), and on the other, I am learning about the latest developments in the field and the key challenges and future research that researchers are, or will be, working on (top-down).

Towards a Conversational Agent that can chat about… anything!


  • Conversational agents (chatbots) are generally very specialised, and they don’t allow users to stray too far from the expected usage

  • Our goal is to eventually build a generalised conversational agent that can talk about anything a user wants!
    • Current open-domain chatbots have major flaws – their responses often don’t make sense, are inconsistent, and lack basic knowledge about the world

Key takeaways

  • Meena – proposed model
    • A 2.6 billion parameter end-to-end neural chatbot

    • Can conduct more sensible and specific conversations than existing state-of-the-art (SOTA) chatbots (based on SSA)

  • New Human Evaluation Metric
    • Introduced Sensibleness and Specificity Average (SSA)
      • Captures basic but important attributes of human conversation

    • Discovered that perplexity is highly correlated with SSA


  • The training objective is to minimise perplexity, the uncertainty of predicting the next word in a conversation

  • The architecture is the Evolved Transformer seq2seq
    • It has 1 Evolved Transformer encoder block
      • Processes the conversation context to help Meena understand what has already been said

    • It has 13 Evolved Transformer decoder blocks
      • Uses the context vector from the encoder block to formulate an actual response

      • Through hyperparameter tuning, the authors discovered that the key to higher conversational quality is a more powerful decoder

  • Training data are organised in a tree thread format, where each reply in the thread is treated as one conversation turn. Each training example has seven turns of context – a good balance between having enough context to train the chatbot and fitting the model within memory constraints

  • Trained on 341GB of text data (8.5x more data than the GPT-2 model)

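Since the training objective is to minimise perplexity, it helps to see how perplexity falls out of per-token probabilities. This is a minimal sketch of the standard definition (exp of the average negative log-likelihood), not Meena's actual training code:

```python
import math

def perplexity(token_log_probs):
    """Perplexity from per-token log-probabilities (natural log).

    Perplexity is exp of the average negative log-likelihood. It can be
    read as the effective number of choices the model is weighing when
    it picks each next token.
    """
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# A model that assigns probability 0.1 to every observed token behaves
# as if it were choosing uniformly among 10 options:
print(perplexity([math.log(0.1)] * 5))  # ≈ 10
```

A perfectly confident model (probability 1.0 on every observed token) would reach the minimum perplexity of 1.
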
Sensibleness and Specificity Average (SSA)

  • Hired crowd-sourced workers to label model responses in order to compute SSA
    • For each response, crowd workers need to answer two questions:
      • “Does it make sense?”

      • “Is it specific?”

  • For each chatbot, the authors collected 1,600 to 2,400 individual conversation turns across around 100 conversations. The SSA score is the average of the fraction of responses labelled “sensible” (the chatbot’s sensibleness) and the fraction labelled “specific” (its specificity)

  • Meena outperformed all existing SOTA chatbots by large margins and comes close to human performance

Automatic metric: Perplexity

  • SSA is very labour-intensive, so we need an automatic metric to easily and reliably evaluate our chatbots

  • The authors found that perplexity has a strong correlation with the SSA evaluation metric. The lower the perplexity, the more confident the model is about the next token it generates; perplexity represents the number of choices the model is effectively weighing when producing the next token

  • The findings show that the lower the perplexity, the better the model’s SSA score, with a strong correlation (R² = 0.93)
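
To make the reported R² concrete, here is a sketch of how such a correlation is measured: compute the Pearson correlation between the two series and square it. The data points below are made up purely for illustration; they are not the paper's measurements:

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Illustrative (invented) chatbot measurements: lower perplexity
# tends to pair with a higher SSA score.
perplexities = [10.2, 12.4, 16.3, 18.1, 56.2]
ssa_scores = [0.79, 0.72, 0.56, 0.47, 0.31]

r = pearson_r(perplexities, ssa_scores)
print(r ** 2)  # R²: the share of SSA variance explained by perplexity
```

The correlation is negative (perplexity down, SSA up), but R² discards the sign and reports only the strength of the linear relationship.
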

Future research & Challenges

  • Continue to lower the perplexity of chatbots to see whether this yields better chatbot performance

  • There are other attributes such as personality and factuality that are worth exploring

  • The safety and bias of the models are important aspects of research


