There are three main stages to lang.ai’s intent discovery process:

  1. Preprocessing

  2. Discovery

  3. Postprocessing

Preprocessing is a simple pipeline of tokenisation, lemmatisation, and automatic correction. The correction step detects and replaces misspelled words.
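To make the three steps concrete, here is a minimal sketch in Python using only the standard library. It is not lang.ai's implementation: the tiny `VOCABULARY` and `LEMMAS` tables, the function names, and the use of `difflib` for spelling correction are all assumptions made for illustration.

```python
import re
from difflib import get_close_matches

# Hypothetical toy resources; a real system would use a full lexicon.
VOCABULARY = {"cancel", "my", "order", "refund", "please", "want", "to"}
LEMMAS = {"orders": "order", "cancelled": "cancel", "refunds": "refund"}


def tokenise(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def lemmatise(token):
    """Map an inflected form to its lemma via a lookup table (assumed)."""
    return LEMMAS.get(token, token)


def autocorrect(token):
    """Replace an unknown token with its closest vocabulary word, if any."""
    if token in VOCABULARY:
        return token
    matches = get_close_matches(token, sorted(VOCABULARY), n=1, cutoff=0.75)
    return matches[0] if matches else token


def preprocess(text):
    """Run tokenisation, then lemmatisation, then automatic correction."""
    return [autocorrect(lemmatise(token)) for token in tokenise(text)]


tokens = preprocess("Please cancle my orders")
```

With these toy tables, `tokens` holds `please`, `cancel`, `my`, `order`: the plural is reduced to its lemma and the misspelling is mapped to the nearest known word.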

In the discovery stage we extract the main intents and features in the dataset using measures from information theory. This stage has three sub-stages:

  1. Intent induction

  2. Semantic relation and clustering

  3. Feature extraction and deeper intents

Intent induction is where the unsupervised algorithm takes the raw input text and outputs a set of text categories that we consider to be intents. The algorithm works as follows:

  1. Calculate a set of semantic edges, where each edge is a pair of words considered to entail more information

  2. Define signatures as cliques of words, extracted from the input documents using the semantic edge relationships

  3. Organise the signatures in a DAG (directed acyclic graph) and regard the most general ones as intents
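The three steps above can be sketched as follows. The post does not say how semantic edges are scored, so this sketch assumes pointwise mutual information (PMI) over document co-occurrence. It also assumes that a signature is a maximal clique of at least two words, and that one signature is "more general" than another when it is a strict subset of it. All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def semantic_edges(docs, threshold=0.0):
    """Step 1: keep word pairs whose PMI exceeds a threshold (assumed measure)."""
    n = len(docs)
    word_df = Counter(w for d in docs for w in set(d))
    pair_df = Counter(
        frozenset(p) for d in docs for p in combinations(sorted(set(d)), 2)
    )
    edges = set()
    for pair, count in pair_df.items():
        a, b = tuple(pair)
        pmi = math.log((count * n) / (word_df[a] * word_df[b]))
        if pmi > threshold:
            edges.add(pair)
    return edges


def maximal_cliques(nodes, edges):
    """Enumerate maximal cliques with Bron-Kerbosch (no pivoting)."""
    adj = {
        v: {u for u in nodes if u != v and frozenset((u, v)) in edges}
        for v in nodes
    }
    found = []

    def expand(r, p, x):
        if not p and not x:
            found.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(nodes), set())
    return found


def signatures(docs, edges):
    """Step 2: signatures are cliques (size >= 2) found inside each document."""
    return {
        clique
        for d in docs
        for clique in maximal_cliques(set(d), edges)
        if len(clique) >= 2
    }


def build_dag(sigs):
    """Step 3a: link each signature to its strict supersets (general -> specific)."""
    return {s: {t for t in sigs if s < t} for s in sigs}


def intents(sigs):
    """Step 3b: the most general signatures have no strict subset among the rest."""
    return {s for s in sigs if not any(t < s for t in sigs)}


docs = [
    ["cancel", "order"],
    ["cancel", "order", "refund"],
    ["track", "package"],
    ["track", "package", "late"],
]
edges = semantic_edges(docs)
sigs = signatures(docs, edges)
dag = build_dag(sigs)
found_intents = intents(sigs)
```

On this toy corpus the two most general signatures are {cancel, order} and {track, package}, and the DAG links {cancel, order} to the more specific {cancel, order, refund}. The subset ordering guarantees the graph is acyclic, which is one simple way to obtain a DAG. The post does not specify the actual ordering used.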

Clustering is performed once the intents have been determined, to provide a higher-level categorisation of intents. We also go a step further with feature extraction, where the objective is to determine the features that are specific to a particular intent. This detection is similar to intent induction, except at a more granular level.
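As an illustration of the clustering step, the sketch below groups intents whose word sets overlap. The post does not describe the semantic relation used, so Jaccard similarity with a fixed threshold and single-link merging are assumptions, as are the function names and the example intents.

```python
def jaccard(a, b):
    """Share of words two intents have in common (assumed relation)."""
    return len(a & b) / len(a | b)


def cluster_intents(intents, threshold=0.3):
    """Single-link clustering: an intent joins every cluster it is linked to."""
    clusters = []
    for intent in sorted(intents, key=sorted):
        linked = [
            c for c in clusters
            if any(jaccard(intent, member) >= threshold for member in c)
        ]
        merged = {intent}.union(*linked)
        clusters = [c for c in clusters if c not in linked] + [merged]
    return clusters


example_intents = [
    frozenset({"cancel", "order"}),
    frozenset({"track", "order"}),
    frozenset({"reset", "password"}),
]
clusters = cluster_intents(example_intents)
```

Here the two order-related intents share one of three distinct words (similarity 1/3), so they fall into one cluster, while the password intent stays on its own. Lowering or raising `threshold` controls how coarse the higher-level categories are.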

The last stage is postprocessing, where we add more information to improve the accuracy and interpretability of our discovery. One of the steps is knowledge augmentation, which enriches the discovered intents with external information such as a knowledge base. Our intent induction follows an adaptive learning framework, which allows our models to perform better and more diverse inference on text categories over time.

Ryan

Data Scientist
