Semantic textual similarity (STS) is an NLP task that measures how similar two pieces of text are. Applications of STS that I have seen include data augmentation, question suggestion in chatbots, and duplicate detection. I recently came across a Google AI article on advances in STS; below is a summary of what I learned.
The basic method for computing STS between two texts is to take the cosine similarity between their sentence representations. There are multiple ways to compute a sentence representation, such as averaging pre-trained word embeddings (Word2Vec, GloVe) or using learned sentence embeddings (InferSent).
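A minimal sketch of this baseline, using a handful of made-up word vectors in place of real Word2Vec or GloVe embeddings (the values and the tiny vocabulary are hypothetical, only the averaging-plus-cosine recipe is the point):

```python
import numpy as np

# Toy "pretrained" word embeddings (hypothetical values; real ones would
# be loaded from Word2Vec or GloVe files and have hundreds of dimensions).
EMBEDDINGS = {
    "how": np.array([0.2, 0.6, 0.1]),
    "are": np.array([0.5, 0.1, 0.3]),
    "old": np.array([0.9, 0.2, 0.4]),
    "you": np.array([0.3, 0.4, 0.8]),
}

def sentence_vector(sentence):
    """Represent a sentence as the average of its word vectors."""
    words = sentence.lower().strip("?").split()
    vectors = [EMBEDDINGS[w] for w in words if w in EMBEDDINGS]
    return np.mean(vectors, axis=0)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim = cosine_similarity(sentence_vector("How are you?"),
                        sentence_vector("How old are you?"))
```

Note that this baseline scores the two questions as nearly identical, since they share almost all of their words — exactly the failure mode the response-based approach below is meant to address.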
Learning STS from Conversations
However, Google AI's research introduces a new way of computing sentence representations for STS. The idea is that rather than computing STS from word-level similarity between two sentences, we should compute it from how similar their distributions of plausible responses are: sentences that are semantically similar should elicit a similar distribution of responses. The example given in the paper is shown below.
Even though “How are you?” and “How old are you?” have almost identical words, they have different meanings and responses.
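A toy illustration of the response-based idea: compare two sentences by the distributions they induce over a shared pool of candidate responses. The candidate responses and relevance scores below are hand-written for illustration; in the actual model those scores come from a learned response-ranking encoder.

```python
import numpy as np

# Shared pool of candidate responses, and hypothetical relevance scores
# of each sentence against that pool (hand-written here; a trained
# conversational model would produce these).
RESPONSES = ["I am fine.", "Pretty good.", "I am 25.", "Just turned 30."]
SCORES = {
    "How are you?":     np.array([5.0, 4.5, 0.5, 0.2]),
    "How old are you?": np.array([0.3, 0.4, 5.2, 4.8]),
}

def response_distribution(sentence):
    """Softmax over candidate-response scores for a sentence."""
    z = np.exp(SCORES[sentence] - SCORES[sentence].max())
    return z / z.sum()

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim = cosine_similarity(response_distribution("How are you?"),
                        response_distribution("How old are you?"))
# Despite sharing most of their words, the two questions prefer disjoint
# responses, so the similarity of their response distributions is low.
```

This flips the result of the word-averaging baseline: near-identical wording no longer implies near-identical meaning.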
Universal Sentence Encoder
The Universal Sentence Encoder is an encoder-only model trained with multitask learning: it jointly trains on several tasks, including a skip-thought-like task that predicts the sentences surrounding a given piece of text. Because the architecture is encoder-only, training time is greatly reduced, while performance on a number of transfer tasks, including STS, is preserved. The goal is a universal model (a single encoder) that can support many NLP tasks, such as clustering, paraphrase detection, and custom text classification.
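The "one encoder, many tasks" pattern can be sketched as follows. The `encode` function here is a deliberately crude stand-in (a hashed bag-of-words, not the real trained network); the point is only that a single fixed-size embedding can feed several downstream uses.

```python
import numpy as np

def encode(sentence, dim=16):
    """Stand-in for a universal sentence encoder.

    Hypothetical: hashes words into a fixed-size, L2-normalized vector.
    The real Universal Sentence Encoder is a trained neural network.
    """
    vec = np.zeros(dim)
    for word in sentence.lower().split():
        vec[hash(word) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# One encoder, reused across tasks:
emb = encode("How are you?")

# 1) Paraphrase detection: cosine similarity against another sentence
#    (vectors are unit-length, so a dot product suffices).
sim = float(np.dot(emb, encode("How do you do?")))

# 2) Custom text classification: feed the fixed-size embedding to any
#    lightweight classifier (here a hypothetical linear scorer).
weights = np.random.default_rng(0).normal(size=16)
score = float(np.dot(weights, emb))
```

The design point is that the expensive part (the encoder) is trained once and shared, while each downstream task only needs a small head on top of the embedding.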