Objective and Contribution

Introduced Contrastive Attention (CAt), a single-head Radial Basis Function (RBF) kernel attention mechanism that uses only in-domain word embeddings and a POS tagger. The model improves on previous work in both performance and interpretability.

Contrastive Attention (CAt)

There are four steps to CAt:

  1. Training in-domain word embeddings

  2. Aspect term extraction

  3. Aspect Selection using CAt

  4. Aspect Labelling

Training in-domain word embeddings

To extract and label aspects, we need in-domain word embeddings. We use a set of in-domain documents to train our word embeddings with word2vec.

Aspect term extraction

Here, we found that the most frequent nouns make good aspect candidates, so we use a POS tagger to identify nouns. We use the spaCy library for tokenisation and POS tagging.
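A small sketch of this step. To stay self-contained it assumes the sentences have already been tokenised and POS-tagged (e.g. by spaCy) into (token, tag) pairs; the example tokens are made up:

```python
from collections import Counter

# Assume tokens are already POS-tagged, e.g. with spaCy:
#   tagged = [(t.text, t.pos_) for t in nlp(sentence)]
tagged = [("the", "DET"), ("pizza", "NOUN"), ("was", "AUX"), ("great", "ADJ"),
          ("the", "DET"), ("service", "NOUN"), ("and", "CCONJ"),
          ("the", "DET"), ("pizza", "NOUN"), ("again", "ADV")]

# Keep only nouns and rank them by frequency.
noun_counts = Counter(tok.lower() for tok, tag in tagged if tag == "NOUN")
candidate_aspects = [word for word, _ in noun_counts.most_common(10)]
# candidate_aspects == ["pizza", "service"]
```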

Aspect Selection using CAt

Here, we begin with the vanilla attention mechanism, where we compute a probability distribution over words using the encoded sentence matrix and an aspect vector. We then multiply this probability distribution with the sentence matrix to compute a weighted sentence summary for each aspect.
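A minimal numpy sketch of this vanilla attention step, using random vectors in place of real embeddings:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def vanilla_attention(S, a):
    # Attention logits are dot products between each word embedding and the
    # aspect vector; softmax turns them into a probability distribution.
    att = softmax(S @ a)
    return att, att @ S  # distribution over words, weighted sentence summary

rng = np.random.default_rng(1)
S = rng.normal(size=(6, 50))  # encoded sentence: 6 words, 50-d embeddings
a = rng.normal(size=50)       # a single aspect vector
att, summary = vanilla_attention(S, a)
```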

However, we observed that changing the aspect vector into an aspect matrix produces a different attention distribution for each aspect, which leads to different weighted sentence summaries. To effectively integrate this set of query vectors into a single attention distribution, we use CAt, a single-head RBF kernel attention mechanism. Given the RBF kernel rbf(x, y) = exp(-γ‖x − y‖²), the sentence matrix S, and a set of aspect vectors A, the attention of a single word s_i is the sum of its RBF responses to all vectors in A, divided by the sum of the RBF responses of all words in S to all vectors in A:

att(s_i) = Σ_{a ∈ A} rbf(s_i, a) / Σ_{s_j ∈ S} Σ_{a ∈ A} rbf(s_j, a)

This gives us a probability distribution over the words in the sentence where words that are similar to the aspects get higher scores.
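The CAt computation can be sketched in numpy as follows; the γ value and random embeddings here are illustrative, not the paper's tuned settings:

```python
import numpy as np

def rbf_responses(S, A, gamma=0.03):
    # Pairwise RBF kernel: responses[i, k] = exp(-gamma * ||S[i] - A[k]||^2)
    sq_dists = ((S[:, None, :] - A[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def contrastive_attention(S, A, gamma=0.03):
    # Attention of word i: its summed response to all aspect vectors,
    # normalised by the summed responses of every word in the sentence.
    word_scores = rbf_responses(S, A, gamma).sum(axis=1)
    return word_scores / word_scores.sum()

rng = np.random.default_rng(0)
S = rng.normal(size=(6, 50))       # 6 word embeddings (one sentence)
A = rng.normal(size=(3, 50))       # 3 aspect vectors
att = contrastive_attention(S, A)  # probability distribution over words
d = att @ S                        # weighted sentence summary
```

Unlike dot-product attention, the RBF kernel is bounded and distance-based, so words far from every aspect vector contribute little regardless of their norm.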

Aspect Labelling

We compute the cosine similarity between the weighted document vector d and each label vector, and label each document with its closest label.
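A sketch of the labelling step; the 3-d label vectors below are toy values for illustration (in practice they are the in-domain embeddings of the label words themselves):

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def label_document(d, label_vectors):
    # Assign the label whose vector is closest (by cosine similarity)
    # to the weighted document vector d.
    return max(label_vectors, key=lambda lbl: cosine(d, label_vectors[lbl]))

label_vectors = {
    "FOOD": np.array([1.0, 0.0, 0.0]),
    "SERVICE": np.array([0.0, 1.0, 0.0]),
    "AMBIENCE": np.array([0.0, 0.0, 1.0]),
}
d = np.array([0.9, 0.1, 0.0])             # weighted document vector
label = label_document(d, label_vectors)  # → "FOOD"
```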


Experiments and Results

Our evaluation datasets are CitySearch, SemEval 2014 Restaurant, and SemEval 2015 Restaurant. We only keep sentences with exactly one aspect and focus on three labels: FOOD, SERVICE, and AMBIENCE. The statistics of the final datasets are shown below.

We trained our models on the SemEval 2014 and 2015 datasets and tested on the CitySearch dataset. We compare against six baseline models:

  1. W2VLDA. A topic model biased towards words that are similar to aspects

  2. SERBM. A type of Boltzmann Machine

  3. ABAE. An autoencoder that uses global context and aspect vectors to compute attention distributions of words in sentences

  4. AE-CSA. A hierarchical model that’s similar to ABAE

  5. Average word embeddings

  6. Regular attention mechanism


The results are displayed in table 3. Due to class imbalance (60% FOOD), the per-class F-scores are not representative of overall model performance, so we also computed weighted macro-average F1 scores, displayed in table 2. Our model outperformed all baseline models in weighted macro-average F1 score. It also achieved the best performance on two of the three individual aspects. The average word embeddings baseline performed relatively well despite using no attention or aspect knowledge, suggesting that aspect knowledge is probably not needed to perform well.

Ablation Study

We performed an ablation study on each component of our model: POS tagging, in-domain word embeddings, and data volume. For POS tagging, we evaluated two alternatives to selecting the most frequent nouns: selecting the most frequent words, and selecting nouns based on adjective-noun co-occurrence. Both alternatives underperformed the most frequent nouns, with the most frequent words showing the biggest drop in performance of -21.9 F1.

We also explored using pretrained GloVe embeddings as an alternative to in-domain embeddings and saw a large drop in performance of -32 F1. To determine how much in-domain data we need to train good in-domain word embeddings, we increased the training data in 10% increments, trained separate word embeddings, and measured the performance. Results are displayed below and show that we need approx. 260K sentences, which is a relatively low amount.

Error Analysis

The table below showcases the error types of our best-performing model. We observe that our model performs poorly on out-of-vocabulary (OOV) or low-frequency words. In addition, our model is based on word similarity, so homonyms may cause problems. We restricted our aspect terms to nouns, so our model misses aspects expressed as verbs. Lastly, discourse and implicature can lead to errors where the model lacks the world knowledge to infer common sense. Given these errors, we believe our model would perform poorly in domains where aspects are more implicit.

Conclusion and Future Work

Potential future work could involve addressing the limitations found in our error analysis. We could also apply our method to datasets from different domains and languages. In addition, we could replace the regular attention mechanism in supervised models with CAt to see how it affects their performance.


