Objective and Contribution

This work uses unsupervised and supervised techniques to perform aspect extraction from financial microblogs. The contribution is twofold: it extracts domain-specific (finance) aspects, and it tackles both implicit and explicit aspect extraction.

Annotation and Dataset

  1. Predefined a stock-investment taxonomy to extract both implicit and explicit aspects

  2. Created a corpus with 7 aspect classes and 32 aspect subclasses. The corpus has 368 messages, of which 218 contain implicit aspects and 150 contain explicit aspects

Methodology

Two approaches are used:
  1. Distributional Semantics Model (DSM)

  2. Supervised ML Models – XGBoost, Random Forests, SVM, and Conditional Random Fields

Distributional Semantics Model (DSM)

Essentially, the DSM uses word embeddings to compute semantic relatedness between candidate terms and aspect subclasses. There are two steps:

  1. Extracting candidates. Morpho-syntactic patterns select relevant noun and verb phrases, including modifiers such as adverbs and adjectives, e.g. “declining revenues”.

  2. Computing relatedness with the classes. Once the candidates have been extracted, semantic relatedness is computed by comparing the candidate vectors with the aspect-subclass vectors. Multi-word candidates are combined into a single vector. Cosine similarity is computed for all possible pairwise combinations of tokens in each message, and the highest-scoring pair is retained
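The relatedness step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy 3-dimensional vectors stand in for real pretrained embeddings (e.g. word2vec), and the candidate phrase and subclass labels are invented examples.

```python
import numpy as np

# Toy embeddings standing in for a real pretrained word-vector model
embeddings = {
    "declining": np.array([0.9, 0.1, 0.0]),
    "revenues":  np.array([0.8, 0.2, 0.1]),
    "sales":     np.array([0.7, 0.3, 0.1]),
    "dividend":  np.array([0.1, 0.9, 0.2]),
}

def phrase_vector(phrase):
    """Combine a multi-word candidate into a single vector by averaging."""
    vecs = [embeddings[w] for w in phrase.split() if w in embeddings]
    return np.mean(vecs, axis=0)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Score one extracted candidate against each aspect-subclass label
candidate = "declining revenues"
subclasses = ["sales", "dividend"]
scores = {s: cosine(phrase_vector(candidate), phrase_vector(s)) for s in subclasses}
best = max(scores, key=scores.get)  # the highest-scoring pair is retained
```

With real embeddings, the same averaging-plus-cosine scheme applies unchanged; only the vector source differs.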

Supervised ML Models

This is a multi-class supervised classification problem. It involves feature engineering, machine learning algorithm optimisation, and model selection and evaluation:

  1. Feature engineering. This includes BoW features (binary counts, frequency counts, and TF-IDF), POS tags, numericals, and the predicted sentiment of the entity

  2. ML algorithm optimisation. Four ML algorithms were chosen: XGBoost, Random Forests, SVM, and Conditional Random Fields. Hyperparameters were tuned using the Particle Swarm Optimisation (PSO) method

  3. Model selection and evaluation. The best model among the DSM and ML models was chosen using cross-validation, and the selected model was validated using the leave-one-out option
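A minimal sketch of the supervised pipeline described above: TF-IDF features feeding a Random Forest (one of the paper's four models), scored with cross-validation. The messages, labels, and class names below are invented for illustration, they are not the real corpus, and scikit-learn's defaults stand in for the paper's PSO-tuned hyperparameters.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Invented toy messages and aspect labels (NOT the paper's corpus)
messages = [
    "revenues declining this quarter",
    "strong revenue growth reported",
    "dividend cut announced",
    "dividend payout raised again",
    "quarterly revenue beat estimates",
    "board suspends the dividend",
]
labels = ["Earnings", "Earnings", "Dividend", "Dividend", "Earnings", "Dividend"]

# TF-IDF features + Random Forest classifier in one pipeline
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=0)),
])

# Stratified 3-fold cross-validation, as used for model selection
scores = cross_val_score(model, messages, labels, cv=3)

model.fit(messages, labels)
pred = model.predict(["dividend increase expected"])[0]
```

Swapping `RandomForestClassifier` for an XGBoost, SVM, or CRF estimator changes only the `clf` step; the feature pipeline and cross-validation loop stay the same.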


XGBoost scored the highest accuracy and was selected. The figure below shows the results of XGBoost on aspect-class and aspect-subclass classification, as well as on implicit and explicit aspect classification. XGBoost scored 71% accuracy on the 7-class aspect classification, 82% on explicit aspect classification, and 35% on implicit aspect classification.


Explicit aspect classification performed well, but implicit aspect classification still needs more work and could be tackled with a larger dataset and better feature engineering.


