Topic Modelling

If you have read my previous implementation series on summarisation using TF-IDF, where I built a TF-IDF summariser class, you will know that I like to build the skeleton of the class before diving into the actual implementation. I find that this forces you to really think about AND understand the whole transformation pipeline of the text data! Having said that, below is the skeleton of the TopicModelling class that I will be building over the next few posts. It involves the following steps:

  1. Create and filter word dictionary (vocabulary)
  2. Create a bag-of-words (BoW) representation of each document (Gensim's doc2bow) using the vocabulary created in step 1
  3. Pass the BoW corpus to train our LDA model using the Gensim library
  4. Implement the predict method to categorise future documents
  5. Group our text data using the trained LDA model

# Defining TopicModelling class
class TopicModelling:

    def __init__(self):
        pass

    def create_filter_dictionary(self):
        pass

    def create_bow(self):
        pass

    def create_lda_model(self):
        pass

    def predict(self):
        pass

    def topic_categorisation(self):
        pass
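
To make steps 1 and 2 a bit more concrete, here is a minimal sketch in plain Python of what "create and filter a vocabulary" and "turn each document into a bag of words" mean. This is not Gensim's implementation and not necessarily how the class will end up. The function names (build_vocabulary, to_bow), the thresholds (min_docs, max_doc_fraction) and the toy documents are all assumptions made up for illustration, loosely modelled on the idea of dropping words that are too rare or too common.

```python
from collections import Counter


def build_vocabulary(tokenised_docs, min_docs=2, max_doc_fraction=0.5):
    """Map each kept token to an integer id.

    A token is kept if it appears in at least `min_docs` documents and in
    no more than `max_doc_fraction` of all documents (hypothetical thresholds).
    """
    doc_freq = Counter()
    for doc in tokenised_docs:
        doc_freq.update(set(doc))  # count each token once per document

    n_docs = len(tokenised_docs)
    kept = sorted(
        tok for tok, df in doc_freq.items()
        if df >= min_docs and df / n_docs <= max_doc_fraction
    )
    return {tok: idx for idx, tok in enumerate(kept)}


def to_bow(doc, vocabulary):
    """Convert one tokenised document to sorted (token_id, count) pairs."""
    counts = Counter(vocabulary[tok] for tok in doc if tok in vocabulary)
    return sorted(counts.items())


# Toy, made-up documents (already tokenised and cleaned)
docs = [
    ["late", "delivery", "refund"],
    ["refund", "request", "late"],
    ["great", "service", "thanks"],
    ["delivery", "great", "thanks", "thanks"],
]

vocabulary = build_vocabulary(docs, min_docs=2, max_doc_fraction=0.5)
corpus = [to_bow(doc, vocabulary) for doc in docs]
```

Words that occur in only one document ("request", "service") are filtered out, and each document becomes a short list of (token id, count) pairs. That list is the kind of input an LDA model is trained on.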

Sentiment Analysis

The objective of this project is to cluster similar messages together. Although sentiment analysis is useful for grouping the messages, it is not the primary focus of this project (or at least not mine). So I have decided to use a well-established out-of-the-box sentiment analyser: VADER, a lexicon- and rule-based tool available as the vaderSentiment package.

Here’s how I applied VADER to our processed text data:

# Imports needed by the snippet below
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Load the cleaned dataset
cleaned_dataset = pd.read_csv('processed_data.csv')

# Initialising sentiment analyser
sentiment_analyser = SentimentIntensityAnalyzer()

# Apply sentiment analysis to the processed message body, keeping only the compound score (-1 to 1)
cleaned_dataset['sentiment'] = cleaned_dataset['message_body_processed'].apply(
    lambda x: sentiment_analyser.polarity_scores(x)['compound']
)
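
The compound score is a single number between -1 and 1, so if you want discrete labels for grouping, you need to bucket it. Below is a minimal sketch of one way to do that, assuming the commonly cited ±0.05 cut-offs from the VADER documentation. The function name label_sentiment and the example scores are hypothetical, and the threshold is something you may want to tune for your own data.

```python
def label_sentiment(compound, threshold=0.05):
    """Bucket a VADER compound score into a coarse sentiment label.

    The 0.05 threshold is an assumption based on the conventional cut-off;
    it is not a value taken from this project.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Made-up compound scores, for illustration only
example_scores = [0.62, -0.41, 0.0, 0.05, -0.05]
example_labels = [label_sentiment(score) for score in example_scores]
```

With the DataFrame from above, the same function could be applied to the sentiment column, e.g. cleaned_dataset['sentiment'].apply(label_sentiment).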
Ryan

Data Scientist
