Create LDA Model

Up to this point, we have created the dictionary (vocab) and converted all our documents into their numerical representations. We will now use the gensim package to create and train our LDA model. LdaModel takes many parameters, but we will only feed in the following:

  1. Training corpus (our bow_corpus)

  2. Number of topics (self.no_topics)

  3. Dictionary that maps index to words (self.dictionary)

  4. Passes (number of passes through our training corpus)
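Before the code, a quick reminder of the training-corpus format: doc2bow represents each document as a list of (token_id, count) pairs. A pure-Python sketch of that conversion, using a made-up vocabulary in place of our gensim dictionary:

```python
from collections import Counter

# Hypothetical vocabulary mapping tokens to integer ids,
# standing in for our gensim dictionary.
vocab = {"market": 0, "stock": 1, "trade": 2}

def to_bow(tokens):
    # Count occurrences of in-vocabulary tokens, mirroring
    # what dictionary.doc2bow does to each document.
    counts = Counter(t for t in tokens if t in vocab)
    return sorted((vocab[t], c) for t, c in counts.items())

print(to_bow(["stock", "market", "stock", "unknown"]))
# -> [(0, 1), (1, 2)]
```

Out-of-vocabulary tokens (here "unknown") are silently dropped, just as doc2bow drops tokens filtered out of the dictionary.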

def create_lda_model(self):
    # Train the LDA model on the bag-of-words corpus
    self.ldamodel = models.ldamodel.LdaModel(self.bow_corpus, num_topics=self.no_topics,
                                             id2word=self.dictionary, passes=15)
    self.ldamodel.save('model.gensim')
    print('Saved LDA model as model.gensim')

    return self.ldamodel

Predict

This is a simple method that uses our trained LDA model to categorise new, unseen documents. To do that, we:

  1. Process the unseen document

  2. Convert text to its numerical representation using doc2bow

  3. Apply our trained LDA model to get document topics distribution

def predict(self, processed_new_doc):
    new_doc_bow = self.dictionary.doc2bow(processed_new_doc)
    
    return self.ldamodel.get_document_topics(new_doc_bow)
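get_document_topics returns a list of (topic_id, probability) pairs. Picking the dominant topic from that distribution is then a one-liner; a small illustration with a made-up distribution:

```python
# Made-up topic distribution, in the shape returned by
# get_document_topics: (topic_id, probability) pairs.
doc_topics = [(0, 0.12), (3, 0.61), (4, 0.27)]

# The dominant topic is the pair with the highest probability.
dominant_topic, prob = max(doc_topics, key=lambda pair: pair[1])
print(dominant_topic, prob)  # -> 3 0.61
```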

Topic Categorisation

This method takes our trained LDA model and applies it to our text corpus. It categorises each data point in our text corpus into one of the topics (we set the number of topics to 5 at the beginning) and extracts the keywords that belong to that topic. The output is, for each data point, a tuple consisting of the topic number, its contribution to the data point, and the keywords associated with the topic.

def topic_categorisation(self, ldamodel=None, corpus=None):
    if ldamodel is None:
        ldamodel = self.ldamodel
    if corpus is None:
        corpus = self.bow_corpus

    topic_data = []
    for row in ldamodel[corpus]:
        # Order topics by probability, highest first
        row = sorted(row, key=lambda x: x[1], reverse=True)
        # Keep only the most dominant topic and its keywords
        topic_num, contribution = row[0]
        word_list = ldamodel.show_topic(topic_num)
        topic_keywords = ", ".join(word for word, _ in word_list)
        topic_data.append((int(topic_num), round(contribution, 4), topic_keywords))

    return topic_data
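The core of the loop above — sort by probability, keep the top topic, join its keywords — can be exercised on mock data (both the distribution and the keyword list below are made up for illustration):

```python
# Mock per-document topic distribution, as yielded by ldamodel[corpus],
# and a mock show_topic() result keyed by topic id.
row = [(1, 0.2), (4, 0.7), (0, 0.1)]
mock_show_topic = {4: [("market", 0.05), ("stock", 0.04), ("trade", 0.03)]}

# Sort by probability, highest first, and keep the dominant topic
row = sorted(row, key=lambda x: x[1], reverse=True)
topic_num, contribution = row[0]
keywords = ", ".join(word for word, _ in mock_show_topic[topic_num])

print((topic_num, round(contribution, 4), keywords))
# -> (4, 0.7, 'market, stock, trade')
```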

Final TopicModelling Class

# Required imports
import pickle

from gensim import corpora, models

# Defining TopicModelling class
class TopicModelling:

    def __init__(self, data):
        self.no_topics = 5
        self.no_below = 15
        self.no_above = 0.5 # fraction of documents
        self.keep_top_n = 100000

        self.dictionary = None
        self.bow_corpus = None
        self.ldamodel = None

        self.data = data.apply(lambda x: x.split())

    def create_filter_dictionary(self):
        
        # Creating dictionary
        self.dictionary = corpora.Dictionary(self.data)
        
        # Preview the first few (id, token) pairs
        for count, (k, v) in enumerate(self.dictionary.items()):
            print(k, v)
            if count > 10:
                break
        
        # Filtering out rare and overly common tokens
        self.dictionary.filter_extremes(no_below=self.no_below,
                                        no_above=self.no_above,
                                        keep_n=self.keep_top_n)
        
        # Saving dictionary
        self.dictionary.save('dictionary.gensim')
        print('Saved dictionary as dictionary.gensim')

    def create_bow(self):
        # Convert each document into its bag-of-words representation
        self.bow_corpus = [self.dictionary.doc2bow(doc) for doc in self.data]
        with open('bow_corpus.pkl', 'wb') as f:
            pickle.dump(self.bow_corpus, f)
        print('Saved bow corpus as bow_corpus.pkl')

    def create_lda_model(self):
        # Train the LDA model on the bag-of-words corpus
        self.ldamodel = models.ldamodel.LdaModel(self.bow_corpus, num_topics=self.no_topics,
                                                 id2word=self.dictionary, passes=15)
        self.ldamodel.save('model.gensim')
        print('Saved LDA model as model.gensim')

        return self.ldamodel

    def predict(self, processed_new_doc):
        new_doc_bow = self.dictionary.doc2bow(processed_new_doc)
        
        return self.ldamodel.get_document_topics(new_doc_bow)

    def topic_categorisation(self, ldamodel=None, corpus=None):
        if ldamodel is None:
            ldamodel = self.ldamodel
        if corpus is None:
            corpus = self.bow_corpus

        topic_data = []
        for row in ldamodel[corpus]:
            # Order topics by probability, highest first
            row = sorted(row, key=lambda x: x[1], reverse=True)
            # Keep only the most dominant topic and its keywords
            topic_num, contribution = row[0]
            word_list = ldamodel.show_topic(topic_num)
            topic_keywords = ", ".join(word for word, _ in word_list)
            topic_data.append((int(topic_num), round(contribution, 4), topic_keywords))

        return topic_data
Ryan

Data Scientist