Import Dependencies

import pickle

import gensim
from gensim import corpora, models

The Constructor

It takes in the whole corpus of our text dataset as input data. Within the constructor, we set up all the hyperparameters and initialise any variables that the class methods will need. For topic modelling, we need to pre-define the number of topics. In the code snippet below, we set the number of topics to 5 (self.no_topics).

The next three hyperparameters are for filtering the dictionary once it has been created. We want to filter out extreme entries: keep only tokens that appear in at least 15 sentences and in no more than 50% of all sentences, and retain at most the 100,000 most frequent tokens.

Lastly, the constructor tokenises the input text data, ready to be used by the other methods.

def __init__(self, data):
    self.no_topics = 5
    self.no_below = 15
    self.no_above = 0.5 # fraction of sentences (50%)
    self.keep_top_n = 100000

    self.dictionary = None
    self.bow_corpus = None
    self.ldamodel = None

    self.data = data.apply(lambda x: x.split())  # tokenise each sentence (expects a pandas Series)
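Since the constructor calls data.apply, it assumes data is a pandas Series of pre-cleaned sentences. A minimal sketch of the tokenisation step, using hypothetical toy sentences:

```python
import pandas as pd

# Hypothetical toy corpus: each element is one pre-processed sentence
data = pd.Series([
    "machine learning topic model",
    "topic model for text data",
])

# The same tokenisation step the constructor performs
tokenised = data.apply(lambda x: x.split())
print(tokenised[0])  # → ['machine', 'learning', 'topic', 'model']
```

Each element of the resulting Series is now a list of tokens, which is the input format gensim's Dictionary expects.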

Create and Filter Dictionary (Vocab)

This method takes in self.data (the tokenised sentences from the entire text corpus) and creates a dictionary mapping each token to an integer id, together with its frequency statistics. It then filters the dictionary as described in the constructor section. Lastly, it saves the dictionary as dictionary.gensim.

def create_filter_dictionary(self):
    
    # Creating dictionary
    self.dictionary = corpora.Dictionary(self.data)
    
    # Preview the first few (token id, token) pairs
    count = 0
    for k, v in self.dictionary.items():  # iteritems() does not exist in Python 3
        print(k, v)
        count += 1
        if count > 10:
            break
    
    # Filtering dictionary (keyword arguments for clarity)
    self.dictionary.filter_extremes(no_below=self.no_below, no_above=self.no_above, keep_n=self.keep_top_n)
    
    # Saving dictionary
    self.dictionary.save('dictionary.gensim')
    print('Saved dictionary as dictionary.gensim')

Create Bag-of-Words (BoW) Corpus

This method takes the dictionary created in the previous method (saved in self.dictionary) and applies it to every sentence in the text corpus, converting each sentence into its numerical bag-of-words representation. The result is then saved as bow_corpus.pkl.

def create_bow(self):
    self.bow_corpus = [self.dictionary.doc2bow(doc) for doc in self.data]
    # Use a context manager so the file handle is closed properly
    with open('bow_corpus.pkl', 'wb') as f:
        pickle.dump(self.bow_corpus, f)
    print('Saved bow corpus as bow_corpus.pkl')

Current TopicModelling Class

# Defining TopicModelling class
class TopicModelling():

    def __init__(self, data):
        self.no_topics = 5
        self.no_below = 15
        self.no_above = 0.5 # fraction of sentences (50%)
        self.keep_top_n = 100000

        self.dictionary = None
        self.bow_corpus = None
        self.ldamodel = None

        self.data = data.apply(lambda x: x.split())  # tokenise each sentence (expects a pandas Series)

    def create_filter_dictionary(self):
        
        # Creating dictionary
        self.dictionary = corpora.Dictionary(self.data)
        
        # Preview the first few (token id, token) pairs
        count = 0
        for k, v in self.dictionary.items():  # iteritems() does not exist in Python 3
            print(k, v)
            count += 1
            if count > 10:
                break
        
        # Filtering dictionary (keyword arguments for clarity)
        self.dictionary.filter_extremes(no_below=self.no_below, no_above=self.no_above, keep_n=self.keep_top_n)
        
        # Saving dictionary
        self.dictionary.save('dictionary.gensim')
        print('Saved dictionary as dictionary.gensim')

    def create_bow(self):
        self.bow_corpus = [self.dictionary.doc2bow(doc) for doc in self.data]
        # Use a context manager so the file handle is closed properly
        with open('bow_corpus.pkl', 'wb') as f:
            pickle.dump(self.bow_corpus, f)
        print('Saved bow corpus as bow_corpus.pkl')

    def create_lda_model(self):
        pass

    def predict(self):
        pass

    def topic_categorisation(self):
        pass
Ryan

Data Scientist