This is the first post in a multi-part series that shows, step by step, how to build a TFIDF class for summarisation. The whole process is illustrated in the figure below. In this first post, I will go through how to set up a Python class for TFIDF.

Constructing a Python Class

A lot of people struggle to build a class in Python, either because they don’t have a clear understanding of the class’s purpose or because they are overwhelmed by the thought of building the whole class in one go, with every method and attribute fully written out. To solve the first problem, clearly identify the use case of the class. In this mini-series, we are building a TFIDF class for summarisation, so its use case is to take in a raw article and output a short summary that contains the salient information.

Once you have clearly identified the purpose of the class and know its input and output, you can start to brainstorm the steps in between. When you have identified those steps, you should have something that looks like the figure above.

Now comes the implementation part. What I usually like to do first is create the class with all the intermediate steps as methods, but leave the method bodies empty, as shown below. This gives you a skeleton for your class without having to write everything in one go, which takes care of the second problem. Don’t worry if you think you haven’t included all the necessary methods: you can always add more later, and the skeleton still gives you a strong start. It also lets you build one method at a time and test each one as you go, to make sure it behaves as you expect.

class TFIDF_single_doc_extractor():
    
    def __init__(self):
        pass
    
    def read_and_sent_tokenise(self):
        pass
    
    def create_frequency_matrix(self):
        pass
    
    def create_tf_matrix(self):
        pass

    def create_sentences_per_words(self):
        pass

    def create_idf_matrix(self):
        pass
    
    def create_tf_idf_matrix(self):
        pass

    def score_sentences(self):
        pass
    
    def find_average_sentence_score(self):
        pass
        
    def generate_summary_by_avg(self):
        pass
    
    def generate_summary_by_top_sentences(self):
        pass

if __name__ == "__main__":

    # Instantiate the empty skeleton to confirm the class definition runs.
    TfidfClass = TFIDF_single_doc_extractor()
    print("Initiated TFIDF class")
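
To illustrate the "one method at a time" idea, here is a minimal sketch of what filling in just the first method might look like while the rest of the skeleton stays empty. The `article_text` constructor argument, the `sentences` attribute, and the naive regex-based sentence split are all assumptions made for this illustration. They are not necessarily how the finished class in this series will work, and a proper sentence tokeniser would handle abbreviations and other edge cases that this split does not.

```python
import re


class TFIDF_single_doc_extractor():

    def __init__(self, article_text):
        # Assumption: the raw article is passed in as a plain string.
        self.article_text = article_text
        self.sentences = []

    def read_and_sent_tokenise(self):
        # Naive stand-in for a real sentence tokeniser: split on whitespace
        # that follows '.', '!' or '?'.
        parts = re.split(r"(?<=[.!?])\s+", self.article_text.strip())
        # Drop empty strings (e.g. when the article itself is empty).
        self.sentences = [part for part in parts if part]
        return self.sentences

    def create_frequency_matrix(self):
        # Still an empty stub, to be filled in later.
        pass


if __name__ == "__main__":
    extractor = TFIDF_single_doc_extractor(
        "TFIDF scores words. It is simple! Is it useful?"
    )
    sentences = extractor.read_and_sent_tokenise()
    # Quick check that the one implemented method behaves as expected.
    assert len(sentences) == 3
    print(sentences)
```

Because every other method is still a stub, you can run and check this one piece on its own before moving on to the next method.
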
Ryan

Data Scientist
