Today, we will implement the functions that compute the term frequency matrix. Term frequency measures how many times a particular word appears in a document. Since we are summarising a single article, our "documents" are the individual sentences within that article.

To achieve this, we need to read in the raw article, preprocess the text, compute the frequency matrix, and use that to compute our term frequency matrix.

Read and Sentence Tokenise

def read_and_sent_tokenise(self):
        # If we were given a file path, read the article from disk;
        # otherwise the document is already a string of text
        if not self.isText:
            with open(self.document, encoding='utf-8') as f:
                text = f.read()
        else:
            text = self.document
        
        self.sentences = sent_tokenize(text)
        self.total_sentences = len(self.sentences)
        
        return self.sentences

We have written the class to accept two types of input: an article file or the article text itself. In other words, the class accepts either a .txt file containing the article or one long string of the article. This is governed by the self.isText attribute: when you initiate the class, you specify whether the "document" you pass in is a file or a string of text.
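For example, both input modes look like this when you initiate the class (the file name and text here are made-up placeholders):

extractor_from_file = TFIDF_single_doc_extractor("my_article.txt")
extractor_from_text = TFIDF_single_doc_extractor("Some long article text...", isText=True)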

Once we have the article, the first thing we do is split it into sentences using nltk.sent_tokenize(). This returns a list of sentences, which we save in self.sentences. We also record the number of sentences in the article by calling len() on self.sentences.
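As a quick illustration of what sent_tokenize() gives us (the sample text is made up):

from nltk.tokenize import sent_tokenize   # requires a one-off nltk.download('punkt')

sample = "The cat sat on the mat. It was a sunny day. The dog barked."
print(sent_tokenize(sample))
# ['The cat sat on the mat.', 'It was a sunny day.', 'The dog barked.']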

Compute Frequency Matrix

def create_frequency_matrix(self):
        _stopwords = set(stopwords.words('english'))
        wordnet = WordNetLemmatizer()
        
        for sentence in self.sentences:
            freq_table = {}
            words = tokenizer.tokenize(sentence)
            for word in words:
                word = word.lower()
                # Skip stopwords before lemmatising, so a stopword cannot
                # slip through under an altered form
                if word in _stopwords:
                    continue
                word = wordnet.lemmatize(word)
                
                if word in freq_table:
                    freq_table[word] += 1
                else:
                    freq_table[word] = 1
            
            self.freq_matrix[sentence] = freq_table

In this class method, we are essentially creating a frequency dictionary for each sentence. By executing this method, we end up with a dictionary in which each word is mapped to how often it appears WITHIN the sentence. Remember that each sentence is a document! This method computes the frequency table for each sentence independently.

Before we compute the frequency tables, we need to preprocess (clean) the text. We implement simple preprocessing: lowercasing each word, removing stopwords, and lemmatisation. Lemmatisation maps inflected forms of a word onto a common base form, so that, for example, "dog" and "dogs" are both captured under the key "dog". One caveat: NLTK's WordNetLemmatizer treats every word as a noun unless you pass it a pos argument, so verb forms like "running" and "ran" are left unchanged by our code.
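A quick sketch of this behaviour:

from nltk.stem import WordNetLemmatizer   # requires a one-off nltk.download('wordnet')

wordnet = WordNetLemmatizer()
print(wordnet.lemmatize("dogs"))              # 'dog'     (default POS is noun)
print(wordnet.lemmatize("running"))           # 'running' (unchanged without a POS hint)
print(wordnet.lemmatize("running", pos="v"))  # 'run'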

Each frequency table is mapped to its respective sentence, and the results are saved in the self.freq_matrix attribute.
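To make the structure concrete, self.freq_matrix for a made-up two-sentence article would look roughly like this (the exact keys depend on the stopword list):

{
    "The dog chased the dog.": {"dog": 2, "chased": 1},
    "It barked loudly.": {"barked": 1, "loudly": 1},
}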

Compute Term Frequency (TF) Matrix

def create_tf_matrix(self):
        for sentence, freq_table in self.freq_matrix.items():
            tf_table = {}
            # Divide by the total number of (preprocessed) word tokens in
            # the sentence, not just the number of distinct words
            no_words_in_sentence = sum(freq_table.values())
            for word, count in freq_table.items():
                tf_table[word] = count / no_words_in_sentence
            
            self.tf_matrix[sentence] = tf_table

As mentioned above, to compute the term frequency (TF) of a word, we divide the number of times the word appears in the sentence by the total number of words in the sentence. For example, ignoring preprocessing for a moment, TF(dog) in the sentence "I love dog" equates to 1 / 3.

This method goes through the self.freq_matrix dictionary, computes the term frequency of each word in every sentence (document), and saves the results in the self.tf_matrix attribute.
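Continuing the made-up example above, the sentence "The dog chased the dog." has three word tokens surviving preprocessing ("dog", "chased", "dog"), so its entry in self.tf_matrix would be:

{"dog": 2 / 3, "chased": 1 / 3}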

What Our Class Looks Like Now!

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer, sent_tokenize

# One-off resource downloads, if you have not run these before:
# nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')

tokenizer = RegexpTokenizer(r'\w+')

class TFIDF_single_doc_extractor():
    
    def __init__(self, document, isText = False):
        self.document = document
        self.isText = isText
        self.sentences = None
        self.total_sentences = None

        self.freq_matrix = {}
        self.tf_matrix = {}
    
    def read_and_sent_tokenise(self):
        # If we were given a file path, read the article from disk;
        # otherwise the document is already a string of text
        if not self.isText:
            with open(self.document, encoding='utf-8') as f:
                text = f.read()
        else:
            text = self.document
        
        self.sentences = sent_tokenize(text)
        self.total_sentences = len(self.sentences)
        
        return self.sentences
    
    def create_frequency_matrix(self):
        _stopwords = set(stopwords.words('english'))
        wordnet = WordNetLemmatizer()
        
        for sentence in self.sentences:
            freq_table = {}
            words = tokenizer.tokenize(sentence)
            for word in words:
                word = word.lower()
                # Skip stopwords before lemmatising, so a stopword cannot
                # slip through under an altered form
                if word in _stopwords:
                    continue
                word = wordnet.lemmatize(word)
                
                if word in freq_table:
                    freq_table[word] += 1
                else:
                    freq_table[word] = 1
            
            self.freq_matrix[sentence] = freq_table
    
    def create_tf_matrix(self):
        for sentence, freq_table in self.freq_matrix.items():
            tf_table = {}
            # Divide by the total number of (preprocessed) word tokens in
            # the sentence, not just the number of distinct words
            no_words_in_sentence = sum(freq_table.values())
            for word, count in freq_table.items():
                tf_table[word] = count / no_words_in_sentence
            
            self.tf_matrix[sentence] = tf_table


    def create_sentences_per_words(self):
        pass

    def create_idf_matrix(self):
        pass
    
    def create_tf_idf_matrix(self):
        pass

    def score_sentences(self):
        pass
    
    def find_average_sentence_score(self):
        pass
        
    def generate_summary_by_avg(self):
        pass
    
    def generate_summary_by_top_sentences(self):
        pass

if __name__ == "__main__":
    
    TfidfClass = TFIDF_single_doc_extractor("article.txt")  # "article.txt" is a placeholder path
    print("Initiated TFIDF class")