I recently completed my master's thesis on abstractive text summarisation. As part of the thesis, I had to compare different content selection techniques to evaluate how well they select salient information from source documents. One of these techniques is TFIDF. TFIDF is a good baseline to compare against more advanced models, and it is very quick to compute. In this blog post, I will explain how I applied TFIDF to extractive summarisation.
Term Frequency-Inverse Document Frequency (TFIDF) is a lexical frequency method that extends the word probability method: it measures the importance of a word relative to both the document it appears in and the entire corpus of documents. To compute the TFIDF weight of a word, we need, in addition to the word's frequency within a document, the number of documents that contain the word. Since we are summarising a single document, our "documents" are sentences, and we compute the frequency of each word per sentence.

Term frequency (TF) measures how often a word appears within a single sentence: the more often a word occurs in a sentence, the more likely it is to be important. Inverse document frequency (IDF) measures how distinctive a word is by taking into account how often it appears in other sentences. The word "and" appears in almost every sentence, so it carries little information, whereas the word "NLP" might appear in only a few sentences, making IDF(NLP) high. If "NLP" appears in a particular sentence, it is therefore likely to be an important word or topic within that sentence. The higher the TFIDF weight of a word, the more important that word is to the sentence it belongs to. We can compute the TFIDF weight of the word w as follows:
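As a minimal sketch of the idea above (an illustration I wrote for this post, not the thesis code), here is sentence-level TFIDF in plain Python, where each sentence plays the role of a "document". The tokeniser and the TF normalisation (count divided by sentence length) are my own assumptions.

```python
import math
import re
from collections import Counter

def tfidf_weights(sentences):
    """Compute a TFIDF weight for each word in each sentence,
    treating every sentence as a 'document'."""
    # Naive tokenisation: lowercase word characters (an assumption).
    tokenised = [re.findall(r"\w+", s.lower()) for s in sentences]
    n = len(tokenised)
    # Document frequency: how many sentences contain each word.
    df = Counter(w for toks in tokenised for w in set(toks))
    weights = []
    for toks in tokenised:
        tf = Counter(toks)
        weights.append({
            # TF (normalised by sentence length) times IDF.
            w: (count / len(toks)) * math.log(n / df[w])
            for w, count in tf.items()
        })
    return weights

sentences = [
    "NLP models summarise documents and extract key sentences",
    "TFIDF weights words by frequency and rarity",
    "Common words like and appear in almost every sentence",
]
weights = tfidf_weights(sentences)
# "and" occurs in all three sentences, so log(3/3) = 0 and its
# TFIDF weight is 0; "nlp" occurs in only one, so its weight is positive.
```

Notice how the frequent-but-uninformative word "and" is zeroed out by the IDF term, while a rare, topical word like "NLP" keeps a positive weight, which is exactly the behaviour described above.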