Traditionally, TextRank is implemented using cosine similarity to construct the similarity matrix. However, Barrios et al. (2016) showed that TextRank with a BM25 similarity function yields the best ROUGE-score results. We can implement TextRank with BM25 as the similarity function using the Gensim library as shown below. Specifically, we use summarize from gensim.summarization.summarizer (note that the gensim.summarization module was removed in Gensim 4.0, so this requires a Gensim version below 4.0).
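To make the BM25 idea concrete, here is a minimal sketch of the standard BM25 scoring formula in plain Python. This is an illustration of the scoring function itself, not Gensim's internal implementation; the function name and the toy corpus are my own.

```python
import math

def bm25_score(query_tokens, doc_tokens, corpus, k1=1.5, b=0.75):
    """Score doc_tokens against query_tokens; corpus is a list of token lists."""
    avgdl = sum(len(d) for d in corpus) / len(corpus)  # average document length
    N = len(corpus)
    score = 0.0
    for term in query_tokens:
        df = sum(1 for d in corpus if term in d)  # document frequency of term
        # Smoothed inverse document frequency, as in the standard BM25 variant
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        tf = doc_tokens.count(term)  # term frequency in the candidate document
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + len(doc_tokens) / avgdl))
    return score

corpus = [["the", "cat", "sat"], ["dogs", "bark"], ["cats", "purr"]]
# A document containing the query term scores higher than one that does not.
matching = bm25_score(["cat"], corpus[0], corpus)
non_matching = bm25_score(["cat"], corpus[1], corpus)
```

In the TextRank setting, each sentence is treated as both "query" and "document" in turn, and the pairwise BM25 scores form the edge weights of the sentence graph that PageRank is run over.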
import pandas as pd
from gensim.summarization import keywords
from gensim.summarization.summarizer import summarize
TextRank using Gensim
def fast_textrank(df):
    text_rank_summary = []
    error_sentences = []
    for index, row in df.iterrows():
        print(index)
        try:
            text_rank_summary.append(summarize(row, word_count=450).replace('\n', ' '))
        except ValueError:
            # summarize() raises ValueError when the input is too short to summarise;
            # keep the original article so the output stays aligned with the input
            error_sentences.append(row)
            text_rank_summary.append(row)
    return text_rank_summary, error_sentences
fast_textrank() iterates over each row of the dataframe and summarises each article with summarize(). df.iterrows() works well when you are dealing with millions of datapoints because it yields one (index, row) pair at a time rather than materialising all rows at once. The summaries are collected in the text_rank_summary list, and any articles that summarize() cannot handle are recorded in error_sentences and passed through unsummarised.
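To see the iteration pattern in isolation, here is a sketch with toy data and a stand-in for summarize() (fake_summarize() below is hypothetical; the real code would call Gensim's summarize() there).

```python
import pandas as pd

# Toy dataframe standing in for the loaded articles
df = pd.DataFrame({"article": ["first article text here", "second article text"]})

def fake_summarize(text):
    # Stand-in for gensim's summarize(); just truncates the text
    return text[:10]

text_rank_summary = []
error_sentences = []
for index, row in df.iterrows():
    # row is a pandas Series; row["article"] retrieves the article string
    try:
        text_rank_summary.append(fake_summarize(row["article"]))
    except ValueError:
        # On failure, fall back to the original article
        error_sentences.append(row["article"])
        text_rank_summary.append(row["article"])
```

Because text_rank_summary always receives exactly one entry per row, it can later be assigned back to the dataframe as a new column.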
TextRank applied to CNN/Daily Mail dataset
# Train set
df_art_train = read_data('train.art.txt')  # read_data() is a helper defined earlier
df_art_train = df_art_train[df_art_train != '\n']
train_textrank_summary, error_sent_train = fast_textrank(df_art_train)
df_art_train['textrank'] = train_textrank_summary
df_art_train.reset_index(inplace=True, drop=True)
Once we have created the fast_textrank() function, we can easily apply it to the common summarisation dataset CNN/Daily Mail, as shown above: load the training articles, drop empty lines, run fast_textrank(), and store the summaries in a new textrank column.
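Since the TextRank pass over the full training set is slow, it is worth persisting the results so they do not have to be recomputed. A minimal sketch with pandas (the dataframe contents and file name here are illustrative, not the real dataset):

```python
import pandas as pd

# Illustrative dataframe standing in for df_art_train after fast_textrank()
df_art_train = pd.DataFrame({"article": ["some article text", "another article"]})
df_art_train["textrank"] = ["summary one", "summary two"]

# Write articles and summaries side by side; reload later without recomputing
df_art_train.to_csv("train_textrank.csv", index=False)
restored = pd.read_csv("train_textrank.csv")
```

The same pattern applies unchanged to the validation and test splits of the dataset.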