Traditionally, TextRank is implemented with cosine similarity to construct the sentence similarity matrix. However, Barrios et al. (2016) showed that TextRank with a BM25 similarity function yields the best ROUGE scores. We can implement this variant using the Gensim library, specifically the summarize function from gensim.summarization.summarizer (note that the gensim.summarization module was removed in Gensim 4.0, so this requires gensim < 4.0).
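To make the BM25 similarity function concrete, here is a minimal sketch of Okapi BM25 scoring in plain Python. The helper name bm25_score and the parameter defaults are illustrative assumptions, not part of Gensim's API; Gensim applies an equivalent scoring internally when building the sentence similarity graph.

```python
import math

def bm25_score(query_tokens, doc_tokens, corpus, k1=1.5, b=0.75):
    """Illustrative Okapi BM25 score of one tokenised document against a query.

    corpus is a list of tokenised documents, used for the IDF and
    average-document-length statistics.
    """
    avgdl = sum(len(d) for d in corpus) / len(corpus)  # average doc length
    N = len(corpus)
    score = 0.0
    for term in query_tokens:
        df = sum(1 for d in corpus if term in d)       # document frequency
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        tf = doc_tokens.count(term)                    # term frequency
        # BM25 term contribution: saturating tf, length-normalised by b
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc_tokens) / avgdl))
    return score
```

In the TextRank setting, each sentence is scored against every other sentence this way, and the resulting matrix is used as the weighted graph for PageRank.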

Dependencies

import pandas as pd
from gensim.summarization import keywords
from gensim.summarization.summarizer import summarize

TextRank using Gensim

def fast_textrank(df):
    text_rank_summary = []
    error_sentences = []
    for index, row in df.iterrows():
        print(index)  # simple progress indicator
        try:
            summary = summarize(row[0], word_count=450).replace('\n', ' ')
            text_rank_summary.append(summary)
        except ValueError:
            # summarize() raises ValueError for inputs that are too short
            # to summarise; fall back to the original article
            error_sentences.append(row[0])
            text_rank_summary.append(row[0])
    return text_rank_summary, error_sentences

fast_textrank() iterates through each row of the dataframe and summarises the articles using summarize(). df.iterrows() is useful when you are dealing with millions of datapoints because it yields one (index, Series) pair at a time rather than materialising a second row-wise copy of the data (the dataframe itself must still fit in memory). Summaries are collected in the text_rank_summary list, and any article that summarize() cannot handle is kept unchanged and recorded in error_sentences.
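The lazy behaviour of df.iterrows() can be seen directly: it returns a generator, so rows are produced only as they are consumed. The two-row dataframe below is a made-up example for illustration.

```python
import pandas as pd

# A tiny stand-in for the articles dataframe, with text in column 0
df = pd.DataFrame({0: ["first article text", "second article text"]})

rows = df.iterrows()      # nothing is produced yet; rows is a generator
index, row = next(rows)   # rows are pulled lazily, one at a time
print(index, row[0])      # → 0 first article text
```

Each yielded row is a pandas Series, so row[0] looks up the value under column label 0, matching the access pattern in fast_textrank().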

TextRank applied to CNN/Daily Mail dataset

# Train set
df_art_train = read_data('train.art.txt')             # load articles into a dataframe
df_art_train = df_art_train[df_art_train[0] != '\n']  # drop empty lines
train_textrank_summary, error_sent_train = fast_textrank(df_art_train)
df_art_train['textrank'] = train_textrank_summary
df_art_train.reset_index(inplace=True, drop=True)

Once we have created the fast_textrank() function, we can easily apply it to the CNN/Daily Mail dataset, a common summarisation benchmark, as shown above.

Ryan

Data Scientist