arXiv Simple Summarisation using TextRank

1. Import dependencies & Read data files

In [1]:
import json
import numpy as np
import pandas as pd
# gensim.summarization was removed in gensim 4.0, so this notebook needs gensim < 4.0
from gensim.summarization.summarizer import summarize
In [3]:
data_folder = './arxiv-dataset/'
In [4]:
test_data = []
with open(data_folder + 'test.txt') as testData:
    for line in testData:
        test_data.append(json.loads(line.rstrip('\n')))
In [5]:
test_df = pd.DataFrame(test_data)
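The loading loop above can also be wrapped in a small helper. This is only a sketch: `read_jsonl` is a name I made up for illustration, and it assumes the file holds one JSON object per line (as the loop above implies), skipping any blank lines.

```python
import json


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate empty lines, e.g. a trailing newline
                records.append(json.loads(line))
    return records


# Hypothetical usage with the folder defined earlier:
# test_data = read_jsonl(data_folder + 'test.txt')
```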
In [6]:
test_df.head()
Out[6]:
article_id article_text abstract_text labels section_names sections
0 1009.3123 [for about 20 years the problem of properties … [<S> the short – term periodicities of the dai… None [introduction, methods of periodicity analysis… [[for about 20 years the problem of properties…
1 1512.09139 [it is believed that the direct detection of g… [<S> we study the detectability of circular po… None [introduction, stokes parameters for plane gra… [[it is believed that the direct detection of …
2 0909.1602 [as a common quantum phenomenon , the tunnelin… [<S> starting from the wkb approximation , a n… None [[sec:intro]introduction, [sec:formalism]forma… [[as a common quantum phenomenon , the tunneli…
3 1512.03812 [for the hybrid monte carlo algorithm ( hmc)@x… [<S> we study a novel class of numerical integ… None [introduction, geometric integrators for hmc, … [[for the hybrid monte carlo algorithm ( hmc)@…
4 1512.09024 [recently it was discovered that feynman integ… [<S> new methods for obtaining functional equa… None [introduction, deriving functional equations f… [[recently it was discovered that feynman inte…

2. Join the sentence lists of article text and abstract text into single strings

In [ ]:
test_df['article_text'] = test_df['article_text'].apply(lambda x: " ".join(x))
In [34]:
test_df['abstract_text'] = test_df['abstract_text'].apply(lambda x: " ".join(x))
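To make the effect of the two joins concrete, here is a toy illustration. The sentences are made up, and the `<S> ... </S>` wrapping around each abstract sentence is an assumption based on the tokens that get stripped later in this notebook.

```python
# Toy rows in the assumed list-of-sentences layout (made-up text, not from the dataset)
article_sentences = [
    "first sentence of the paper .",
    "second sentence of the paper .",
]
abstract_sentences = [
    "<S> first abstract sentence . </S>",
    "<S> second abstract sentence . </S>",
]

# Same operation as the lambda in the cells above: one string per document
article_text = " ".join(article_sentences)
abstract_text = " ".join(abstract_sentences)

print(article_text)
print(abstract_text)  # the sentence tags are still present after joining
```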

3. TextRank summarisation using Gensim

In [45]:
# Extract the top-ranked sentences, capping each summary at about 1000 words
test_df['textrank_summary'] = test_df['article_text'].apply(lambda x: summarize(x, word_count=1000))
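For anyone curious what TextRank does under the hood, below is a minimal pure-Python sketch of the idea: build a graph of sentences weighted by word overlap, run a PageRank-style iteration, and keep the top-scoring sentences in their original order. This is not gensim's implementation (gensim uses its own preprocessing and similarity function); the helper names, the naive sentence splitter, and the toy text are all my own assumptions.

```python
import math
import re


def split_sentences(text):
    # Naive splitter: break after sentence-ending punctuation followed by whitespace
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(a, b):
    """Shared-word count, normalised by the log sizes of the two word sets."""
    words_a = set(re.findall(r"[a-z0-9]+", a.lower()))
    words_b = set(re.findall(r"[a-z0-9]+", b.lower()))
    if not words_a or not words_b:
        return 0.0
    denom = math.log(len(words_a)) + math.log(len(words_b))
    return len(words_a & words_b) / denom if denom > 0 else 0.0


def textrank(sentences, damping=0.85, iterations=50):
    """Weighted PageRank over the sentence-similarity graph."""
    n = len(sentences)
    weights = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarise(text, num_sentences=2):
    sentences = split_sentences(text)
    if len(sentences) <= num_sentences:
        return " ".join(sentences)
    scores = textrank(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:num_sentences]
    return " ".join(sentences[i] for i in sorted(top))  # restore document order


toy_text = ("the cat sat on the mat . the cat chased the mouse on the mat . "
            "a dog barked loudly . the mouse ran from the cat .")
print(summarise(toy_text, num_sentences=2))
```

Sentences that share vocabulary with many other sentences accumulate score, while an isolated sentence (the dog one in the toy text) stays at the baseline and gets dropped.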
In [46]:
test_df.head()
Out[46]:
article_id article_text abstract_text labels section_names sections textrank_summary
0 1009.3123 for about 20 years the problem of properties o… <S> the short – term periodicities of the dail… None [introduction, methods of periodicity analysis… [[for about 20 years the problem of properties… they applied the same power spectrum method as…
1 1512.09139 it is believed that the direct detection of gr… <S> we study the detectability of circular pol… None [introduction, stokes parameters for plane gra… [[it is believed that the direct detection of … the main target of ptas is the stochastic grav…
2 0909.1602 as a common quantum phenomenon , the tunneling… <S> starting from the wkb approximation , a ne… None [[sec:intro]introduction, [sec:formalism]forma… [[as a common quantum phenomenon , the tunneli… for most of the potential barriers , the penet…
3 1512.03812 for the hybrid monte carlo algorithm ( hmc)@xc… <S> we study a novel class of numerical integr… None [introduction, geometric integrators for hmc, … [[for the hybrid monte carlo algorithm ( hmc)@… for the hybrid monte carlo algorithm ( hmc)@xc…
4 1512.09024 recently it was discovered that feynman integr… <S> new methods for obtaining functional equat… None [introduction, deriving functional equations f… [[recently it was discovered that feynman inte… these methods are based on algebraic relations…
In [53]:
test_df.to_csv('arxiv_with_textrank.csv', index = False)

4. Compute ROUGE scores

  • Remember to remove the special <S> and </S> sentence tokens from the abstracts before computing ROUGE scores
In [9]:
from rouge_score import rouge_scorer
In [10]:
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer = True)
In [21]:
def remove_pad(text):
    tmp = text.replace('<S>', '')
    tmp = tmp.replace('</S>', '')
    return tmp.strip()
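As a variation on the function above, here is a regex-based sketch that also collapses the extra spaces left behind where the tags used to be. The name `remove_sentence_tags` and the sample abstract are made up for illustration; the ROUGE scorer tokenises on its own, so the extra whitespace should not affect the scores either way.

```python
import re


def remove_sentence_tags(text):
    """Strip <S> and </S> markers and normalise whitespace."""
    without_tags = re.sub(r"</?S>", " ", text)
    return " ".join(without_tags.split())


sample = "<S> we study x . </S> <S> results follow . </S>"
print(remove_sentence_tags(sample))
```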
In [31]:
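rouge_scores = []

for i in range(len(test_df)):
    # Reference abstract with the <S> / </S> tokens stripped
    processed_ground_truth = remove_pad(test_df['abstract_text'][i])
    # scorer.score takes the reference (target) first, then the prediction
    rouge_score = scorer.score(processed_ground_truth, test_df['textrank_summary'][i])
    rouge_scores.append(rouge_score)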
In [36]:
len(rouge_scores)
Out[36]:
6465
In [38]:
rouge_1_f1 = [score['rouge1'].fmeasure for score in rouge_scores]
rouge_2_f1 = [score['rouge2'].fmeasure for score in rouge_scores]
rouge_L_f1 = [score['rougeL'].fmeasure for score in rouge_scores]
In [52]:
print(np.mean(rouge_1_f1))
print(np.mean(rouge_2_f1))
print(np.mean(rouge_L_f1))
0.21853840564436197
0.10080142921450896
0.12492352437342309
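To give a feel for what the F1 numbers above measure, here is a hand-rolled ROUGE-1 sketch based on unigram overlap. It is a simplification and not the `rouge_score` implementation (no stemming, plain whitespace tokenisation), and the sentences are toy examples. It also shows how a candidate that is much longer than the reference loses precision, which may be one reason a summary capped at about 1000 words scores low against a short abstract.

```python
from collections import Counter


def rouge1_scores(reference, candidate):
    """Return (precision, recall, f1) from clipped unigram overlap."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum((ref_counts & cand_counts).values())  # clipped matches
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


reference = "the cat sat on the mat"
short_candidate = "the cat lay on the mat"
long_candidate = short_candidate + " and then it slept for the rest of the afternoon"

for name, candidate in [("short", short_candidate), ("long", long_candidate)]:
    p, r, f1 = rouge1_scores(reference, candidate)
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```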

The current SOTA result on the arXiv dataset is held by Pegasus, which reports:

arXiv (ROUGE-1 / ROUGE-2 / ROUGE-L, on a 0-100 scale): 44.70 / 17.27 / 25.80
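Since the Pegasus numbers are percentages while `rouge_score` returns fractions between 0 and 1, a quick side-by-side helps. The snippet below just rescales the mean F1 values printed earlier; the variable names are my own.

```python
# Mean F1 values printed above (0-1 scale)
textrank_f1 = {
    "rouge1": 0.21853840564436197,
    "rouge2": 0.10080142921450896,
    "rougeL": 0.12492352437342309,
}
# Pegasus figures quoted above (0-100 scale)
pegasus = {"rouge1": 44.70, "rouge2": 17.27, "rougeL": 25.80}

# Put both on the same 0-100 scale before comparing
textrank_percent = {metric: value * 100 for metric, value in textrank_f1.items()}

for metric, ours in textrank_percent.items():
    gap = pegasus[metric] - ours
    print(f"{metric}: TextRank {ours:.2f} vs Pegasus {pegasus[metric]:.2f} (gap {gap:.2f})")
```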

We are far from the SOTA results, lol, but at least we got to read in the arXiv dataset, which I have always wanted to explore!

Ryan

Data Scientist
