News Articles Ingestion Pipeline

Over the last few days, we have worked on fake news classification, topic modelling, sentiment analysis at both the document and sentence level, and, lastly, entity extraction. Today’s blog post connects these techniques into a news article ingestion pipeline: we feed in the latest news articles and get back results such as the summary, the topics, and the key entities of each article. The whole pipeline is shown in the figure below. Note that we have used baseline techniques for all of the NLP tasks. The key is to build a working application first and iterate on it over time!

Article Ingestion

The function article_ingestion(doc) showcases the full pipeline: it takes an unseen news article and outputs a range of features, such as:

  • Fake news classification
  • Duplicate content checks
  • Summarisation
  • Topic modelling (document level)
  • Sentiment analysis (document level)
  • Topic modelling and sentiment analysis (sentence level)
  • Entity extraction

Import dependencies

In [ ]:
import pandas as pd
from helper_functions import *

Take another 10,000 articles (rows 10,000 to 19,999) from our dataset

In [ ]:
data = pd.read_csv('processed_news_data.csv')
In [ ]:
new_data_df = data[10000:20000]
In [ ]:
new_data_df
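Note that data[10000:20000] is a positional slice rather than a random sample, and the slice keeps the original index labels. That is why the article picked later with .iloc[50] shows up as Name: 10050. A minimal sketch with a toy frame (the column name and sizes here are illustrative only, not the real dataset):

```python
import pandas as pd

# Toy stand-in for the real dataset: 20 rows with the default 0..19 index.
toy = pd.DataFrame({"text": [f"article {i}" for i in range(20)]})

# Slicing with [] and integers selects rows by position, like data[10000:20000].
subset = toy[10:20]

# .iloc is positional within the subset, but the row keeps its original label.
row = subset.iloc[5]
print(row.name)
print(row["text"])

# If a random sample is wanted instead, pandas offers DataFrame.sample.
random_subset = toy.sample(n=10, random_state=0)
```

The same label-versus-position distinction matters when joining pipeline results back onto the original data.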

Our article ingestion pipeline

In [ ]:
def article_ingestion(doc):
    results = {}
    
    # Fake news detection
    combined_text = doc['title'] + ' ' + doc['text']
    results['is_fake_news'] = is_fake_news(combined_text)[0]
    
    # Duplicate content checks
    similarities = get_similar_doc(doc['text'])
    results['is_duplicate_content'] = is_duplicate_content(similarities, 1)
    
    # Summarisation
    results['lead3_summary'] = doc['lead3_summary']
    
    # Topic modelling - document level
    dominant_topic_breakdown = topic_modelling_document(doc['id'], doc['title'], doc['text'])
    
    # Sentiment analysis - document level
    results['sentiment'] = sentiment_score_compound(doc['text'])
    
    # Topic modelling and Sentiment analysis - sentence level
    sentence_data = sentence_level_topic_modelling(doc['id'], doc['title'], doc['text'])
    
    # Entity extraction (nlp comes from helper_functions; keep the top 10 key entities)
    post_spacy_doc = nlp(doc['text'])
    results['key_entities'], results['entities'] = key_entities_extraction_pipeline(post_spacy_doc, 10)
    
    return results, dominant_topic_breakdown, sentence_data
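get_similar_doc and is_duplicate_content live in helper_functions and are not shown in this post. As a rough illustration of the idea only, the sketch below scores a new text against a small corpus with bag-of-words cosine similarity and flags a duplicate when any score reaches a threshold. The names tokenize, cosine_similarity, similar_docs and looks_like_duplicate, the tokeniser itself, and the reading of the second argument as a similarity threshold are all assumptions, not the helpers' actual implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokeniser: lowercase alphanumeric word tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_docs(text, corpus):
    # Return (corpus_index, similarity) pairs, most similar first.
    query = Counter(tokenize(text))
    scores = [(i, cosine_similarity(query, Counter(tokenize(doc))))
              for i, doc in enumerate(corpus)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def looks_like_duplicate(similarities, threshold=1.0, tol=1e-9):
    # Assumed meaning of the threshold: flag when any score reaches it.
    # The tolerance absorbs floating-point rounding for identical texts.
    return any(score >= threshold - tol for _, score in similarities)


corpus = ["Coal company schedules earnings call",
          "Central bank holds interest rates"]
print(looks_like_duplicate(similar_docs("Coal company schedules earnings call", corpus)))
print(looks_like_duplicate(similar_docs("Tech firm launches new phone", corpus)))
```

With a threshold of 1, only texts whose word counts match an existing article exactly are flagged; a lower threshold would also catch near-duplicates at the cost of more false positives.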
In [20]:
doc = new_data_df.iloc[50]; doc
Out[20]:
id                        932b8ce689301ea4c898f2354ed6093170ab3f4a
title            CONSOL Coal Resources Schedules Fourth Quarter...
url              http://www.cnbc.com/2018/01/08/pr-newswire-con...
social           {'gplus': {'shares': 0}, 'pinterest': {'shares...
text             CANONSBURG, Pa., Jan. 8, 2018 /PRNewswire/ -- ...
spam_score                                                   0.107
published                                2018-01-09 01:00:00+02:00
persons                                                         []
locations                                                       []
organizations                                                   []
lead3_summary    CANONSBURG, Pa., Jan. 8, 2018 /PRNewswire/ -- ...
year                                                          2018
month                                                            1
day                                                              9
num_words        ['CANONSBURG', ',', 'Pa.', ',', 'Jan.', '8', '...
Name: 10050, dtype: object
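The lead3_summary column above is precomputed, and the pipeline simply copies it into the results. Lead-3 is a common baseline that takes the first three sentences of an article as its summary. A minimal sketch follows, assuming a naive regex sentence splitter; lead_n_summary is a hypothetical name, and a proper sentence segmenter would be needed to cope with abbreviations such as "a.m." in the article above.

```python
import re


def lead_n_summary(text, n=3):
    # Naive split: break after ., ! or ? when followed by whitespace.
    # This mis-splits abbreviations; it is only a baseline sketch.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:n])


article = ("The company will report earnings on Tuesday. "
           "A conference call will follow. "
           "The webcast will be archived online. "
           "Contact details are listed below.")
print(lead_n_summary(article))
```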
In [21]:
results, dominant_topic_breakdown, sentence_data = article_ingestion(doc)
In [22]:
results
Out[22]:
{'is_fake_news': 1,
 'is_duplicate_content': False,
 'lead3_summary': "CANONSBURG, Pa., Jan. 8, 2018 /PRNewswire/ -- CONSOL Coal Resources LP (NYSE: CCR) will issue its fourth quarter earnings release before the market opens on Tuesday, February 6, 2018. This will be followed by a conference call hosted by members of the management team at 11:00 a.m. Eastern Time. The webcast will be accessible on the 'Investor Relations' page of the company's website, www.ccrlp.com .",
 'sentiment': 0.9545,
 'key_entities': {'CONSOL': ['ORG', 10],
  'CONSOL Coal Resources': ['ORG', 5],
  'CCR': ['ORG', 3],
  'CONSOL Energy Inc': ['ORG', 3],
  'Pennsylvania': ['GPE', 3],
  'NYSE': ['ORG', 3],
  'CONSOL Coal Resources LP': ['ORG', 2],
  'CEIX': ['ORG', 2],
  'CANONSBURG': ['GPE', 1],
  'Zach Smith': ['PERSON', 1]},
 'entities': [('CANONSBURG', 0, 10, 'GPE'),
  ('Pa.', 12, 15, 'GPE'),
  ('Jan. 8, 2018', 17, 29, 'DATE'),
  ('CONSOL Coal Resources LP', 46, 70, 'ORG'),
  ('NYSE', 72, 76, 'ORG'),
  ('CCR', 78, 81, 'ORG'),
  ('fourth quarter', 98, 112, 'DATE'),
  ('Tuesday, February 6, 2018', 157, 182, 'DATE'),
  ('11:00 a.m. Eastern Time', 271, 294, 'TIME'),
  ("the 'Investor Relations'", 330, 354, 'ORG'),
  ('at least 30 days', 450, 466, 'DATE'),
  ('Beginning first quarter of 2018', 484, 515, 'DATE'),
  ('CONSOL Energy Inc.', 589, 607, 'ORG'),
  ('NYSE', 609, 613, 'ORG'),
  ('CEIX', 615, 619, 'ORG'),
  ('75%', 647, 650, 'PERCENT'),
  ('the Pennsylvania Mining Complex', 663, 694, 'ORG'),
  ('1', 728, 729, 'CARDINAL'),
  ('1-412-902-4112', 777, 791, 'CARDINAL'),
  ('CONSOL Coal Resources', 838, 859, 'ORG'),
  ('CONSOL Coal Resources', 886, 907, 'ORG'),
  ('CONSOL Energy Inc.', 966, 984, 'ORG'),
  ('NYSE', 986, 990, 'ORG'),
  ('CEIX', 992, 996, 'ORG'),
  ('CONSOL', 1035, 1041, 'ORG'),
  ('Pennsylvania', 1070, 1082, 'GPE'),
  ('25%', 1105, 1108, 'PERCENT'),
  ('CONSOL', 1162, 1168, 'ORG'),
  ('Pennsylvania', 1171, 1183, 'GPE'),
  ('three', 1218, 1223, 'CARDINAL'),
  ('Information', 1275, 1286, 'ORG'),
  ('Mitesh Thakkar', 1349, 1363, 'ORG'),
  ('724', 1369, 1372, 'CARDINAL'),
  ('Zach Smith', 1414, 1424, 'PERSON'),
  ('724', 1430, 1433, 'CARDINAL'),
  ('485-4017', 1435, 1443, 'CARDINAL'),
  ('CONSOL Coal Resources LP', 1657, 1681, 'ORG'),
  ('CONSOL Energy Inc.', 1686, 1704, 'ORG')]}
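key_entities_extraction_pipeline is defined in helper_functions and is not shown here. One simple way to rank entities, sketched below, is to count how often each (text, label) pair occurs in the list of (text, start, end, label) tuples and keep the top n. This is an assumption about the general approach rather than the helper's actual logic: the counts in the output above (for example 10 for 'CONSOL') are higher than exact-match counting would give, so the real helper evidently does more, possibly counting mentions inside longer entity names. top_entities and its skip_labels parameter are hypothetical.

```python
from collections import Counter


def top_entities(entities, n=10,
                 skip_labels=("DATE", "TIME", "CARDINAL", "PERCENT")):
    # entities: iterable of (text, start_char, end_char, label) tuples.
    # Numeric and date-like labels are skipped here as an assumption.
    counts = Counter((text, label) for text, _, _, label in entities
                     if label not in skip_labels)
    # Same shape as the output above: {entity_text: [label, count]}.
    return {text: [label, count]
            for (text, label), count in counts.most_common(n)}


# A few tuples copied from the 'entities' output above.
sample = [
    ("CANONSBURG", 0, 10, "GPE"),
    ("Jan. 8, 2018", 17, 29, "DATE"),
    ("NYSE", 72, 76, "ORG"),
    ("NYSE", 609, 613, "ORG"),
    ("NYSE", 986, 990, "ORG"),
    ("Zach Smith", 1414, 1424, "PERSON"),
]
print(top_entities(sample, n=2))
```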
In [23]:
dominant_topic_breakdown
Out[23]:
  | Dominant_Topic | Topic_Perc_Contrib | Keywords | text | doc_id
0 | 2 | 0.612878 | conference, company, release, february, financ… | CANONSBURG, Pa., Jan. 8, 2018 /PRNewswire/ -- … | 932b8ce689301ea4c898f2354ed6093170ab3f4a
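The dominant topic of a document is the topic with the largest share in that document's topic distribution. A minimal sketch, assuming the topic model returns a list of (topic_id, probability) pairs for the document; the function name dominant_topic and every probability other than the 0.612878 reported above are made up for illustration.

```python
def dominant_topic(topic_distribution):
    # topic_distribution: list of (topic_id, probability) pairs.
    topic_id, prob = max(topic_distribution, key=lambda pair: pair[1])
    return topic_id, prob


# Illustrative distribution; only topic 2's share comes from the output above.
distribution = [(2, 0.612878), (16, 0.25), (26, 0.137122)]
print(dominant_topic(distribution))
```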
In [24]:
sentence_data
Out[24]:
id: 932b8ce689301ea4c898f2354ed6093170ab3f4a
title: CONSOL Coal Resources Schedules Fourth Quarter 2017 Earnings Release and Conference Call

sentences | Sentiment | Topics | Topics_Confidence_Level
CANONSBURG, Pa., Jan. 8, 2018 /PRNewswire/ -- CONSOL Coal Resources LP (NYSE: CCR) will issue its fourth quarter earnings release before the market opens on Tuesday, February 6, 2018. | 0 | 2 | 84.06%
This will be followed by a conference call hosted by members of the management team at 11:00 a.m. Eastern Time. | 0 | 2 | 86.19%
The webcast will be accessible on the 'Investor Relations' page of the company's website, www.ccrlp.com . | 0 | 2 | 86.19%
An archive of the webcast will be available for at least 30 days after the event. | 0 | 2 | 80.67%
Beginning first quarter of 2018, we expect to hold a combined earnings conference call with our sponsor, CONSOL Energy Inc. (NYSE: CEIX), which owns the remaining 75% interest in the Pennsylvania Mining Complex. | 0.6249 | 2 | 47.74%
Participant dial in (toll free) 1-855-656-0928\nParticipant international dial in 1-412-902-4112\nParticipants should ask to be joined into the CONSOL Coal Resources earnings conference call. | 0.5106 | 2 | 49.17%
CONSOL Coal Resources is a growth-oriented master limited partnership formed by CONSOL Energy Inc. (NYSE: CEIX) to manage and further develop all of CONSOL's active coal operations in Pennsylvania. | 0.4404 | 16 | 70.56%
Its assets include a 25% undivided interest in, and operational control over, CONSOL's Pennsylvania mining complex, which consists of three underground mines and related infrastructure. | 0.5719 | 26 | 28.72%
More Information is available on our website www.ccrlp.com\nContacts:\nInvestor: Mitesh Thakkar, at (724) 485-3133\nmiteshthakkar@ccrlp.com\nMedia: Zach Smith, at (724) 485-4017\nzacherysmith@ccrlp.com\nView original content with multimedia: http://www.prnewswire.com/news-releases/consol-coal-resources-schedules-fourth-quarter-2017-earnings-release-and-conference-call-300579284.html\nSOURCE CONSOL Coal Resources LP and CONSOL Energy Inc. | 0.5267 | 2 | 76.83%
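The sentence-level output lends itself to simple roll-ups, such as how many sentences fall under each topic and the average sentence sentiment. A minimal sketch using the Sentiment and Topics values from the output above, rebuilt by hand into a flat DataFrame (the real sentence_data carries an id/title/sentences index, which is omitted here as a simplification):

```python
import pandas as pd

# Values copied from the sentence_data output above, one row per sentence.
sentence_df = pd.DataFrame({
    "Sentiment": [0, 0, 0, 0, 0.6249, 0.5106, 0.4404, 0.5719, 0.5267],
    "Topics": [2, 2, 2, 2, 2, 2, 16, 26, 2],
})

# Number of sentences assigned to each topic, most frequent first.
topic_counts = sentence_df["Topics"].value_counts()
print(topic_counts)

# Mean sentiment per topic and across all sentences.
sentiment_by_topic = sentence_df.groupby("Topics")["Sentiment"].mean()
overall_sentiment = sentence_df["Sentiment"].mean()
print(sentiment_by_topic)
print(overall_sentiment)
```

Here seven of the nine sentences fall under topic 2, which agrees with topic 2 being the dominant topic at the document level.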
Ryan

Data Scientist
