Today’s post is the first of its kind. In our last post, we constructed our news article ingestion pipeline, which takes in the “latest” news articles and outputs the different NLP results we are interested in. Those results are processed and saved into three different databases (CSV files in our case). We have three databases:

  1. Database – contains all the results of our features

  2. Sentence-level data – contains sentence level topics and sentiment of each document in our database

  3. Entities-doc mapping – contains mappings between entities and document ids
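To make the three databases concrete, here is a minimal sketch of what each one might contain. The column names (`id`, `published_date`, `key_entity`, etc.) are taken from the code later in this post; the sample rows are invented for illustration.

```python
import pandas as pd

# 1. Main database: one row per article, holding the NLP feature results
database = pd.DataFrame({
    'id': [1, 2],
    'published_date': ['2018-01-03', '2018-01-06'],
    'title': ['Acme acquires Foo', 'Acme quarterly results'],
})

# 2. Sentence-level data: one row per sentence, keyed by document id
sentence_level_topic = pd.DataFrame({
    'id': [1, 1, 2],
    'sentence': ['Acme acquired Foo.', 'The deal closed Monday.', 'Profits rose.'],
    'topic': ['M&A', 'M&A', 'earnings'],
    'sentiment': [0.4, 0.1, 0.6],
})

# 3. Entity-doc mapping: which documents mention which entity
entity_doc_mappings = pd.DataFrame({
    'key_entity': ['Acme', 'Acme', 'Foo'],
    'id': [1, 2, 1],
})
```

A document id appearing under several entities (like id 1 above) is expected: one article can mention multiple people or companies.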

Our goal is to build a simple web application that allows users to select the people / companies (entities) that they would like to get the latest news about. We will build our web application using React for the frontend and Flask for the backend. Today’s post covers the data flow as well as the Flask backend API logic.

Data flow between Frontend and Backend

The figure below showcases the data communication between our frontend and backend. The user begins by entering / choosing the entity (person / company) that they would like to track. Our frontend then takes the chosen entity and queries the backend. We use the selected entity to retrieve all the relevant document ids from the entity-doc-mappings database, then use those document ids to retrieve the relevant documents from the main database, filtering out fake news, duplicate content, and outdated articles. Lastly, we use the latest document ids to retrieve the sentence-level data from the sentence-level-topic-modelling database. The relevant documents and sentence-level results are passed back to the frontend to display to our users.
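The lookup chain described above (entity → document ids → recent documents) can be sketched as a plain pandas function, independent of Flask. The column names are the ones used in the API code below; the helper name `latest_docs_for` is our own.

```python
import datetime
import pandas as pd

def latest_docs_for(entity, entity_doc_mappings, database, today, days=7):
    """Follow the data flow: entity -> document ids -> recent documents."""
    # Step 1: look up every document id mapped to the chosen entity
    doc_ids = set(entity_doc_mappings.loc[
        entity_doc_mappings['key_entity'] == entity, 'id'])
    # Step 2: pull those documents from the main database
    docs = database[database['id'].isin(doc_ids)]
    # Step 3: keep only articles published within the last `days` days
    cutoff = today - datetime.timedelta(days=days)
    return docs[(docs['published_date'] > cutoff) &
                (docs['published_date'] <= today)]
```

The same two filters (entity membership, then a date window) appear in the Flask handler below; pulling them into one function just makes the flow easier to test.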

The Flask Backend

Here, we will create two APIs that do the following:

  1. Retrieve the relevant documents based on the user’s chosen entity
  2. Retrieve sentence-level data based on the document id
Import Dependencies
import json
import datetime
import pandas as pd

from flask import Flask
The Flask API
app = Flask(__name__)

# A route parameter lets us read the entity name straight from the URL
# (the path below is illustrative; match it to whatever your frontend calls)
@app.route('/entity/<entity>')
def track_entities(entity):

    doc_ids = list(set(entity_doc_mappings['id'][entity_doc_mappings['key_entity'] == entity]))
    if len(doc_ids) == 0:
        return json.dumps({'entity_name': "Entity not found!"})

    # Retrieve relevant news articles that are within a week of today
    entity_data = database[database['id'].isin(doc_ids)]
    entity_data = entity_data[entity_data['published_date'] <= today]
    entity_data = entity_data[entity_data['published_date'] > today_minus_7days]

    # Serialise the filtered articles; the frontend extracts the document
    # ids from this payload to request sentence-level data per document
    entity_data_json = entity_data.to_json(orient='records')

    return json.dumps({'entity_name': entity, 'data': entity_data_json})

@app.route('/document/<int:doc_id>')  # path is illustrative; adjust as needed
def get_document(doc_id):

    # Retrieve sentence level data
    entity_data_sentence_level = sentence_level_topic[sentence_level_topic['id'].isin([doc_id])]
    entity_data_sentence_level_json = entity_data_sentence_level.to_json(orient='records')

    return json.dumps({'doc_id': doc_id, 'sentence_data': entity_data_sentence_level_json})
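Before wiring up the React frontend, Flask’s built-in test client is a handy way to sanity-check the JSON responses without starting a server. Below is a minimal, self-contained sketch — the route path and the stub payload are stand-ins for the real handler above.

```python
import json
from flask import Flask

app = Flask(__name__)

@app.route('/entity/<entity>')
def track_entities(entity):
    # Stub payload; the real handler returns the filtered articles
    return json.dumps({'entity_name': entity, 'data': '[]'})

# The test client issues requests directly against the app, no server needed
client = app.test_client()
response = client.get('/entity/Acme')
payload = json.loads(response.get_data(as_text=True))
print(payload['entity_name'])  # -> Acme
```

The same pattern works for the sentence-level endpoint: hit the route with a known document id and inspect the decoded JSON.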
The Main Function – To load all the databases (csv files) necessary to run the APIs
if __name__ == '__main__':
    database = pd.read_csv('./database/database.csv')
    database['published_date'] = database['published_date'].apply(lambda x: datetime.datetime.strptime(x, "%Y-%m-%d"))
    sentence_level_topic = pd.read_csv('./database/sentence-level-topic-modelling.csv')
    entity_doc_mappings = pd.read_csv('./database/entity-doc-mappings.csv')

    # Hard-coded "today" so the demo lines up with our sample data
    today = datetime.datetime.strptime('2018-01-07', "%Y-%m-%d")
    today_minus_7days = today - datetime.timedelta(days=7)

    app.run(debug=True)

