Named Entity Recognition and Coreference Resolution using spaCy

The goal here is to extract entities (persons, organisations, events, etc.) from financial news articles. We also process the text to identify the coreference clusters in each article and resolve them.

Finally, we link the extracted entities to real-world entities in the DBpedia knowledge base. Specifically, we want to a) link each extracted entity to a real-world entity and b) identify which extracted entities are key entities, using the “support” value as a proxy. The “support” value represents how prominent the entity is, i.e. the number of inlinks its page has in Wikipedia. We also use the coreference clusters to count how often an entity appears within a document, as an additional measure of entity importance.

The pipeline is as follows:

  1. Feed news articles
  2. Extract entities – persons, organisations, countries, events
  3. Extract coreference clusters
  4. Resolve coreference clusters (replace all the mentions within a cluster with the main entity)
  5. Count the occurrence of the extracted entities
  6. Entity extraction and linking using DBpedia
  7. Identify key entities using the occurrence count and “support” value from DBpedia

The main output of this pipeline is a mapping from news articles to their key entities.
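Step 6 relies on an entity-linking service; the notebook below focuses on the spaCy side, but a call to DBpedia Spotlight's public annotate endpoint can be sketched as follows. The endpoint URL and response fields follow Spotlight's REST API; the sample payload is a trimmed, made-up illustration of the JSON shape, and the support value in it is invented.

```python
def link_entities(text, confidence=0.5):
    """Query DBpedia Spotlight and return (surface form, URI, support) triples."""
    import requests  # third-party: pip install requests
    resp = requests.get(
        "https://api.dbpedia-spotlight.org/en/annotate",
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
    )
    resp.raise_for_status()
    return parse_spotlight_response(resp.json())

def parse_spotlight_response(payload):
    # "Resources" is absent when Spotlight finds no entities
    return [(r["@surfaceForm"], r["@URI"], int(r["@support"]))
            for r in payload.get("Resources", [])]

# Trimmed illustration of the JSON Spotlight returns (support value made up):
sample = {"Resources": [{"@surfaceForm": "Germany",
                         "@URI": "http://dbpedia.org/resource/Germany",
                         "@support": "500000"}]}
parse_spotlight_response(sample)  # [('Germany', 'http://dbpedia.org/resource/Germany', 500000)]
```

Higher “support” means more Wikipedia inlinks, which is the proxy for prominence used later in the pipeline.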

Import Dependencies + Read Data

In [14]:
import json
import requests
import pandas as pd

import spacy
import neuralcoref
from spacy import displacy
In [2]:
df = pd.read_csv('news_data.csv')

Load and create the spaCy pipeline

We add the neural coreference component (neuralcoref) to the spaCy pipeline; it registers the doc._.coref_clusters and doc._.coref_resolved extensions used below.

In [3]:
nlp = spacy.load("en_core_web_sm")
neuralcoref.add_to_pipe(nlp)
Out[3]:
<spacy.lang.en.English at 0x1259f2f98>
In [4]:
nlp.pipeline
Out[4]:
[('tagger', <spacy.pipeline.pipes.Tagger at 0x146a742e8>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x124746948>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x1247469a8>),
 ('neuralcoref', <neuralcoref.neuralcoref.NeuralCoref at 0x123e865a8>)]

Parse 10K articles through the spaCy pipeline

In [5]:
docs = list(nlp.pipe(df['text'][:10000]))

Two functions to extract entities and coreference clusters

In [60]:
def extract_entities(doc):
    persons = []
    organisations = []
    countries = []
    events = []
    all_entities = []

    for ent in doc.ents:
        if ent.label_ == 'PERSON':
            persons.append((ent.text, ent.start_char, ent.end_char, ent.label_))
        elif ent.label_ == 'ORG':
            organisations.append((ent.text, ent.start_char, ent.end_char, ent.label_))
        elif ent.label_ == 'GPE':
            countries.append((ent.text, ent.start_char, ent.end_char, ent.label_))
        elif ent.label_ == 'EVENT':
            events.append((ent.text, ent.start_char, ent.end_char, ent.label_))
        all_entities.append((ent.text, ent.start_char, ent.end_char, ent.label_))
    
    return persons, organisations, countries, events, all_entities
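As a usage note: the four per-label lists in extract_entities repeat the same append pattern, and the grouping can be expressed more compactly with a defaultdict keyed on the label. A sketch on made-up entity tuples in the same (text, start_char, end_char, label) shape:

```python
from collections import defaultdict

def group_entities(ents):
    """Group (text, start_char, end_char, label) tuples by their label."""
    groups = defaultdict(list)
    for ent in ents:
        groups[ent[3]].append(ent)
    return groups

# Hypothetical tuples in the shape extract_entities builds from doc.ents:
ents = [("Trump", 22, 27, "PERSON"), ("Reuters", 8, 15, "ORG"),
        ("Germany", 35, 42, "GPE")]
grouped = group_entities(ents)
grouped["PERSON"]  # [('Trump', 22, 27, 'PERSON')]
```

This also handles labels we did not anticipate (e.g. DATE, MONEY) without extra branches.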
In [61]:
def extract_coref_clusters(doc):    
    clusters = []
    for cluster in doc._.coref_clusters:
        clusters.append((cluster.main, cluster.mentions, len(cluster.mentions)))
    
    return clusters

Extract key entities for the 10K data

In [11]:
data_10k = df[:10000][['id', 'title', 'url', 'text', 'published']]
In [12]:
data_10k.head()
Out[12]:
id title url text published
0 4f2fec4a4d32d0f564e5da74188b51e5317e4826 EMERGING MARKETS-Mexican peso seesaws over dol… https://www.reuters.com/article/emerging-marke… (Updates prices, adds Trump comments) By Rodri… 2018-01-26T01:01:00.000+02:00
1 4a9f98e22b7a76a6db40aef6accb24b1688eecb3 Migrants must visit Nazi concentration camps, … https://www.reuters.com/article/us-germany-ant… BERLIN (Reuters) – New migrants to Germany mus… 2018-01-10T21:52:00.000+02:00
2 941e5abdc725739fed17da543d6c449d342b8506 Euro zone businesses start 2018 on decade high https://www.reuters.com/video/2018/01/24/euro-… Euro zone businesses start 2018 on decade high… 2018-01-24T19:14:00.000+02:00
3 038916903c446a14b1096e673c86b74bfbf5cbcc Russia’s Lavrov says ‘unilateral actions’ by U… https://www.reuters.com/article/us-mideast-cri… MOSCOW (Reuters) – “Unilateral actions” by the… 2018-01-21T20:31:00.000+02:00
4 4fe331b8aedfe7386bfa11c5768629ba068f6bcc Lawmakers to Justice Department: Keep online g… https://www.cnbc.com/2018/01/12/the-associated… ATLANTIC CITY, N.J. (AP) — Federal lawmakers w… 2018-01-12T16:55:00.000+02:00

Method to extract key entities using NER occurrence counts

In [79]:
def key_entities_extraction_pipeline(doc, top_n):
    
    # Extract entities
    persons, organisations, countries, events, all_entities = extract_entities(doc)
    
    # Extract coreference clusters
    coreference_clusters = extract_coref_clusters(doc)
    
    # Resolve coreference clusters
    resolved_text = doc._.coref_resolved
    
    # Count the occurrence of each unique (text, label) pair in resolved_text
    overall = list(set((e[0], e[-1]) for e in (persons + organisations + countries + events)))
    overall_count = [(text, label, resolved_text.count(text)) for text, label in overall]
    overall_count.sort(key=lambda tup: tup[2], reverse=True)  # sorts in place
    
    # Keep the top_n entities by occurrence count as key entities
    key_entities = {}

    for i in overall_count[:top_n]:
        key_entities[i[0]] = [i[1], i[2]]
    
    return key_entities, all_entities
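One caveat in the counting step: str.count matches raw substrings, so a short entity name can be over-counted when it occurs (case-sensitively) inside a longer word. A whole-word count using regex lookarounds is a safer drop-in; the sample sentence below is made up:

```python
import re

def count_mentions(text, entity):
    """Count whole-word occurrences of `entity` in `text`."""
    # Lookarounds reject matches glued to other word characters on either side
    pattern = r"(?<!\w)" + re.escape(entity) + r"(?!\w)"
    return len(re.findall(pattern, text))

text = "EA beat estimates; SEASONAL demand helped EA."
text.count("EA")            # 3 -- also matches inside "SEASONAL"
count_mentions(text, "EA")  # 2 -- whole-word matches only
```

re.escape matters here because entity strings like "U.S." contain regex metacharacters.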
In [89]:
all_key_entities = []
all_entities = []
In [90]:
for idx, doc in enumerate(docs):
    key_entities, entities = key_entities_extraction_pipeline(doc, 10)
    all_key_entities.append(key_entities)
    all_entities.append(entities)
In [91]:
data_10k['key_entities'] = all_key_entities
In [92]:
data_10k['all_entities'] = all_entities
In [93]:
data_10k.head()
Out[93]:
id title url text published key_entities all_entities
0 4f2fec4a4d32d0f564e5da74188b51e5317e4826 EMERGING MARKETS-Mexican peso seesaws over dol… https://www.reuters.com/article/emerging-marke… (Updates prices, adds Trump comments) By Rodri… 2018-01-26T01:01:00.000+02:00 {‘U.S.’: [‘GPE’, 7], ‘Mexico’: [‘GPE’, 6], ‘Lu… [(Trump, 22, 27, ORG), (Rodrigo Campos, 41, 55…
1 4a9f98e22b7a76a6db40aef6accb24b1688eecb3 Migrants must visit Nazi concentration camps, … https://www.reuters.com/article/us-germany-ant… BERLIN (Reuters) – New migrants to Germany mus… 2018-01-10T21:52:00.000+02:00 {‘Germany’: [‘GPE’, 13], ‘Schuster’: [‘PERSON’… [(BERLIN, 0, 6, ORG), (Reuters, 8, 15, ORG), (…
2 941e5abdc725739fed17da543d6c449d342b8506 Euro zone businesses start 2018 on decade high https://www.reuters.com/video/2018/01/24/euro-… Euro zone businesses start 2018 on decade high… 2018-01-24T19:14:00.000+02:00 {‘Germany’: [‘GPE’, 2], ‘David Pollard’: [‘PER… [(2018, 27, 31, DATE), (decade, 35, 41, DATE),…
3 038916903c446a14b1096e673c86b74bfbf5cbcc Russia’s Lavrov says ‘unilateral actions’ by U… https://www.reuters.com/article/us-mideast-cri… MOSCOW (Reuters) – “Unilateral actions” by the… 2018-01-21T20:31:00.000+02:00 {‘the United States’: [‘GPE’, 6], ‘Lavrov’: [‘… [(MOSCOW, 0, 6, ORG), (Reuters, 8, 15, ORG), (…
4 4fe331b8aedfe7386bfa11c5768629ba068f6bcc Lawmakers to Justice Department: Keep online g… https://www.cnbc.com/2018/01/12/the-associated… ATLANTIC CITY, N.J. (AP) — Federal lawmakers w… 2018-01-12T16:55:00.000+02:00 {‘New Jersey’: [‘GPE’, 11], ‘the Justice Depar… [(ATLANTIC CITY, 0, 13, GPE), (N.J., 15, 19, G…

Create a DataFrame that maps each key entity to its document id

In [104]:
entity_doc_id = []
for i in range(len(data_10k)):
    key_entities = list(data_10k['key_entities'][i].keys())
    
    tuples = [(data_10k['id'][i], key_entity,
               data_10k['key_entities'][i][key_entity][0],
               data_10k['key_entities'][i][key_entity][1])
              for key_entity in key_entities]
    entity_doc_id.append(tuples)
In [107]:
entity_doc_id_flatten = [item for sublist in entity_doc_id for item in sublist]
In [111]:
entity_doc_df = pd.DataFrame(entity_doc_id_flatten, columns=['id', 'key_entity', 'entity_type', 'no_occurrence'])
In [112]:
entity_doc_df
Out[112]:
id key_entity entity_type no_occurrence
0 4f2fec4a4d32d0f564e5da74188b51e5317e4826 U.S. GPE 7
1 4f2fec4a4d32d0f564e5da74188b51e5317e4826 Mexico GPE 6
2 4f2fec4a4d32d0f564e5da74188b51e5317e4826 Lula PERSON 6
3 4f2fec4a4d32d0f564e5da74188b51e5317e4826 Argentina GPE 5
4 4f2fec4a4d32d0f564e5da74188b51e5317e4826 Luiz Inacio Lula da Silva PERSON 5
72991 03c4bdb6c2f5d97b1959782c1bad6352fc8b6e69 EA ORG 3
72992 03c4bdb6c2f5d97b1959782c1bad6352fc8b6e69 BMO Capital Markets ORG 3
72993 03c4bdb6c2f5d97b1959782c1bad6352fc8b6e69 Electronic Arts ORG 2
72994 03c4bdb6c2f5d97b1959782c1bad6352fc8b6e69 Reddit ORG 1
72995 03c4bdb6c2f5d97b1959782c1bad6352fc8b6e69 Gerrick Johnson PERSON 1

72996 rows × 4 columns
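With the id-to-entity DataFrame in place, a groupby on key_entity gives a quick corpus-level view of which entities dominate. A minimal sketch on hypothetical rows in the same four-column shape:

```python
import pandas as pd

# Hypothetical rows in the same shape as entity_doc_df
rows = [("doc1", "U.S.", "GPE", 7),
        ("doc2", "U.S.", "GPE", 3),
        ("doc2", "EA", "ORG", 5)]
df = pd.DataFrame(rows, columns=["id", "key_entity", "entity_type", "no_occurrence"])

# Total mentions per entity across all documents, most frequent first
totals = (df.groupby("key_entity")["no_occurrence"]
            .sum()
            .sort_values(ascending=False))
totals.head()
```

The same frame also supports the reverse lookup, e.g. df[df["key_entity"] == "U.S."]["id"] to find all documents keyed to a given entity.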

Ryan

Data Scientist