Entity Linking with DBPedia

The goal here is to extract entities (persons, organisations, events, etc.) from financial news articles. We also process the text to identify coreference clusters and resolve them.

Finally, we link the extracted entities to real-world entities in the DBPedia knowledge base. Specifically, we want to a) link each extracted entity to a real-world entity and b) find out which extracted entities are key entities, using the “support” value as a proxy. The “support” value represents how prominent the entity is in the Lucene model, i.e. the number of inlinks in Wikipedia. We also use the coreference clusters to compute how often an entity appears within the document, as a second measure of key entities.

The pipeline is as follows:

  1. Feed news articles
  2. Extract entities – persons, organisations, countries, events
  3. Extract coreference clusters
  4. Resolve coreference clusters (replace all the mentions within a cluster with the main entity)
  5. Count the occurrence of the extracted entities
  6. Entity extraction and linking using DBPedia
  7. Identify key entities using the occurrence count and “support” value from DBPedia

The main output of this pipeline is a mapping from news articles to their key entities.

Function to perform entity linking with DBPedia Spotlight’s API

In [8]:
# An API Error Exception
class APIError(Exception):
    def __init__(self, status):
        self.status = status
    def __str__(self):
        return "APIError: status={}".format(self.status)
In [9]:
import json

import requests


def entity_linking(doc):
    # Base URL for Spotlight API
    base_url = "http://api.dbpedia-spotlight.org/en/annotate"

    # Parameters 
    # 'text' - text to be annotated 
    # 'confidence' -   confidence score for linking
    params = {"text": doc, "confidence": 0.8}
    # Response content type
    headers = {'accept': 'application/json'}
    # GET Request
    res = requests.get(base_url, params=params, headers=headers)
    if res.status_code != 200:
        # Something went wrong
        raise APIError(res.status_code)
    
    data = json.loads(res.text)
    
    return data
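
Since the public Spotlight endpoint limits how many requests it serves, one failed call does not have to abort a whole batch of articles. The sketch below is one possible way to retry a call with a growing delay. The helper name `call_with_retry` and its parameters are assumptions made for illustration; they are not part of the original notebook or of any library.

```python
import time


def call_with_retry(func, *args, retries=3, base_delay=1.0,
                    retry_on=(Exception,), sleep=time.sleep):
    """Call func(*args) and retry on failure (hypothetical helper).

    The delay doubles after each failed attempt: base_delay, 2 * base_delay, ...
    The last exception is re-raised once `retries` attempts have failed.
    """
    for attempt in range(retries):
        try:
            return func(*args)
        except retry_on:
            if attempt == retries - 1:
                # Out of attempts: let the caller see the original error
                raise
            sleep(base_delay * (2 ** attempt))
```

With the functions above, a call could look like `call_with_retry(entity_linking, text, retry_on=(APIError,))`, so that only API errors trigger a retry.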

Method to extract key entities using NER occurrence counts and DBPedia

  • The DBPedia API limits the number of calls, so I created another method below that extracts key entities without DBPedia
In [62]:
def key_entities_extraction_pipeline_with_entity_linking(doc, top_n):
    
    # Extract entities
    persons, organisations, countries, events, all_entities = extract_entities(doc)
    
    # Extract coreference clusters
    coreference_clusters = extract_coref_clusters(doc)
    
    # Resolve coreference clusters
    resolved_text = doc._.coref_resolved
    
    # Count the occurrence of entities in resolved_text
    overall = list(set([i[0] for i in (persons + organisations + countries + events)]))
    overall_count = [(key, resolved_text.count(key)) for key in overall]
    overall_count.sort(key=lambda tup: tup[1], reverse = True)  # sorts in place
    
    # Entity extraction and linking using DBPedia
    data = entity_linking(resolved_text)
    # 'Resources' can be missing from the response when nothing was annotated
    entity_dbpedia = list(set([(i['@surfaceForm'], i['@URI'], i['@support']) for i in data.get('Resources', [])]))
    entity_dbpedia.sort(key=lambda tup: int(tup[2]), reverse = True)  # sorts in place
    
    # Identify key entities using count occurrence and "support" from DBPedia
    key_entities = {}

    for i in overall_count[:top_n]:
        key_entities[i[0]] = [i[1]]
    for i in entity_dbpedia[:top_n]:
        if i[0] in key_entities:
            key_entities[i[0]].append((i[1], i[2]))
        else:
            key_entities[i[0]] = [(i[1], i[2])]
    
    return key_entities
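
The DBPedia-free variant mentioned earlier is not shown in this section. A minimal sketch of what it might look like, assuming it relies only on the occurrence counts, is given here. The function name `key_entities_by_count` and its signature are hypothetical, and it takes plain entity names and the resolved text rather than a spaCy document so that it runs on its own.

```python
def key_entities_by_count(entity_names, resolved_text, top_n):
    """Rank entity names by how often they occur in the resolved text."""
    unique_names = set(entity_names)
    counts = [(name, resolved_text.count(name)) for name in unique_names]
    # Sort by count (descending), then by name so that ties are deterministic
    counts.sort(key=lambda pair: (-pair[1], pair[0]))
    return dict(counts[:top_n])
```

Because it never calls the API, this version can be run over the full corpus without hitting the request limit, at the cost of losing the DBPedia URI and “support” value.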

Example: Step-by-step walkthrough

In [246]:
key_entities_extraction_pipeline_with_entity_linking(docs[9000], 3)
Out[246]:
{'Wang': [4],
 'China': [3, ('http://dbpedia.org/resource/China', '192191')],
 'Etonkids Educational Group': [2],
 'Beijing': [('http://dbpedia.org/resource/Beijing', '36934')],
 'CNBC': [('http://dbpedia.org/resource/CNBC', '3308')]}
In [152]:
test_doc = docs[9000]; test_doc
Out[152]:
Vivien Wang, founder and chief executive of Chinese kindergarten company Etonkids Educational Group, spoke with CNBC on Tuesday about tech in the classroom.
In fact, each classroom in the company's kindergartens features an interactive white board that allows teachers to access resources on the internet, she said.
"In every single classroom at our bilingual international campuses, we do have whiteboards," said the company's founder and chief executive, Vivien Wang.
The early education services provider owns over 50 campuses across 18 cities in China. Some 10,000 children are enrolled in its classes.
The group supplements its teaching with augmented reality in products like interactive flashcards, Wang told CNBC on the sidelines of the Morgan Stanley China Technology, Media and Telecoms Conference in Beijing .
Education is big business in China, where all couples are now able to have two children after decades of a strict one-child policy.
In [153]:
persons, organisations, countries, events, all_entities = extract_entities(test_doc)
In [154]:
coreference_clusters = extract_coref_clusters(test_doc)
In [157]:
resolved_text = test_doc._.coref_resolved; resolved_text
Out[157]:
'Wang, spoke with CNBC on Tuesday about tech in the classroom.\nIn fact, each classroom in the company\'s kindergartens features an interactive white board that allows teachers to access resources on the internet, Wang said.\n"In every single classroom at our bilingual international campuses, we do have whiteboards," said the company\'s founder and chief executive, Vivien Wang.\nThe early education services provider owns over 50 campuses across 18 cities in China. Some 10,000 children are enrolled in The early education services provider classes.\nChinese kindergarten company Etonkids Educational Group supplements Chinese kindergarten company Etonkids Educational Group teaching with augmented reality in products like interactive flashcards, Wang told CNBC on the sidelines of the Morgan Stanley China Technology, Media and Telecoms Conference in Beijing .\nEducation is big business in China, where all couples are now able to have two children after decades of a strict one-child policy.'
In [125]:
overall = list(set([i[0] for i in (persons + organisations + countries + events)]))
In [174]:
overall_count = [(key, resolved_text.count(key)) for key in overall]
In [178]:
overall_count.sort(key=lambda tup: tup[1], reverse = True)  # sorts in place
In [179]:
overall_count
Out[179]:
[('Wang', 4),
 ('China', 3),
 ('Etonkids Educational Group', 2),
 ('CNBC', 2),
 ('Media and Telecoms Conference', 1),
 ('Vivien Wang', 1),
 ('Beijing', 1),
 ('Morgan Stanley China Technology', 1)]
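
Note that `str.count` counts substrings, so the 4 for 'Wang' includes the 'Vivien Wang' mention and the 3 for 'China' includes 'Morgan Stanley China Technology'. As a partial remedy, the sketch below counts only matches that are not embedded in a longer word. The helper `count_mentions` is an assumption for illustration and not part of the original pipeline. It does not separate overlapping multi-word names such as 'Wang' and 'Vivien Wang'.

```python
import re


def count_mentions(text, name):
    """Count occurrences of `name` that are not part of a longer word."""
    # Lookarounds require a non-word character (or the text boundary)
    # on both sides of the match.
    pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
    return len(re.findall(pattern, text))
```

For example, in the string "China and Chinatown", `str.count("China")` gives 2 while `count_mentions` gives 1.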
In [159]:
data = entity_linking(resolved_text)
In [183]:
entity_dbpedia = list(set([(i['@surfaceForm'], i['@URI'], i['@support']) for i in data['Resources']]))
In [188]:
entity_dbpedia.sort(key=lambda tup: int(tup[2]), reverse = True)  # sorts in place
In [189]:
entity_dbpedia
Out[189]:
[('China', 'http://dbpedia.org/resource/China', '192191'),
 ('Beijing', 'http://dbpedia.org/resource/Beijing', '36934'),
 ('CNBC', 'http://dbpedia.org/resource/CNBC', '3308'),
 ('Morgan Stanley', 'http://dbpedia.org/resource/Morgan_Stanley', '1729'),
 ('augmented reality',
  'http://dbpedia.org/resource/Augmented_reality',
  '1215'),
 ('international', 'http://dbpedia.org/resource/International_school', '728'),
 ('one-child policy', 'http://dbpedia.org/resource/One-child_policy', '436')]
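
Spotlight also links ordinary phrases, for example 'international' to International_school in the output above. One option, sketched here as an assumption rather than as part of the original pipeline, is to keep only the links whose surface form was also found by the NER step. The helper name `keep_links_for_known_entities` is hypothetical.

```python
def keep_links_for_known_entities(entity_dbpedia, entity_names):
    """Keep (surface_form, uri, support) tuples whose surface form is a known entity."""
    known = set(entity_names)
    # Exact string match only; the original order of the links is preserved
    return [link for link in entity_dbpedia if link[0] in known]
```

On the lists shown above this would keep China, Beijing and CNBC. 'Morgan Stanley' would be dropped because the NER step extracted 'Morgan Stanley China Technology' instead, so exact matching is a deliberately strict choice.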
In [211]:
key_entities = {}

for i in overall_count[:3]:
    key_entities[i[0]] = [i[1]]
In [212]:
for i in entity_dbpedia[:3]:
    if i[0] in key_entities:
        key_entities[i[0]].append((i[1], i[2]))
    else:
        key_entities[i[0]] = [(i[1], i[2])]
In [213]:
key_entities
Out[213]:
{'Wang': [4],
 'China': [3, ('http://dbpedia.org/resource/China', '192191')],
 'Etonkids Educational Group': [2],
 'Beijing': [('http://dbpedia.org/resource/Beijing', '36934')],
 'CNBC': [('http://dbpedia.org/resource/CNBC', '3308')]}
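
In the result above, each value is a list that may hold a count, a (URI, support) tuple, or both, which makes downstream code check types. A possible alternative, shown only as a sketch, stores one record per entity with fixed keys. The function name `merge_key_entities` and the record keys `count`, `uri` and `support` are assumptions, not part of the original notebook.

```python
def merge_key_entities(overall_count, entity_dbpedia, top_n):
    """Merge the top-n counts and top-n DBPedia links into uniform records."""
    key_entities = {}
    for name, count in overall_count[:top_n]:
        key_entities[name] = {"count": count, "uri": None, "support": None}
    for surface, uri, support in entity_dbpedia[:top_n]:
        # Reuse the record if the entity was already added from the counts
        record = key_entities.setdefault(
            surface, {"count": None, "uri": None, "support": None}
        )
        record["uri"] = uri
        record["support"] = int(support)  # Spotlight returns support as a string
    return key_entities
```

As in the original function, an entity that enters only through its DBPedia rank (such as CNBC here) has no count recorded, even though it does occur in the text.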
Ryan

Data Scientist