Entity resolution is the task of determining whether two mentions or nodes refer to the same underlying entity. It is a key step in any robust domain-specific knowledge graph pipeline. In natural language text, we can use linguistic clues to determine when extracted pronouns in a document refer to the same entity. When the extraction and linking happen across documents, the task is known as cross-document coreference resolution. Most entity resolution research tackles the single-source, single-schema setting. Some work addresses multiple sources, but it still stays within a single schema.
Challenges and Requirements
The figure below provides some insight into why entity resolution has been such a difficult task to automate. The most important challenge is resolving the ambiguity of the extracted information, especially in the presence of noise and without access to the underlying text. In addition, there is the issue of singleton nodes, where an entity shows up only once and is not linked to any other node. Humans tend to counter ambiguity by drawing on background knowledge. Another challenge is scalability: naive solutions compare every pair of nodes, so their cost grows quadratically with the number of nodes in the KG.
Multi-schema entity resolution is still at a very early stage, but it is key to constructing knowledge graphs from many different sources and documents. To perform well, entity resolution systems require a lot of training data, which is hard to acquire. Of the four requirements discussed below (automation, scalability, heterogeneity, and domain adaptability), entity resolution systems tend to do well on automation and scalability but not so well on heterogeneity.
An ideal entity resolution solution should have a high degree of automation. This is usually achieved with a non-adaptive system, but such a system has low robustness in real-world usage. Training an adaptive system, however, requires either training data or unsupervised techniques. In industry, people tend to use crowdsourcing for annotation, but this option is limited by cost and scalability.
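To make the trade-off concrete, here is a minimal sketch of a non-adaptive matcher. It is only an illustration: the function names and the 0.85 threshold are assumptions, and string similarity from Python's standard difflib module stands in for whatever rule a real system would use.

```python
from difflib import SequenceMatcher

# Hypothetical fixed threshold. A non-adaptive system never learns or tunes it.
THRESHOLD = 0.85


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial variations do not matter."""
    return " ".join(name.lower().split())


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def same_entity(a: str, b: str, threshold: float = THRESHOLD) -> bool:
    """Rule-based decision: match when similarity reaches the fixed threshold."""
    return similarity(a, b) >= threshold


# Trivial spelling variation is handled by the rule.
print(same_entity("Barack Obama", "barack  obama"))
# An abbreviation is not, which illustrates the low robustness of fixed rules.
print(same_entity("IBM", "International Business Machines"))
```

The rule runs without any training data, which is what makes it fully automated, but it cannot recover from cases its author did not anticipate, such as the abbreviation above.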
An entity resolution system should also offer elastic scalability, ideally with computational resources growing linearly as the data size increases. However, entity resolution tends to have quadratic complexity, and research has been active in reducing it toward linear, for example through blocking.
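The following sketch contrasts naive all-pairs comparison with a simple form of blocking. The blocking key (the first three letters of the last name token) and the sample names are assumptions chosen for illustration; real systems pick keys suited to their data.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs(records):
    """Naive candidate generation: every pair, n * (n - 1) / 2 comparisons."""
    return list(combinations(sorted(records), 2))


def blocking_key(name: str) -> str:
    """Hypothetical key: first three letters of the last token, lowercased."""
    tokens = name.lower().split()
    return tokens[-1][:3] if tokens else ""


def blocked_pairs(records):
    """Only compare records that share a blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(sorted(members), 2))
    return pairs


names = [
    "Barack Obama", "B. Obama",
    "Angela Merkel", "A. Merkel",
    "Elon Musk", "E. Musk",
]

print(len(all_pairs(names)))      # 15 candidate pairs without blocking
print(len(blocked_pairs(names)))  # 3 candidate pairs with blocking
```

Blocking does not guarantee linear cost: the saving depends on the blocks staying small, and any true match whose records land in different blocks is never compared.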
Multi-schema heterogeneity can be broken down into two separate problems:
Type heterogeneity – when different ontologies are used for different raw data elements. For example, Inventor might be used in one ontology whereas Entrepreneur is used in a second ontology.
Property heterogeneity – the matching of property or edge labels across ontologies. This involves aligning the properties of the same underlying entities in different ontologies.
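As a rough illustration of both problems, the sketch below maps two records from different ontologies onto a shared schema before comparing them. The alignment tables, the shared Person type, the property names, and the sample records are all hypothetical. In practice such alignments have to be discovered or curated, which is the hard part.

```python
# Hypothetical type alignment: both source types map to one shared type.
TYPE_ALIGNMENT = {"Inventor": "Person", "Entrepreneur": "Person"}

# Hypothetical property alignment: source labels map to shared labels.
PROPERTY_ALIGNMENT = {
    "fullName": "name",
    "label": "name",
    "birthDate": "date_of_birth",
    "dob": "date_of_birth",
}


def to_shared_schema(record: dict) -> dict:
    """Rewrite the type and property labels of a record into the shared schema."""
    aligned = {}
    for key, value in record.items():
        if key == "type":
            aligned["type"] = TYPE_ALIGNMENT.get(value, value)
        else:
            aligned[PROPERTY_ALIGNMENT.get(key, key)] = value
    return aligned


# Two made-up records describing the same person under different ontologies.
record_a = {"type": "Inventor", "fullName": "Jane Doe", "birthDate": "1970-01-01"}
record_b = {"type": "Entrepreneur", "label": "Jane Doe", "dob": "1970-01-01"}

print(to_shared_schema(record_a) == to_shared_schema(record_b))  # True
```

Once both records share types and property labels, a single-schema matcher can compare them directly.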
The last requirement is domain adaptability: how can we build an entity resolution system that can be reused across domains? Domain adaptability means the system must be able to adapt as the domain changes. It therefore refers to the meta-ability of an entity resolution system to be retrained, redeployed, and reused on a different domain with minimal overhead.