The paper presents an approach to building an enterprise knowledge graph (EKG) that incorporates information about 40 million enterprises in China. It also provides enterprise querying and data visualisation capabilities, as well as novel investment analysis scenarios, among others. The EKG is being used by two securities companies in their investment banking (IB) and investment consulting businesses.
There are two types of challenges: business challenges and technology challenges.
One of the main business challenges is data privacy: how can we provide useful analysis without violating the privacy of a company and its employees? The second is finding killer services on the graph: the EKG is so large and complex that users can be overloaded with information when we deliver raw information directly.
To solve the data privacy problem, we can transform data into rank form. Additionally, we obscure critical nodes and do not show them when visualising the EKG. Lastly, we provide user interfaces that only allow certain types of queries. Specifically, we provide services that directly meet business needs.
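As a rough illustration of the rank transformation, the sketch below replaces raw values with their rank so that only relative standing is exposed. The function name `to_ranks`, the dense-ranking rule for ties, and the sample figures are assumptions made for illustration; the paper summary does not specify how ranks are computed.

```python
def to_ranks(values):
    """Map each key to a 1-based dense rank (1 = largest value).

    Equal values share the same rank, so the exact figures cannot be
    read back from the output, only the relative ordering.
    """
    # Distinct raw values, largest first.
    distinct = sorted(set(values.values()), reverse=True)
    # Position of each distinct value in that ordering is its rank.
    rank_of = {value: position + 1 for position, value in enumerate(distinct)}
    return {name: rank_of[value] for name, value in values.items()}


# Hypothetical raw figures (for example, annual revenue) that should stay private.
revenue = {
    "company_a": 120.0,
    "company_b": 75.5,
    "company_c": 120.0,
    "company_d": 10.0,
}

ranks = to_ranks(revenue)
print(ranks)
```

Only `ranks` would be shown to users; the raw `revenue` values stay inside the system.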
The technology challenges arise from the diversity and scale of the data sources. We extracted data from relational databases, bidding information, stock information, patent information, competitor relations, and acquisition events. While extracting, we faced the following challenges:
Data Model – There are many different data types to deal with, and the relations in the EKG are not only binary relations: there are also meta properties and events. A meta property is an attribute attached to a specific relation. For example, if a person is employed by a company, then the “employ” relation has an additional “entry time” attribute. An event, in turn, connects many entities at once.
D2R Mapping – Mapping relational databases to RDF form is difficult. The difficulties include mapping meta properties, mapping data to different classes in RDF, and handling data in the same relational table that belongs to different classes.
Information Extraction – Some information is extracted from free text in encyclopedias, which makes it hard to extract useful data with high accuracy. Entity extraction can become very challenging depending on the quality of the data.
Query Performance – The number of triples in the EKG has reached the billions, and the number of complex query patterns grows as EKG usage scenarios increase.
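To make the data model challenge more concrete, here is a minimal sketch of how a relation with a meta property and a multi-entity event might be flattened into plain triples. It uses the standard RDF reification vocabulary (`rdf:Statement`, `rdf:subject`, `rdf:predicate`, `rdf:object`) as one possible encoding; the summary does not say which encoding the authors chose, and the class names, the `ex:` identifiers, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Relation:
    """A binary relation that may carry meta properties."""
    subject: str
    predicate: str
    obj: str
    meta: dict = field(default_factory=dict)  # e.g. {"ex:entryTime": "..."}


@dataclass
class Event:
    """An n-ary event: one node linked to every participant by a role."""
    event_id: str
    event_type: str
    participants: dict  # role -> entity


def relation_to_triples(rel, stmt_id):
    """Emit one plain triple, or a reified statement if meta properties exist."""
    if not rel.meta:
        return [(rel.subject, rel.predicate, rel.obj)]
    triples = [
        (stmt_id, "rdf:type", "rdf:Statement"),
        (stmt_id, "rdf:subject", rel.subject),
        (stmt_id, "rdf:predicate", rel.predicate),
        (stmt_id, "rdf:object", rel.obj),
    ]
    # Each meta property hangs off the statement node, not off either entity.
    triples += [(stmt_id, key, value) for key, value in rel.meta.items()]
    return triples


def event_to_triples(event):
    """Turn an event into one typing triple plus one triple per participant."""
    triples = [(event.event_id, "rdf:type", event.event_type)]
    triples += [(event.event_id, role, entity)
                for role, entity in event.participants.items()]
    return triples


# Hypothetical sample data.
employ = Relation("ex:acme", "ex:employ", "ex:alice",
                  meta={"ex:entryTime": "2015-03-01"})
acquisition = Event("ex:event1", "ex:Acquisition",
                    {"ex:acquirer": "ex:acme",
                     "ex:target": "ex:beta",
                     "ex:advisor": "ex:bank"})

employ_triples = relation_to_triples(employ, "ex:stmt1")
event_triples = event_to_triples(acquisition)
```

The cost of this encoding is visible in the counts: one "employ" fact becomes five triples, which is part of why triple counts and query complexity grow quickly at this scale.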
Potential solutions to the challenges above are as follows. For free text, we can first extract information from semi-structured sources, then use the extracted data for distantly supervised learning to extract more data from free text. We can also use graph-based entity linking algorithms to perform entity linking, a major step in entity extraction: we calculate the similarity between mentions and entities in the KB to find candidate entities, then construct an undirected graph to complete the disambiguation.
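The two entity linking steps can be sketched roughly as follows: string similarity proposes candidate entities for each mention, and an undirected graph of known KB relations then favours the combination of candidates that are connected to each other. This is a toy illustration, not the paper's algorithm: the similarity measure (`difflib.SequenceMatcher`), the threshold, the scoring rule (similarity plus one point per connecting edge), the exhaustive search, and the tiny KB are all assumptions.

```python
from difflib import SequenceMatcher
from itertools import product


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidates(mention, kb_names, threshold=0.5):
    """Return (entity_id, score) pairs whose name is similar enough."""
    scored = [(eid, similarity(mention, name)) for eid, name in kb_names.items()]
    return [(eid, score) for eid, score in scored if score >= threshold]


def disambiguate(mentions, kb_names, kb_edges, threshold=0.5):
    """Pick one candidate per mention, maximising similarity plus coherence.

    kb_edges is a set of frozenset({id1, id2}) pairs, i.e. an undirected graph.
    Exhaustive search is only practical for a handful of mentions.
    """
    per_mention = [candidates(m, kb_names, threshold) for m in mentions]
    if any(not cands for cands in per_mention):
        raise ValueError("a mention has no candidate entity")

    best_ids, best_score = None, float("-inf")
    for combo in product(*per_mention):
        ids = [eid for eid, _ in combo]
        score = sum(s for _, s in combo)
        # Coherence: one point for every KB edge between chosen entities.
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                if frozenset((ids[i], ids[j])) in kb_edges:
                    score += 1.0
        if score > best_score:
            best_ids, best_score = ids, score
    return dict(zip(mentions, best_ids))


# Toy KB: two entities compete for the mention "Apple".
kb_names = {"E1": "Apple Inc.", "E2": "Apple", "E3": "Tim Cook"}
kb_edges = {frozenset(("E1", "E3"))}  # the company is linked to its CEO

result = disambiguate(["Apple", "Tim Cook"], kb_names, kb_edges)
print(result)
```

In this toy case the exact string match (E2) loses to the company (E1), because only E1 is connected in the graph to the entity chosen for the other mention. A production system would need a scalable search in place of the exhaustive loop.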