Approach Overview

A knowledge graph consists of a schema graph, a data graph, and the relations between the two. The schema graph has a set N of nodes representing classes (concepts), a set P of nodes representing user-defined properties, and a set E of edges representing the relationships between classes. What is a meta property? It is an additional property attached to a relation. If a relation links more than one subject or object, it is an n-ary relation.

The data graph has a set N of nodes representing instances and literals, a set P of nodes representing properties, and a set E of edges representing relationships between nodes. Each edge represents a fact as a triple (subject, predicate, object). If the object of a triple is an instance, the property is a relation; otherwise, it is a datatype property.
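The relation versus datatype-property rule above can be sketched in a few lines of plain Python. This is only an illustration: the `Instance` and `Literal` classes, the `ex:` identifiers, and the sample facts are hypothetical and do not come from the EKG itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """A data-graph node that stands for an entity (hypothetical helper)."""
    iri: str


@dataclass(frozen=True)
class Literal:
    """A data-graph node that holds a plain value (hypothetical helper)."""
    value: object


def property_kind(triple):
    """Classify the predicate of a (subject, predicate, object) triple."""
    _subject, _predicate, obj = triple
    # An instance as object means the property is a relation;
    # anything else (a literal) makes it a datatype property.
    return "relation" if isinstance(obj, Instance) else "datatype property"


# Made-up facts, for illustration only.
acme = Instance("ex:AcmeLtd")
alice = Instance("ex:Alice")
facts = [
    (acme, "ex:legalRepresentative", alice),
    (acme, "ex:registeredCapital", Literal(1_000_000)),
]
kinds = {p: property_kind((s, p, o)) for s, p, o in facts}
```

Here `kinds` maps each predicate to its kind, so the same check could be run over a whole data graph to separate the two sorts of property.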

Lastly, the relations R that connect the schema graph and the data graph link instances in the data graph to classes in the schema graph through the rdf:type property.
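To make the role of rdf:type concrete, here is a minimal sketch that stores triples as plain tuples and looks up the link in both directions. The class and instance names are hypothetical, and a real system would use a triple store rather than a Python list.

```python
RDF_TYPE = "rdf:type"

# Data-graph facts plus the rdf:type links to schema classes.
# All identifiers are made up for illustration.
triples = [
    ("ex:AcmeLtd", RDF_TYPE, "ex:Company"),
    ("ex:Alice", RDF_TYPE, "ex:Person"),
    ("ex:AcmeLtd", "ex:legalRepresentative", "ex:Alice"),
]


def classes_of(instance, triples):
    """Return the schema classes that an instance is typed with."""
    return {o for s, p, o in triples if s == instance and p == RDF_TYPE}


def instances_of(cls, triples):
    """Return the data-graph instances that belong to a schema class."""
    return {s for s, p, o in triples if o == cls and p == RDF_TYPE}
```

For example, `classes_of("ex:AcmeLtd", triples)` yields the set containing `ex:Company`, and `instances_of("ex:Person", triples)` yields the set containing `ex:Alice`.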

What are the data sources and related tasks for constructing the enterprise knowledge graph (EKG)?

Firstly, the knowledge graph is mainly based on structured enterprise information from CSAIC, which has data on 40 million companies, 60 million people, 8 million litigation records, and so on. Companies and people each have their own set of relevant attributes. We transform the relational databases into RDF to build a basic KG.
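As a rough sketch of what a relational-to-RDF step can look like, the function below maps one table row to triples: the row becomes an instance, the table becomes its class, and each non-null column becomes a property. The table name, column names, and the `ex:` namespace are assumptions for illustration; the mapping actually used for the CSAIC data is not described here.

```python
def row_to_triples(table, primary_key, row, namespace="ex:"):
    """Map one relational row to (subject, predicate, object) triples."""
    subject = f"{namespace}{table}/{row[primary_key]}"
    # The table name is treated as the class of the new instance.
    triples = [(subject, "rdf:type", f"{namespace}{table.capitalize()}")]
    for column, value in row.items():
        # Skip the key itself and NULL cells, which carry no fact.
        if column == primary_key or value is None:
            continue
        triples.append((subject, f"{namespace}{column}", value))
    return triples


# Hypothetical row from a hypothetical "company" table.
row = {"id": 42, "name": "Acme Ltd", "registered_capital": 1000000, "website": None}
company_triples = row_to_triples("company", "id", row)
```

Foreign-key columns would need extra handling so that they point at other instances instead of literal values; that part is left out to keep the sketch short.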

Secondly, we look at patent websites, which contain a large amount of patent information. Here we extracted information on 5 million patents to build a patent KG. The basic KG and the patent KG are linked through companies and persons to form the EKG.

Thirdly, we extracted 3 million records of enterprise bidding information, which includes the investor, investee, investment time, and so on. We also extracted stock information for listed companies. The EKG is fused with the newly extracted information.

Lastly, we extracted competitive relations and acquisition events from encyclopedic sites and added them to the EKG by matching on company name and person name. See the figure below for an example.
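Linking extracted facts to existing entities by name can be sketched as follows. The normalisation rule (lower-casing and collapsing whitespace), the function names, and the sample data are all assumptions; real company names usually need more careful matching than this.

```python
def normalise(name):
    """Lower-case a name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def link_by_name(name_to_iri, extracted):
    """Resolve the names in extracted (name, predicate, name) facts to EKG IRIs."""
    index = {normalise(name): iri for name, iri in name_to_iri.items()}
    linked, unmatched = [], []
    for subj_name, predicate, obj_name in extracted:
        subj = index.get(normalise(subj_name))
        obj = index.get(normalise(obj_name))
        if subj is not None and obj is not None:
            linked.append((subj, predicate, obj))
        else:
            # Keep facts we cannot resolve so they can be reviewed later.
            unmatched.append((subj_name, predicate, obj_name))
    return linked, unmatched


# Hypothetical entities and extracted facts.
known = {"Acme Ltd": "ex:company/42", "Globex Corp": "ex:company/7"}
facts = [
    ("  ACME  Ltd ", "ex:competitorOf", "Globex Corp"),
    ("Acme Ltd", "ex:acquired", "Initech"),
]
linked, unmatched = link_by_name(known, facts)
```

Facts whose names cannot be resolved are returned separately instead of being silently dropped, which makes it easier to see how much of the extracted data actually reaches the graph.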

Note that when pulling information from different sources, we might face the problem of inconsistency and conflicts. We attempted to solve this issue by selecting the latest information based on the pages’ update times.
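The "keep the latest value" rule can be sketched like this. The record layout (`subject`, `predicate`, `value`, `updated`) and the sample values are hypothetical; the only idea taken from the text is that the candidate with the most recent page update time wins.

```python
from datetime import datetime


def resolve_conflicts(candidates):
    """Keep, for each (subject, predicate), the value with the latest update time."""
    latest = {}
    for c in candidates:
        key = (c["subject"], c["predicate"])
        updated = datetime.fromisoformat(c["updated"])
        # Replace the stored value only if this candidate is strictly newer.
        if key not in latest or updated > latest[key][0]:
            latest[key] = (updated, c["value"])
    return {key: value for key, (_updated, value) in latest.items()}


# Two sources disagree on the same attribute (made-up data).
candidates = [
    {"subject": "ex:company/42", "predicate": "ex:ceo", "value": "Alice", "updated": "2016-05-01"},
    {"subject": "ex:company/42", "predicate": "ex:ceo", "value": "Bob", "updated": "2017-03-15"},
]
resolved = resolve_conflicts(candidates)
```

In this example `resolved` keeps "Bob" for `ex:ceo`, because that candidate carries the later update time.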

Describe the data-driven KG construction process.

There are 5 main steps:

  1. Schema Design

  2. D2R Transformation

  3. Information Extraction

  4. Data Fusion with Instance Matching

  5. Storage Design and Query Optimisation

The whole construction process is data-driven and iterative. Whether the D2R transformation step or the information extraction step is initiated depends on the new data sources, and a new iteration begins whenever new data sources arrive.
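The data-driven dispatch can be sketched as a small planner that decides which steps one iteration needs. Two things here are assumptions and not statements from the process above: that relational sources go through D2R while all other sources go through information extraction, and that every iteration revisits schema design. The `kind` field and source names are hypothetical.

```python
ALL_STEPS = [
    "Schema Design",
    "D2R Transformation",
    "Information Extraction",
    "Data Fusion with Instance Matching",
    "Storage Design and Query Optimisation",
]
SOURCE_DEPENDENT = {"D2R Transformation", "Information Extraction"}


def ingestion_step(source):
    """Pick the ingestion step for one source (assumed rule, see lead-in)."""
    if source["kind"] == "relational":
        return "D2R Transformation"
    return "Information Extraction"


def plan_iteration(new_sources):
    """List the steps one iteration runs, given the newly arrived sources."""
    needed = {ingestion_step(source) for source in new_sources}
    # Source-dependent steps run only if some new source requires them.
    return [s for s in ALL_STEPS if s not in SOURCE_DEPENDENT or s in needed]


plan = plan_iteration([{"name": "patent site", "kind": "web"}])
```

With only a web source arriving, the plan skips D2R Transformation and keeps the other four steps in order.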



