Building Knowledge Graphs

As mentioned in the previous post, there are five steps to building the knowledge graph:

  1. Schema Design

  2. D2R Transformation

  3. Information Extraction

  4. Data Fusion with Instance Matching

  5. Storage Design and Query Optimisation

Describe the Schema Design process.

We manually design and extend the schema of the EKG, since the schema changes whenever new data sources are added. In the first iteration, the EKG includes four basic concepts, “Company”, “Person”, “Credit”, and “Litigation”, whose major relations are “subsidiary”, “shareholder”, and “executive”. In the second iteration, we added “ListedCompany”, “Stock”, “Bidding”, and “Investment” to the EKG, as well as new relations.
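
As a rough illustration, the first-iteration schema could be written down with rdflib as follows. This is a minimal sketch: the namespace URI and the domain/range assignments are assumptions for illustration, not taken from the EKG itself.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS, OWL

# Hypothetical namespace for the EKG vocabulary (assumption, not from the post).
EKG = Namespace("http://example.org/ekg/")

g = Graph()
g.bind("ekg", EKG)

# First-iteration concepts.
for cls in ("Company", "Person", "Credit", "Litigation"):
    g.add((EKG[cls], RDF.type, OWL.Class))

# Major relations; the domains and ranges here are assumptions for illustration.
relations = {
    "subsidiary": (EKG.Company, EKG.Company),
    "shareholder": (EKG.Person, EKG.Company),
    "executive": (EKG.Person, EKG.Company),
}
for name, (domain, rng) in relations.items():
    prop = EKG[name]
    g.add((prop, RDF.type, OWL.ObjectProperty))
    g.add((prop, RDFS.domain, domain))
    g.add((prop, RDFS.range, rng))

print(g.serialize(format="turtle"))
```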

Describe the D2R Transformation process.

There are three steps to transform relational databases to RDF:

  1. Table splitting

  2. Basic D2R transformation by D2RQ

  3. Post processing

The original data consist of many tables that contain multiple entities and relations. To make the tables easier to understand and handle, we split them into smaller ones: atomic entity tables, atomic relation tables, complex entity tables, and complex relation tables. An atomic entity table corresponds to a class, and an atomic relation table corresponds to relation instances whose domain and range are two classes. We use D2RQ to transform the atomic entity and relation tables into RDF: table names become classes (concepts), columns become properties, and cell values become the corresponding property values. The figure below shows an example of table splitting. The final step is post processing, which has three main parts (a sketch follows this list):

  1. Meta property mapping. Provides an ID annotation to facts that have meta properties. The meta properties become the properties of the n-ary relation, giving us new triples

  2. Conditional taxonomy mapping. Determines whether an entity is mapped to a subclass based on whether the entity appears in a table related to that subclass

  3. Conditional class mapping
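
D2RQ handles the actual table-to-RDF mapping; the following pure-Python sketch only illustrates the idea behind the basic transformation and the meta property mapping step. The table layouts, URIs, and the reification-style fact IDs are assumptions for illustration.

```python
# Illustrative sketch of the table-to-RDF idea (not the actual D2RQ mapping).
# An atomic entity table maps to a class; its columns map to properties.

def entity_table_to_triples(table_name, rows, id_column):
    """Each row becomes an instance of the class named after the table."""
    triples = []
    for row in rows:
        subject = f"ekg:{table_name}/{row[id_column]}"
        triples.append((subject, "rdf:type", f"ekg:{table_name}"))
        for column, value in row.items():
            if column != id_column:
                triples.append((subject, f"ekg:{column}", value))
    return triples

def relation_with_meta_to_triples(relation, domain_id, range_id, meta, fact_id):
    """Meta property mapping: give the fact an ID so that meta properties
    (e.g. a shareholding ratio) attach to the fact itself as new triples."""
    fact = f"ekg:fact/{fact_id}"
    triples = [
        (fact, "ekg:subject", f"ekg:Company/{domain_id}"),
        (fact, "ekg:predicate", f"ekg:{relation}"),
        (fact, "ekg:object", f"ekg:Company/{range_id}"),
    ]
    for prop, value in meta.items():
        triples.append((fact, f"ekg:{prop}", value))
    return triples

# Example: one row from a hypothetical atomic entity table "Company".
company_rows = [{"company_id": "c001", "name": "Acme Ltd", "founded": "1999"}]
print(entity_table_to_triples("Company", company_rows, "company_id"))

# Example: a shareholder relation carrying a meta property "ratio".
print(relation_with_meta_to_triples("shareholder", "c001", "c002",
                                    {"ratio": "0.35"}, "f001"))
```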

Describe the Information Extraction process.

The EKG extracts information from various data sources, dealing with both semi-structured and unstructured data. Here, we need to extract different types of entities, binary relations, attribute-value pairs, events, and synonyms. We adopt a multi-strategy learning method to extract these multiple types of data. The process is as follows:

  1. Use HTML wrappers to extract entities and attribute-value pairs for patent, stock, and bidding information

  2. Attribute-value pairs of enterprises are extracted from the infoboxes of encyclopedia pages using HTML wrappers

  3. Binary relation, event, and synonym extraction from free text requires seed annotations in sentences in order to learn patterns. Here, the quality of the extracted information depends heavily on the number of annotated sentences, but manual annotation is costly. Therefore, we use a set of Hearst patterns to extract data from free text. For example, the Hearst pattern “X (acquire) Y” allows us to extract triples such as (companyA, acquire, companyB). These extracted triples are used to label free text, allowing us to perform distantly supervised learning: we generate extraction patterns from the annotated sentences, score each pattern by how many sentences it matches, and keep only patterns above a certain threshold. The retained patterns are then used to extract new information from other free texts, and the newly extracted information is added to the annotations. The whole process iterates until no new information can be extracted (see the sketch after this list)
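
Below is a minimal sketch of this iterative, distantly supervised pattern learner. The regex-based patterns, the toy sentences, the single "acquire" relation, and the support threshold are assumptions; the real system would use richer patterns and a proper scoring function.

```python
import re

# Seed Hearst-style pattern for the "acquire" relation (assumption for illustration).
seed_patterns = [r"(?P<X>\w+) acquired (?P<Y>\w+)"]
MIN_SUPPORT = 1   # toy threshold; a real system would require much more support

def extract(patterns, sentences):
    """Apply every known pattern to every sentence and collect triples."""
    triples = set()
    for pat in patterns:
        for s in sentences:
            m = re.search(pat, s)
            if m:
                triples.add((m.group("X"), "acquire", m.group("Y")))
    return triples

def induce_patterns(triples, sentences):
    """Sentences that mention a known (X, Y) pair are turned into new
    extraction patterns; only patterns with enough support are kept."""
    counts = {}
    for x, _, y in triples:
        for s in sentences:
            if x in s and y in s:
                template = s.replace(x, r"(?P<X>\w+)").replace(y, r"(?P<Y>\w+)")
                counts[template] = counts.get(template, 0) + 1
    return [p for p, c in counts.items() if c >= MIN_SUPPORT]

sentences = [
    "CompanyA acquired CompanyB",
    "CompanyA completed its takeover of CompanyB",
    "CompanyC completed its takeover of CompanyD",
]

# Iterate: extract, re-annotate, induce new patterns, stop when nothing new appears.
patterns, known = list(seed_patterns), set()
while True:
    new = extract(patterns, sentences) - known
    if not new:
        break
    known |= new
    patterns = list(set(patterns) | set(induce_patterns(known, sentences)))

print(known)
```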

To deal with abbreviations, we used entity linking to link a company mentioned in free texts to companies in the EKG. We accomplished this in two steps:

  1. Candidate detection

  2. Disambiguation

In the candidate detection step, we want to find candidate entities in the knowledge base that are referred to by each mention. This requires us to normalise company names so that we can compute the similarity between the core word of the mention and the core word of the entity in the knowledge base. We compute context similarity as the cosine similarity between the sentence containing the mention and the text description of the entity in the knowledge base. In the disambiguation step, we select the most probable candidate to link, as sketched below.
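
Here is a minimal sketch of the two-step entity linking. The normalisation rules, the toy knowledge base, and the bag-of-words context model are assumptions for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical in-memory knowledge base (assumption for illustration).
KB = {
    "ekg:Company/c001": {
        "name": "Acme Technology Co., Ltd.",
        "description": "Acme Technology develops industrial sensors and software",
    },
    "ekg:Company/c002": {
        "name": "Acme Trading Co., Ltd.",
        "description": "Acme Trading imports and exports agricultural products",
    },
}

def normalise(name):
    """Strip common suffixes so the core word of a mention can be compared
    with the core word of an entity name."""
    name = name.lower()
    for suffix in ("co., ltd.", "ltd.", "inc.", "corp."):
        name = name.replace(suffix, "")
    return name.strip()

def candidates(mention):
    """Step 1: candidate detection by core-word overlap."""
    core = normalise(mention)
    return [eid for eid, e in KB.items() if core in normalise(e["name"])]

def cosine(a, b):
    """Bag-of-words cosine similarity between two pieces of text."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(va[w] * vb[w] for w in set(va) & set(vb))
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def link(mention, sentence):
    """Step 2: disambiguation -- pick the candidate whose description is
    most similar to the sentence containing the mention."""
    cands = candidates(mention)
    if not cands:
        return None
    return max(cands, key=lambda eid: cosine(sentence, KB[eid]["description"]))

print(link("Acme", "Acme released a new line of industrial sensors this week"))
```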

Describe the Data Fusion with Instance Matching process.

We need a method to fuse all of this data into the EKG. This involves instance matching of companies, people, and other concepts. The challenge is that instance matching can be difficult when there are no shared attributes to connect the different data sources.
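
As a rough illustration of what instance matching involves, the sketch below scores a pair of company records from two sources by attribute similarity. The attributes, weights, and threshold are assumptions; when sources share no such attributes, this kind of approach breaks down, which is exactly the difficulty noted above.

```python
# Minimal sketch of attribute-based instance matching between two sources.

def name_similarity(a, b):
    """Jaccard similarity over name tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def match_score(rec_a, rec_b):
    """Weighted combination of name similarity and an identifying attribute;
    the weights and attribute names are assumptions for illustration."""
    score = 0.6 * name_similarity(rec_a["name"], rec_b["name"])
    if rec_a.get("registration_no") and rec_a.get("registration_no") == rec_b.get("registration_no"):
        score += 0.4
    return score

source_a = {"name": "Acme Technology Co Ltd", "registration_no": "91310000X"}
source_b = {"name": "Acme Technology", "registration_no": "91310000X"}

print(match_score(source_a, source_b) >= 0.7)   # treat as the same company
```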

Describe the Storage Design and Query Optimisation process.

We designed our own triple store on top of MongoDB. We chose MongoDB because of its large user base, good query performance, capacity for mass data storage, and scalability. Query performance is improved using several methods, illustrated in the sketch after this list:

  1. Design a storage structure that supports efficient querying of meta properties and n-ary relations

  2. Use the in-memory database Redis to store data that is heavily queried

  3. Construct sufficient indexes, including additional indexes on meta properties and n-ary relations

  4. Apply data sharding, partitioning triples into different tables according to the data type of the property value
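
A minimal sketch of such a triple store layered on MongoDB with a Redis cache is shown below. The collection names, document layout, index choices, and sharding rule are assumptions for illustration, not the EKG's actual storage design.

```python
import json

import redis
from pymongo import ASCENDING, MongoClient

mongo = MongoClient("mongodb://localhost:27017")
db = mongo["ekg"]
cache = redis.Redis(host="localhost", port=6379)

# Data sharding: route a triple to a collection by the datatype of its value.
def collection_for(value):
    if isinstance(value, (int, float)):
        return db["triples_numeric"]
    if isinstance(value, str) and value.startswith("ekg:"):
        return db["triples_object"]      # object properties (entity-valued)
    return db["triples_literal"]

def insert_triple(s, p, o, meta=None):
    doc = {"s": s, "p": p, "o": o}
    if meta:                             # meta properties stored on the fact itself,
        doc["meta"] = meta               # keeping n-ary relations queryable in one document
    collection_for(o).insert_one(doc)

# Indexes on subject/property pairs, plus an additional index on a meta property.
for col in (db["triples_numeric"], db["triples_object"], db["triples_literal"]):
    col.create_index([("s", ASCENDING), ("p", ASCENDING)])
    col.create_index([("p", ASCENDING), ("o", ASCENDING)])
    col.create_index("meta.ratio", sparse=True)

def query_outgoing(s, p):
    """Look up (s, p, ?o), serving heavily queried keys from Redis first."""
    key = f"{s}|{p}"
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    result = [d["o"]
              for col in (db["triples_numeric"], db["triples_object"], db["triples_literal"])
              for d in col.find({"s": s, "p": p})]
    cache.setex(key, 3600, json.dumps(result))   # cache the hot query result
    return result

insert_triple("ekg:Company/c001", "ekg:shareholder", "ekg:Person/p001", meta={"ratio": 0.35})
print(query_outgoing("ekg:Company/c001", "ekg:shareholder"))
```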

Ryan

Data Scientist