Why is entity matching important in knowledge graphs?

Entity matching allows you to build a knowledge graph without excessive duplication of the same entity. It is also important when you want to update your knowledge graph: new entities must be matched against the entities already in the graph.

What are the typical steps for entity matching?
1. Blocking

2. Matching

Blocking is where you use heuristic rules to reduce the number of tuple pairs you need to consider for matching. For example, you might use one of the column features as a determiner: only pairs that are sufficiently similar on that column are considered for matching. Blocking uses a variety of similarity measures. Once you have done blocking, you can perform matching using ML algorithms such as a random forest.
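The blocking idea above can be sketched with a simple rule: drop any tuple pair whose token Jaccard similarity on one column falls below a threshold. The `title` column, the tables, and the threshold here are illustrative assumptions, not part of the original notes.

```python
# A minimal blocking sketch: a heuristic rule keeps only tuple pairs
# whose token Jaccard similarity on a hypothetical "title" column
# clears a threshold, pruning pairs before the expensive matching step.

def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def block(table_a, table_b, threshold=0.4):
    """Keep only tuple pairs whose title similarity clears the threshold."""
    return [
        (ra, rb)
        for ra in table_a
        for rb in table_b
        if jaccard(ra["title"], rb["title"]) >= threshold
    ]

A = [{"title": "Deep Learning for NLP"}, {"title": "Graph Neural Networks"}]
B = [{"title": "Deep Learning Methods for NLP"}, {"title": "Databases 101"}]

# Only the near-duplicate "Deep Learning ..." pair survives blocking,
# so the matcher later looks at 1 pair instead of all 4.
candidates = block(A, B)
```

Note how blocking turns a quadratic comparison over all pairs into a much smaller candidate set, which is the whole point of the step.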

How can we train an ML model to learn the blocking rules?
1. Take a sample S of tuple pairs from tables A and B

2. Convert each tuple pair in S into feature numerical vectors (S’)

3. Train a random forest F on S’

4. Extract candidate blocking rules from F

5. Find and return a good rule sequence
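The five steps above can be sketched end to end. To stay self-contained, this sketch hand-rolls candidate threshold rules instead of extracting them from a real random forest's trees; the feature names (`name_sim`, `city_sim`), the sample data, and the thresholds are all illustrative assumptions.

```python
# Sketch of learning blocking rules: given a labelled sample S' of
# feature vectors (step 2), generate candidate rules of the form
# "drop the pair if feature < t" (steps 3-4, where a real system would
# mine these from the negative branches of forest trees), then keep
# the safe, high-coverage rules (step 5).

# Labelled sample S': (name_sim, city_sim) vectors with match labels.
sample = [
    ((0.9, 1.0), True), ((0.85, 0.9), True),
    ((0.2, 0.1), False), ((0.1, 0.9), False),
    ((0.95, 0.2), True), ((0.05, 0.05), False),
]

# Candidate blocking rules: (feature index, threshold) pairs.
candidates = [(f_idx, t) for f_idx in range(2) for t in (0.3, 0.5, 0.7)]

def rule_is_safe(f_idx, t):
    """A blocking rule is safe if it never drops a true match in the sample."""
    return all(not (vec[f_idx] < t and label) for vec, label in sample)

def coverage(f_idx, t):
    """Fraction of non-matching pairs the rule prunes."""
    negs = [vec for vec, label in sample if not label]
    return sum(vec[f_idx] < t for vec in negs) / len(negs)

# Keep safe rules, ranked by how many non-matching pairs they prune.
good_rules = sorted(
    (r for r in candidates if rule_is_safe(*r)),
    key=lambda r: coverage(*r), reverse=True,
)
best = good_rules[0]
```

In this toy sample, only rules on `name_sim` survive: every `city_sim` threshold would drop the true match with `city_sim = 0.2`, which illustrates why "find a good rule sequence" must check rules against labelled matches.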

How do we do entity matching once we have the blocking rules?
1. Apply blocking rules to tables A and B to get set of tuple pairs C

2. Define a set of features based on the schemas of A and B

3. Convert C into feature vectors C’

4. Train a random forest G on C’

5. G is now the matcher and is applied to all remaining pairs in C’ to predict whether each pair is a match
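The matching stage above can be sketched as follows. To keep the example dependency-free, the "random forest" here is a miniature hand-rolled forest of decision stumps trained on bootstrap resamples, standing in for a real random forest library; the feature names and all data values are illustrative assumptions.

```python
import random

# Sketch of steps 3-5: train a forest G on labelled feature vectors C',
# then apply G to the unlabelled candidate pairs that survived blocking.

def train_stump(data):
    """Pick the (feature, threshold) stump with the fewest errors on data."""
    best, best_err = None, float("inf")
    for f in range(len(data[0][0])):
        for t in (0.25, 0.5, 0.75):
            err = sum((vec[f] >= t) != label for vec, label in data)
            if err < best_err:
                best, best_err = (f, t), err
    return best

def train_forest(data, n_trees=5, seed=0):
    """Train each stump on a bootstrap resample, as a random forest would."""
    rng = random.Random(seed)
    return [train_stump([rng.choice(data) for _ in data]) for _ in range(n_trees)]

def predict(forest, vec):
    """Majority vote of the stumps: True means 'match'."""
    votes = sum(vec[f] >= t for f, t in forest)
    return votes * 2 > len(forest)

# Labelled (name_sim, address_sim) vectors from C' with match labels.
train = [
    ((0.9, 0.8), True), ((0.95, 0.7), True), ((0.85, 0.9), True),
    ((0.2, 0.3), False), ((0.1, 0.2), False), ((0.3, 0.1), False),
]
G = train_forest(train)

# Apply the trained matcher G to remaining candidate pairs.
remaining = [(0.92, 0.88), (0.15, 0.2)]
labels = [predict(G, vec) for vec in remaining]
```

The high-similarity pair is predicted as a match and the low-similarity pair as a non-match, mirroring what a real random forest matcher would do at scale.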

What is a heterogeneous knowledge graph?

It’s a type of knowledge graph that includes different types of entities and relations. Typical tasks on such a graph (here, an academic graph) include:

1. Entity resolution

2. Name disambiguation

3. Spam / Fake papers identification

4. Research topics identification

5. Inference on future scientific impact

6. Inference on collaboration and team formation

7. Inference on future paper title

Can we pretrain with heterogeneous knowledge graphs and leverage the model to solve different downstream tasks and applications?

We introduced the Heterogeneous Graph Transformer (HGT), a model that learns to capture complex dependencies among entities with different types, relationships, and timestamps (no labelled data). There are three main components:

1. Heterogeneous Mutual Attention

2. Heterogeneous Message Passing

3. Target-Specific Aggregation

The first step is to compute the mutual attention between the source nodes and the target node. The second step is to compute the message propagated from each source node to the target node. The third step is to aggregate the outputs of steps 1 and 2 to update the target node’s representation.
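The three steps above can be illustrated with a heavily simplified, dependency-free sketch for a single target node. Real HGT uses learned, type-specific weight matrices and multi-head attention; here the per-type projections are stand-in scalars, and the node features and types are illustrative assumptions.

```python
import math

# Simplified sketch of HGT's three components for one target node.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def scale(v, s):
    return [x * s for x in v]

# Per-type projection "weights" (assumed scalars for brevity; real HGT
# learns matrices per node type and per edge type).
KEY_PROJ = {"author": 0.5, "paper": 1.0}
MSG_PROJ = {"author": 0.8, "paper": 1.2}

target = {"type": "paper", "h": [1.0, 0.0]}
sources = [
    {"type": "author", "h": [0.6, 0.4]},
    {"type": "paper",  "h": [0.1, 0.9]},
]

# Step 1: heterogeneous mutual attention -- softmax over type-projected
# key/query scores between each source node and the target node.
scores = [dot(scale(s["h"], KEY_PROJ[s["type"]]), target["h"]) for s in sources]
exps = [math.exp(x) for x in scores]
attn = [e / sum(exps) for e in exps]

# Step 2: heterogeneous message passing -- a type-specific transform of
# each source node's representation.
messages = [scale(s["h"], MSG_PROJ[s["type"]]) for s in sources]

# Step 3: target-specific aggregation -- attention-weighted sum of the
# messages, which would then update the target's representation.
aggregated = [
    sum(a * m[i] for a, m in zip(attn, messages))
    for i in range(len(target["h"]))
]
```

The key design point this sketch preserves is that every projection is indexed by node type, so authors and papers are transformed by different weights before attention and message passing.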

What’s event-driven info-NCE training?

It’s a learning objective that guides a single model to learn all the relationship types simultaneously.
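The InfoNCE part of this objective can be sketched directly: the model scores one positive pair against a set of negatives, and the loss is the negative log of the positive's softmax probability. The scores and temperature below are illustrative numbers, not values from the source.

```python
import math

# Minimal InfoNCE sketch: -log( exp(pos/T) / (exp(pos/T) + sum exp(neg/T)) )

def info_nce(pos_score, neg_scores, temperature=0.1):
    """InfoNCE loss for one positive pair against in-batch negatives."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    m = max(logits)  # subtract the max for numerical stability
    denom = sum(math.exp(x - m) for x in logits)
    return -(logits[0] - m - math.log(denom))

# A well-separated positive yields a small loss; a poorly separated
# positive yields a larger loss, pushing the model to separate them.
good = info_nce(0.9, [0.1, 0.2, 0.0])
bad = info_nce(0.3, [0.25, 0.2, 0.35])
```

Minimizing this loss simultaneously pulls positive pairs together and pushes negatives apart, which is what lets one model handle many relationship types at once.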