The Two-Step Framework

Given two entity sets E1 and E2, an entity pair (e1, e2) is bilateral if e1 is in E1 and e2 is in E2. To mitigate quadratic complexity of generating all possible bilateral pairs, we can follow a two-step approach as shown in the figure below.

1. Blocking – uses a blocking key, a many-many function to cluster approx. similar entities into overlapping blocks. Only entities sharing the same block can be bilaterally paired

2. Similarity – Link specification function to evaluate candidates. This could be Boolean or probabilistic and it’s used to determine if a candidate entity pair represents the same underlying entity

Blocking

Given a set M of mention nodes, a blocking key K is a many-many function that takes a mention m as input and returns a non-empty set of literals. Note that blocking key can generate multiple blocking key values for each node. Given a blocking key, a candidate set of mention pairs can be generated by a blocking method using the blocking key values of the mentions. There are three popular methods, all of which assume a blocking key has been specified by the user:

2. Sorted Neighbourhood

3. Canopies

Given a blocking key, generate the candidate set C as the set. If two mention nodes share a blocking key, they would be paired and inserted into C. The problem with traditional blocking is data skew. For example, two People mentions are blocked based on tokens in their last names. Last name frequencies in many countries tend to bias towards certain values. This skew means that run-time of blocking methods ends up roughtly the same as the number of pairs generated by the largest block, which means it’s still roughly quadratic. A solution to this is block purging, where blocks that are too large can be safely ignored as large blocks tend to be indexed by blocking key values that are similar to stop words. The block purging requires a purging threshold and discard all blocks that have more pairs than this threshold.

Sorted Neighbourhood

The algorithm is as follows:

1. A single blocking key value is generated for each mention using a many-one blocking key

2. The blocking key values are used as sorting keys to impose ordering on the mentions

3. A window of constant size w is slid over the sorted list and all mentions sharing a window are paired and added to the candidate set

The workflow is illustrated below. The sliding window serve two implications for candidate set generation:

1. Mentions with different blocking key values may still get paired

2. Some mentions with the same blocking key value may not be paired

With window size smaller than the total number of mentions, this algorithm has linear time and space complexity. The main disadvantage is their reliance on a single-valued blocking key. This was counter by using the multi-pass Sorted Neighbourhood, where multiple blocking keys could be used to improve coverage.

Canopies

Canopies is a clustering method that’s apply to blocking. It uses a distance function and two threshold parameters, tight (>0) and loose (>= tight). The algorithm works as follows:

1. A seed mention m is randomly chosen from M

2. All mentions that have distance less than loose are assigned to the canopy represented by m

3. Among these, the mentions with distance less than tight are removed from M

4. Repeat this process with another seed mention

Here, each canopy represents a block. For distance function, the method has been found to work well with a number of token-based set similarity measures such as Jaccard and cosine similarity.

Data Scientist