The TwoStep Framework
Given two entity sets E1 and E2, an entity pair (e1, e2) is bilateral if e1 is in E1 and e2 is in E2. To mitigate quadratic complexity of generating all possible bilateral pairs, we can follow a twostep approach as shown in the figure below.

Blocking – uses a blocking key, a manymany function to cluster approx. similar entities into overlapping blocks. Only entities sharing the same block can be bilaterally paired

Similarity – Link specification function to evaluate candidates. This could be Boolean or probabilistic and it’s used to determine if a candidate entity pair represents the same underlying entity
Blocking
Given a set M of mention nodes, a blocking key K is a manymany function that takes a mention m as input and returns a nonempty set of literals. Note that blocking key can generate multiple blocking key values for each node. Given a blocking key, a candidate set of mention pairs can be generated by a blocking method using the blocking key values of the mentions. There are three popular methods, all of which assume a blocking key has been specified by the user:

Traditional Blocking

Sorted Neighbourhood

Canopies
Traditional Blocking
Given a blocking key, generate the candidate set C as the set. If two mention nodes share a blocking key, they would be paired and inserted into C. The problem with traditional blocking is data skew. For example, two People mentions are blocked based on tokens in their last names. Last name frequencies in many countries tend to bias towards certain values. This skew means that runtime of blocking methods ends up roughtly the same as the number of pairs generated by the largest block, which means it’s still roughly quadratic. A solution to this is block purging, where blocks that are too large can be safely ignored as large blocks tend to be indexed by blocking key values that are similar to stop words. The block purging requires a purging threshold and discard all blocks that have more pairs than this threshold.
Sorted Neighbourhood
The algorithm is as follows:

A single blocking key value is generated for each mention using a manyone blocking key

The blocking key values are used as sorting keys to impose ordering on the mentions

A window of constant size w is slid over the sorted list and all mentions sharing a window are paired and added to the candidate set
The workflow is illustrated below. The sliding window serve two implications for candidate set generation:

Mentions with different blocking key values may still get paired

Some mentions with the same blocking key value may not be paired
With window size smaller than the total number of mentions, this algorithm has linear time and space complexity. The main disadvantage is their reliance on a singlevalued blocking key. This was counter by using the multipass Sorted Neighbourhood, where multiple blocking keys could be used to improve coverage.
Canopies
Canopies is a clustering method that’s apply to blocking. It uses a distance function and two threshold parameters, tight (>0) and loose (>= tight). The algorithm works as follows:

A seed mention m is randomly chosen from M

All mentions that have distance less than loose are assigned to the canopy represented by m

Among these, the mentions with distance less than tight are removed from M

Repeat this process with another seed mention
Here, each canopy represents a block. For distance function, the method has been found to work well with a number of tokenbased set similarity measures such as Jaccard and cosine similarity.