Once you have the candidate set C of mention pairs, the next step is to compute similarities to determine which pairs in C are duplicates. Each mention pair can be scored independently, with a higher score meaning a greater chance that the pair is a duplicate. There are two issues:

  1. What methods do we use to compute the score for each mention pair? Ideally, the method should produce scores that are as close to the ground truth as possible.

  2. The scores aren’t binary, so how do we use them to separate the set C into duplicate and non-duplicate pairs?
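One simple way to handle the second issue is a fixed cut-off on the score. The sketch below assumes scores lie in [0, 1] and that a threshold of 0.5 is reasonable; both are illustrative assumptions, and the function name `split_by_threshold` is hypothetical. In practice the threshold would usually be tuned on labelled pairs.

```python
def split_by_threshold(scored_pairs, threshold=0.5):
    """Partition scored mention pairs into duplicates and non-duplicates.

    scored_pairs: iterable of (pair, score) tuples, score assumed in [0, 1].
    threshold: assumed cut-off; pairs scoring at or above it count as duplicates.
    """
    duplicates, non_duplicates = [], []
    for pair, score in scored_pairs:
        if score >= threshold:
            duplicates.append(pair)
        else:
            non_duplicates.append(pair)
    return duplicates, non_duplicates


# Toy scores, made up purely for illustration.
scored = [(("m1", "m2"), 0.91), (("m1", "m3"), 0.12), (("m4", "m5"), 0.50)]
dups, non_dups = split_by_threshold(scored)
```

Raising the threshold trades recall for precision, so the right value depends on how costly a wrong merge is compared with a missed duplicate.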

In machine-learning-based entity resolution, each mention pair in C is converted to a numeric feature vector. Given n properties and m similarity functions, applying every function to every property yields a feature vector with m × n elements. That many features can hurt generalisation, because the features are highly likely to be correlated with each other. There are a few ways to address this:

  1. Apply a feature selection method to the full set of possible features.

  2. Use domain knowledge to engineer only a few relevant features for each property.
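To make the m × n count concrete, here is a minimal sketch that applies every function to every property of a mention pair. The three similarity functions, the property names `name` and `city`, and the dictionary representation of a mention are all assumptions made for illustration; only the Python standard library is used.

```python
from difflib import SequenceMatcher


def exact_match(a, b):
    """1.0 if the two strings are identical, else 0.0."""
    return 1.0 if a == b else 0.0


def sequence_ratio(a, b):
    """Character-level similarity from difflib, in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def token_jaccard(a, b):
    """Jaccard overlap of lower-cased whitespace tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# m = 3 functions and n = 2 properties (both lists are illustrative).
FEATURE_FUNCTIONS = [exact_match, sequence_ratio, token_jaccard]
PROPERTIES = ["name", "city"]


def feature_vector(mention_a, mention_b):
    """Apply every function to every property: m x n features per pair."""
    return [
        func(mention_a.get(prop, ""), mention_b.get(prop, ""))
        for prop in PROPERTIES
        for func in FEATURE_FUNCTIONS
    ]


a = {"name": "Jon Smith", "city": "London"}
b = {"name": "John Smith", "city": "London"}
vec = feature_vector(a, b)  # 3 functions x 2 properties = 6 features
```

Restricting `FEATURE_FUNCTIONS` per property, for example only exact match for a postcode, is one way to realise the second option in the list above.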

What feature functions should we use? There are many functions for string and phonetic similarity, but far fewer for numeric and date types. Below is a list of popular functions.

What should we do when a mention has multiple values for a property? We could use a two-layer similarity function, where the first layer is an atomic similarity function applied to individual value pairs and the second layer aggregates those scores into one. However, the more complicated the similarity function, the more degrees of freedom it tends to have, and therefore the more options there are to explore.
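A two-layer function of this kind could look like the sketch below. It assumes `difflib`'s ratio as the atomic layer and `max` as the default aggregation, and it returns 0.0 when either side has no values; these are illustrative choices rather than a recommendation, and the function names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import product


def atomic_similarity(a, b):
    """First layer: case-insensitive similarity between two single values."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def two_layer_similarity(values_a, values_b,
                         atomic=atomic_similarity, aggregate=max):
    """Second layer: aggregate atomic scores over all cross pairs of values.

    Returns 0.0 when either mention has no value for the property
    (an assumed convention for missing data).
    """
    if not values_a or not values_b:
        return 0.0
    scores = [atomic(a, b) for a, b in product(values_a, values_b)]
    return aggregate(scores)


# One mention lists two name variants, the other lists one.
best = two_layer_similarity(["Bob Smith", "Robert Smith"], ["robert smith"])
worst = two_layer_similarity(["Bob Smith", "Robert Smith"], ["robert smith"],
                             aggregate=min)
```

Swapping the `atomic` or `aggregate` argument changes the behaviour considerably, which illustrates the degrees of freedom mentioned above: every combination is another option to evaluate.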


