For computational feasibility, it is best to restrict the size of the candidate set Y(x). One approach is to use a name dictionary, which maps strings (potential mentions) to entities; this is a many-to-many mapping. A name dictionary can be built from Wikipedia by linking each entity page to the anchor text of all hyperlinks that point to that page. Recall can be improved with approximate matching, but this may increase false positives as the set of candidates grows.
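A minimal sketch of candidate generation with such a dictionary; the anchor-text pairs and entity IDs below are invented for illustration:

```python
from collections import defaultdict

def build_name_dictionary(anchor_pairs):
    """Build a many-to-many name dictionary from (anchor_text, entity) pairs,
    e.g. harvested from Wikipedia hyperlink anchor texts."""
    dictionary = defaultdict(set)
    for anchor, entity in anchor_pairs:
        dictionary[anchor.lower()].add(entity)
    return dictionary

def candidates(dictionary, mention):
    """Return the candidate set Y(x) for a mention string."""
    return dictionary.get(mention.lower(), set())

# Hypothetical pairs, as might be extracted from Wikipedia hyperlinks:
pairs = [
    ("Jamaica", "Jamaica_(country)"),
    ("Jamaica", "Jamaica_(beverage)"),
    ("the island nation", "Jamaica_(country)"),
]
name_dict = build_name_dictionary(pairs)
```

Lower-casing the anchor text is one simple form of approximate matching; fuzzier matching (e.g. edit distance) would raise recall further at the cost of a larger candidate set.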
Feature-based vs Neural entity linking
Feature-based approaches to entity linking rely on three main types of local information:
Similarity between the mention string and the canonical entity name
The popularity of the entity, which can be measured by Wikipedia page views or PageRank
The entity type (given by NER), which helps eliminate candidates belonging to irrelevant entity types
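The three local signals above can be combined into a simple candidate score. The sketch below uses word-overlap (Jaccard) similarity for the name match and log-scaled page views for popularity; the equal weighting is an arbitrary choice for illustration, not a recipe from the text:

```python
import math

def jaccard(a, b):
    """Word-overlap similarity between mention string and entity name."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def local_score(mention, entity_name, page_views, mention_type, entity_type):
    # Type filter: discard candidates of an irrelevant entity type
    if mention_type != entity_type:
        return float("-inf")
    # Combine string similarity with log-scaled popularity
    return jaccard(mention, entity_name) + math.log1p(page_views)
```

In practice these signals would be separate features with weights learned from labelled data rather than fixed by hand.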
In addition to local features, document context can also help determine which entity a mention refers to. For example, if “Jamaica” is mentioned in a document about the Caribbean, it probably refers to the island nation, but in the context of a menu it might refer to a tea beverage. This can be implemented by computing the similarity between the Wikipedia page describing each candidate entity and the mention's context, e.g. using TF-IDF vectors.
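One way to realise this with only the standard library is to build TF-IDF vectors for the entity pages and the mention context, then rank candidates by cosine similarity. The toy documents are invented for the Jamaica example:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute a sparse TF-IDF vector (dict) for each tokenised document."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    idf = {w: math.log(n / df[w]) for w in df}
    return [{w: tf * idf[w] for w, tf in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [
    ["island", "nation", "caribbean", "reggae"],  # Jamaica (country) page
    ["hibiscus", "tea", "beverage", "menu"],      # Jamaica (beverage) page
    ["restaurant", "menu", "tea"],                # mention context
]
country, beverage, context = tfidf_vectors(docs)
```

Here the menu-like context is more similar to the beverage page than to the country page, matching the intuition above.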
On the other end of the spectrum are neural approaches to entity linking, which compute a score for each candidate entity from embeddings of the entity, the mention, and the context words. Yang et al. (2016) employ a bilinear scoring function for entity linking on Twitter, of the form Ψ(y, x, c) = v_y^T A v_x + v_y^T B v_c, where v_y, v_x, v_c are the entity, mention, and context embeddings and A, B are learned matrices:
Entity embeddings can be obtained from an existing knowledge base or by applying a word embedding algorithm (e.g. GloVe) to Wikipedia text
The embedding of the mention x can be obtained by averaging the embeddings of all the words in the mention
The embedding of the context c can be computed similarly to that of the mention. An alternative is to train a denoising autoencoder, which learns a function from raw text to dense K-dimensional vector encodings by minimising reconstruction loss
Both matrices can be trained by backpropagation from a margin loss
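A sketch of the bilinear score and the margin (hinge) loss in plain Python; the gradient updates are omitted, and the exact formulation in Yang et al. (2016) may differ in detail:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def bilinear_score(v_y, v_x, v_c, A, B):
    """Score an entity candidate: v_y^T A v_x + v_y^T B v_c,
    where A and B are the two learned matrices."""
    return dot(v_y, matvec(A, v_x)) + dot(v_y, matvec(B, v_c))

def margin_loss(score_gold, score_neg, margin=1.0):
    """Hinge loss: push the gold entity's score above a negative
    candidate's score by at least the margin."""
    return max(0.0, margin - score_gold + score_neg)

# Toy 2-d example with identity matrices (illustration only)
I2 = [[1.0, 0.0], [0.0, 1.0]]
```

During training, the loss is zero once the gold entity outscores each negative candidate by the margin, so only violating pairs contribute gradients to A and B.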