Named Entity Recognition (NER)
NER is usually the first task in an information extraction pipeline. Given a document and a set of entity types (defined in an ontology), a NER system returns a set of extracted mentions. Entity resolution is the related task of resolving all mentions that refer to the same entity into a single canonical entity; this usually includes coreference resolution.
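To make the two steps concrete, here is a minimal toy sketch of mention extraction followed by entity resolution. It uses a hand-built gazetteer rather than a learned model, and all names, spans, and types are illustrative assumptions, not a real ontology:

```python
import re

# Toy gazetteer mapping surface forms to (canonical entity, type).
# Entries are illustrative only; a real system would learn these.
GAZETTEER = {
    "Barack Obama": ("Barack Obama", "PERSON"),
    "Obama": ("Barack Obama", "PERSON"),
    "Honolulu": ("Honolulu", "LOCATION"),
}

def extract_mentions(text):
    """Mention detection: return (surface form, span, type) for each hit."""
    mentions = []
    for surface, (canonical, etype) in GAZETTEER.items():
        for m in re.finditer(re.escape(surface), text):
            mentions.append((surface, m.span(), etype))
    return mentions

def resolve(mentions):
    """Entity resolution: map each surface form to its canonical entity."""
    return {surface: GAZETTEER[surface][0] for surface, _, _ in mentions}

text = "Barack Obama was born in Honolulu. Obama later moved."
mentions = extract_mentions(text)
entities = resolve(mentions)
# "Obama" and "Barack Obama" both resolve to the single entity "Barack Obama".
```

A learned NER system replaces the gazetteer lookup with a statistical tagger, but the mention-then-resolution flow is the same.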
Overall, the full NER task can be broken into two problems:
Detecting named entities
Classification of named entities by ontological type
The second problem involves assigning a concept to each detected named entity. The complexity increases quickly when we expand the ontology beyond a flat set of entity types. Some event ontologies are deeply hierarchical and therefore contain fine-grained classes. Ideally we would tag a detected mention with the most fine-grained type available, which is harder than assigning general concepts such as Person or Location.
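The idea of preferring the most fine-grained type can be sketched with a toy hierarchy. The classes below (Politician, Athlete, City) are hypothetical examples, not taken from any particular ontology:

```python
# Toy type hierarchy mapping child -> parent; classes are illustrative only.
HIERARCHY = {
    "Politician": "Person",
    "Athlete": "Person",
    "City": "Location",
    "Person": "Entity",
    "Location": "Entity",
}

def ancestors(t):
    """All types from t up to the root, most specific first."""
    chain = [t]
    while chain[-1] in HIERARCHY:
        chain.append(HIERARCHY[chain[-1]])
    return chain

def most_specific(candidates):
    """Prefer the candidate type deepest in the hierarchy."""
    return max(candidates, key=lambda t: len(ancestors(t)))

# A mention classified as both Person and Politician should be
# tagged with the finer-grained Politician.
```

The hard part in practice is producing good candidate types in the first place; picking the deepest one is the easy step.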
The most common techniques for NER are supervised methods, including HMMs, SVMs, and CRFs. These systems learn disambiguation rules from the large labelled datasets they are trained on. The performance of a baseline model largely depends on vocabulary transfer: the proportion of words that appear in both the training and test corpora. This vocabulary transfer problem means that generalisation is an important aspect of an NER model.
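One simple way to compute vocabulary transfer (here taken over distinct test-corpus words; other definitions exist) is:

```python
def vocabulary_transfer(train_tokens, test_tokens):
    """Fraction of distinct test-corpus words also seen in training.

    A low value suggests the model must generalise from context
    rather than rely on memorised surface forms.
    """
    train_vocab = set(train_tokens)
    test_vocab = set(test_tokens)
    if not test_vocab:
        return 0.0
    return len(train_vocab & test_vocab) / len(test_vocab)

train = "Paris is the capital of France".lower().split()
test = "Berlin is the capital of Germany".lower().split()
# 4 of the 6 distinct test words (is, the, capital, of) were seen in training.
score = vocabulary_transfer(train, test)
```

A model evaluated on a test set with low vocabulary transfer is being tested mostly on unseen words, which is exactly where contextual generalisation matters.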
A popular semi-supervised approach is “bootstrapping”, which requires a set of seeds to initiate the learning process. A small number of training examples is used to train the model to learn contextual clues such as surrounding words or word embeddings. With these clues, the model can then identify further entities that appear in similar contexts. By iterating this process, a large number of entities and contexts is eventually discovered.
What is mutual bootstrapping?
It involves growing a set of entities and contexts: starting with a handful of seed entities of a given type, patterns are aggregated from the contexts in which the seeds occur. These contexts provide a degree of generalisation. Empirical results have shown that this technique can reach up to 88% precision, starting from a seed of just 10 examples and slowly growing to one million facts.
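The alternation between entities and contexts can be sketched as a toy loop. This is a deliberately simplified version, assuming one-word left contexts and no pattern scoring; real mutual bootstrapping ranks patterns by reliability to limit semantic drift:

```python
import re

def contexts_for(entity, corpus):
    """Collect 'word before <ENTITY>' patterns around seed occurrences."""
    patterns = set()
    for sentence in corpus:
        m = re.search(r"(\w+) " + re.escape(entity), sentence)
        if m:
            patterns.add(m.group(1) + " <ENTITY>")
    return patterns

def entities_for(patterns, corpus):
    """Apply each learned pattern back to the corpus to find new entities."""
    found = set()
    for pattern in patterns:
        cue = pattern.split()[0]
        for sentence in corpus:
            m = re.search(re.escape(cue) + r" (\w+)", sentence)
            if m:
                found.add(m.group(1))
    return found

corpus = [
    "visited Paris yesterday",
    "visited Berlin last week",
    "toured Rome in spring",
]
seeds = {"Paris"}
for _ in range(2):  # a couple of bootstrapping iterations
    patterns = set()
    for e in seeds:
        patterns |= contexts_for(e, corpus)
    seeds |= entities_for(patterns, corpus)
# The seed "Paris" yields the pattern "visited <ENTITY>",
# which in turn discovers "Berlin"; "Rome" stays undiscovered
# because its context "toured" was never learned from a seed.
```

Without pattern scoring, a single noisy pattern can pull in unrelated words, which is why reliability measures are central to the real algorithm.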
The right unlabelled dataset can be selected using information retrieval metrics on relevancy, together with the size of the data. In addition to quantity, text with rich and varied contexts is equally important for training an NER model that generalises well.
Here, we look at clustering entities using context similarity, lexical resources, and patterns. For example, background knowledge is used to identify patterns that are specific to the domain of the dataset: named entities often appear together in news articles, whereas common nouns do not. Some research has proposed information retrieval measures, such as PMI-IR, as features to determine whether a named entity can be classified under a certain entity type. A high PMI-IR score means that the expressions tend to co-occur.
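The underlying quantity is pointwise mutual information; in PMI-IR the probabilities are estimated from search-engine hit counts. A minimal sketch with made-up document counts (the entity, type word, and all counts below are hypothetical):

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information from raw co-occurrence counts.

    In PMI-IR these counts would come from search-engine hits;
    here we use invented document counts for illustration.
    """
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical counts: "Einstein" in 100 docs, "physicist" in 200,
# both together in 80, out of 10,000 documents in total.
score = pmi(80, 100, 200, 10_000)
# A positive score indicates the two expressions co-occur more often
# than chance, supporting the candidate type assignment.
```

Under independence the score is zero, so the sign alone already separates "co-occurs more than chance" from "less than chance".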