In a domain-specific knowledge graph, features are very important. Expert and rule-based systems make use of features to perform ML tasks such as named entity recognition (NER). Features commonly used for NER can be categorised along three axes:

  1. Word-level

  2. List lookup

  3. Document and Corpus


Word-level

This group is largely self-explanatory: we derive word-level features such as word case, punctuation, special characters, and numerical values. Numerical values play an important part in NER because they can represent dates, percentages, and the like. Another word-level feature is morphology, where we deal with affixes and roots; a particular prefix or suffix can help determine whether an entity is a person. For word-level features to be useful, they must be combined with other features to build a robust learning algorithm. Character n-gram feature functions are another option: they are robust to spelling errors, and character n-gram methods work well alongside other important techniques in information retrieval.
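The word-level features above can be sketched as a simple feature function. This is a minimal illustration, not a production extractor; the padding character and n-gram size are arbitrary choices:

```python
def char_ngrams(word, n=3):
    """Character n-grams, padded with '#' so prefixes and suffixes
    are captured as distinct grams."""
    padded = f"#{word}#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def word_level_features(token):
    """Derive case, punctuation, digit, and morphology-style features
    for a single token."""
    return {
        "is_title": token.istitle(),
        "is_upper": token.isupper(),
        "has_digit": any(c.isdigit() for c in token),
        "has_punct": any(not c.isalnum() for c in token),
        "prefix3": token[:3].lower(),   # crude stand-in for affix features
        "suffix3": token[-3:].lower(),
        "char_3grams": char_ngrams(token.lower()),
    }
```

Each token's feature dictionary would then be fed, together with other feature groups, into the learning algorithm.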

List lookup

A knowledge graph that relies on background knowledge derived from public sources will find list-lookup features extremely important. A fundamental limitation of building a knowledge graph this way is that the knowledge base generally consists of well-known entities, whereas in many domains it is the less well-known entities that we are interested in. To tackle entity disambiguation, we could potentially use the following methods:

  1. Apply stemming and lemmatisation before matching

  2. Use a thresholded edit distance to match candidate words fuzzily

  3. Match phonetically against a reference list using algorithms such as Soundex and Metaphone
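The fuzzy and phonetic matching steps might look like the sketch below. The gazetteer is a toy example, `difflib`'s similarity ratio stands in for a thresholded edit distance, and the Soundex implementation is the classic first-letter-plus-three-digits variant:

```python
import difflib

# Toy reference list (assumption); a real system would load a gazetteer.
GAZETTEER = {"london", "paris", "singapore"}

def fuzzy_lookup(word, reference, threshold=0.8):
    """Fuzzy list lookup using difflib's similarity ratio as a
    stand-in for a thresholded edit distance."""
    matches = difflib.get_close_matches(word.lower(), reference,
                                        n=1, cutoff=threshold)
    return matches[0] if matches else None

def soundex(word):
    """Classic Soundex: keep the first letter, encode the rest as
    digits, drop vowels and adjacent duplicates, pad to 4 chars."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    result, prev = word[0].upper(), codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":  # h and w do not separate duplicate codes
            prev = code
    return (result + "000")[:4]
```

A misspelling like "londn" would still resolve to "london" under the fuzzy lookup, and "Robert" and "Rupert" share the Soundex code R163, so either spelling would hit the same reference entry.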

Document and Corpus

This group of features covers both document content and structure. One potential feature is to extract words that appear both uppercased and lowercased within a document; such words are likely to be common nouns rather than names. Another feature is to identify multiple occurrences of a unique entity, which falls under the coreference resolution task.


