The most common evaluation metrics for NER are precision, recall, and F1-score computed at the token level. When the predicted named entities feed a downstream task, however, it is better to evaluate NER at the full named-entity level.
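As a concrete illustration of the token-level case, here is a minimal sketch (function and tag names are illustrative, not a standard API) that computes precision, recall, and F1 for one entity tag, treating "O" (outside) as the negative class:

```python
def token_prf(gold, pred, tag):
    """Token-level precision, recall, and F1 for a single entity tag."""
    tp = sum(1 for g, p in zip(gold, pred) if g == tag and p == tag)
    fp = sum(1 for g, p in zip(gold, pred) if g != tag and p == tag)
    fn = sum(1 for g, p in zip(gold, pred) if g == tag and p != tag)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# One gold "PER" token is missed, so precision is perfect but recall drops.
gold = ["O", "PER", "PER", "O", "LOC"]
pred = ["O", "PER", "O",   "O", "LOC"]
print(token_prf(gold, pred, "PER"))  # (1.0, 0.5, 0.666...)
```

Note that this token view gives the system half credit for the partially tagged "PER" mention, which is exactly the behavior entity-level evaluation is meant to avoid.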

There are six simple scenarios when comparing a NER system's output against the gold standard at the full named-entity level:

  1. Surface string and entity type match

  2. System predicted a new entity mention

  3. System misses an entity

  4. System assigns the wrong entity type

  5. System gets the boundaries of the surface string wrong

  6. System gets both the entity type and the boundaries of the surface string wrong
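The six scenarios above can be sketched as a small classifier over entity spans. This is a hypothetical helper, assuming entities are represented as `(start, end, type)` tuples; the labels returned are illustrative, not a standard taxonomy:

```python
def scenario(gold, pred):
    """Classify one gold/predicted entity pair; either side may be None."""
    if gold is None:
        return "spurious"              # scenario 2: system predicted a new mention
    if pred is None:
        return "missed"                # scenario 3: system misses an entity
    same_span = (gold[0], gold[1]) == (pred[0], pred[1])
    same_type = gold[2] == pred[2]
    if same_span and same_type:
        return "correct"               # scenario 1: surface string and type match
    if same_span:
        return "wrong type"            # scenario 4: wrong entity type
    if same_type:
        return "wrong boundaries"      # scenario 5: boundaries are wrong
    return "wrong type and boundaries" # scenario 6: both are wrong

print(scenario((4, 6, "PER"), (4, 6, "LOC")))  # wrong type
print(scenario((4, 6, "PER"), (3, 6, "PER")))  # wrong boundaries
```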

The different evaluation schemas for NER

  1. CoNLL – a named entity is correct only if it is an exact match, in both surface string and type, of the corresponding entity in the gold standard (covers scenarios 1 – 3; partial matches count as errors)

  2. Automatic Content Extraction (ACE) – a more complex evaluation procedure based on a weighted scoring schema, where different entity types and kinds of errors carry different weights

  3. Message Understanding Conference (MUC) – an evaluation metric that considers different categories of errors. Each prediction falls into one of 5 categories: correct, incorrect, partial, missing, and spurious

  4. SemEval – introduced 4 different ways to measure precision/recall/F1-score based on the MUC categories. These are: strict, exact, partial, and type


