#### 22.6 A neural mention-ranking algorithm

###### Describe the neural mention-ranking system introduced in the paper “End-to-end neural coreference resolution” by Lee, K., He, L., Lewis, M., and Zettlemoyer, L. (2017b).

This neural system considers every span of text up to a set maximum length (that is, every n-gram up to that length) as a possible mention. The task is to assign an antecedent to each span i, chosen from the preceding spans plus a special dummy antecedent. Choosing the dummy antecedent means that span i has no antecedent, either because it is discourse-new or because it is non-anaphoric. For each pair of spans i and j, the system computes a score s(i, j) for the coreference link between the two spans. From these scores, the system learns a distribution over the candidate antecedents of span i.
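The distribution over antecedents can be sketched as a softmax over the pairwise scores. This is a minimal illustration rather than the authors' code: the function name `antecedent_distribution` is hypothetical, and it assumes the dummy antecedent has a fixed score of 0, as in Lee et al. (2017).

```python
import math

def antecedent_distribution(scores):
    """Softmax over the candidate antecedents of one span i.

    `scores` holds s(i, j) for each earlier span j. The dummy antecedent
    is prepended at index 0 with a fixed score of 0 (assumption from the paper).
    """
    all_scores = [0.0] + list(scores)      # index 0 = dummy antecedent
    top = max(all_scores)                  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in all_scores]
    total = sum(exps)
    return [e / total for e in exps]

# Three earlier spans; the first one has the highest score.
probs = antecedent_distribution([2.0, -1.0, 0.5])
```

If every real antecedent scores below 0, the dummy gets the highest probability, which is how the model predicts "no antecedent".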

###### How do we build the representation of span i?

The process consists of two parts:

1. A contextual representation of the first and last word in the span

2. A representation of the headword of the span

The contextual representations of the first and last words in the span are computed by feeding contextual embeddings such as ELMo or BERT into a biLSTM. The output for each word is the concatenation of the hidden states of the forward and backward LSTMs. The system uses attention over the words in the span to represent its headword. Overall, the representation of span i is a vector that concatenates the hidden representations of the start and end tokens of the span, the head vector (the attention-weighted vector), and a feature vector (encoding the length of span i). The representation of span i is then used to compute the mention score. The figure below illustrates the whole process.
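The concatenation described above can be sketched in plain Python. This is a toy illustration under stated assumptions: `hidden` stands in for the biLSTM outputs, `embeddings` for the input word embeddings, and `attn_scores` for per-token attention scores that a real model would learn. All names are hypothetical.

```python
import math

def span_representation(hidden, embeddings, attn_scores, start, end, width_feature):
    """Build [h_start, h_end, attention-weighted head vector, features] for a span."""
    # Softmax of the per-token attention scores, restricted to the span.
    span_scores = attn_scores[start:end + 1]
    top = max(span_scores)
    exps = [math.exp(a - top) for a in span_scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Head vector: attention-weighted sum of the token embeddings in the span.
    dim = len(embeddings[0])
    head = [sum(w * embeddings[start + k][d] for k, w in enumerate(weights))
            for d in range(dim)]
    # List concatenation plays the role of vector concatenation here.
    return hidden[start] + hidden[end] + head + width_feature

hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]       # toy biLSTM outputs
embeddings = [[2.0, 0.0], [0.0, 2.0], [4.0, 4.0]]   # toy word embeddings
g = span_representation(hidden, embeddings, [0.0, 0.0, 0.0], 0, 1, [2.0])
```

With equal attention scores the head vector is just the average of the span's embeddings. Training would instead push the attention weights towards the syntactic head.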

###### What does the score s(i, j) include?

It includes three factors:

1. m(i) – whether span i is a mention

2. m(j) – whether span j is a mention

3. c(i, j) – whether span j is the antecedent of i

s(i, j) = m(i) + m(j) + c(i, j)

The scoring function c(i, j) takes in 4 types of inputs:

1. The vector representation of span i

2. The vector representation of span j

3. The element-wise similarity of the two spans (their element-wise product)

4. A feature vector that encodes useful pairwise features, such as the distance between the two mentions

The overall computation of s(i, j) is illustrated in the figure below.
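The score s(i, j) = m(i) + m(j) + c(i, j) can be sketched as follows. Note the simplifying assumption: the paper scores mentions and pairs with feed-forward networks, while this sketch swaps in plain linear scorers (`w_m`, `w_c`) so that it stays short. All names are hypothetical.

```python
def dot(w, x):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(w, x))

def pair_input(g_i, g_j, phi):
    """Concatenate [g_i, g_j, element-wise product, pairwise features]."""
    product = [a * b for a, b in zip(g_i, g_j)]
    return g_i + g_j + product + phi

def coref_score(g_i, g_j, phi, w_m, w_c):
    """s(i, j) = m(i) + m(j) + c(i, j), with linear stand-ins for the networks."""
    m_i = dot(w_m, g_i)                           # is span i a mention?
    m_j = dot(w_m, g_j)                           # is span j a mention?
    c_ij = dot(w_c, pair_input(g_i, g_j, phi))    # is j an antecedent of i?
    return m_i + m_j + c_ij

g_i, g_j, phi = [1.0, 2.0], [3.0, 4.0], [1.0]
s_ij = coref_score(g_i, g_j, phi, w_m=[1.0, 0.0], w_c=[0.0] * 6 + [2.0])
```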

###### What does our training dataset look like for training this neural mention-ranking system?

We don’t have a single gold antecedent for each mention. Instead, the training data gives the entire cluster of coreferent mentions that each mention belongs to, so the antecedent of each mention is latent. The training objective therefore maximises the sum of the coreference probabilities over all legal antecedents, that is, all earlier mentions in the same gold cluster (or the dummy antecedent if there are none). The loss is the cross-entropy loss: the negative log of this summed probability.
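A minimal sketch of this loss for a single span, assuming the scores are already computed and that index 0 holds the dummy antecedent. The function name and argument layout are hypothetical.

```python
import math

def mention_ranking_loss(scores, gold_antecedents):
    """Negative log of the summed probability of all legal antecedents.

    scores[0] is the dummy antecedent. `gold_antecedents` lists the indices
    into `scores` of earlier spans in the same gold cluster, or [0] if the
    span has no antecedent.
    """
    top = max(scores)  # shift by the max for numerical stability
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    log_gold = top + math.log(sum(math.exp(scores[k] - top) for k in gold_antecedents))
    return log_z - log_gold

# Two equally scored candidates, only one of them legal: loss = log 2.
loss = mention_ranking_loss([0.0, 0.0], [1])
```

Because the probabilities of all legal antecedents are summed before taking the log, the model is free to put its mass on any one of them.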

#### 22.7 Evaluation of Coreference Resolution

###### What are the 5 common metrics for evaluating coreference algorithms?

Coreference algorithms are evaluated by comparing the predicted mention clusters with the gold (human-annotated) clusters. The 5 common metrics are:

1. MUC (link-based)

2. BLANC (link-based)

3. B^3 (mention-based)

4. CEAF (entity-based)

5. LEA (link-based, entity-aware)

###### What is the MUC F-measure?

The MUC F-measure is based on the number of coreference links common to the generated (hypothesis) clusters and the gold clusters. Precision is the number of common links divided by the total number of links in the generated clusters. Recall is the number of common links divided by the total number of links in the gold clusters. MUC is biased towards systems that produce large chains, and it ignores singletons, since a singleton contains no links.
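MUC is usually computed with the partition-based formulation of Vilain et al. (1995), which counts the minimum number of links needed to build each chain. A small sketch of that formulation, with hypothetical function names and clusters given as sets of mention ids:

```python
def muc_recall(key, response):
    """Sum of (|chain| - #parts) over sum of (|chain| - 1) for the key chains."""
    num = den = 0
    for chain in key:
        # Partition the chain by the response clusters; a mention that is
        # missing from the response forms a part of its own.
        parts = set()
        for m in chain:
            owner = next((idx for idx, r in enumerate(response) if m in r), None)
            parts.add(owner if owner is not None else ("unaligned", m))
        num += len(chain) - len(parts)   # links recovered for this chain
        den += len(chain) - 1            # links needed to build this chain
    return num / den if den else 0.0

def muc_scores(key, response):
    """Return (precision, recall, F1); precision swaps the two roles."""
    r = muc_recall(key, response)
    p = muc_recall(response, key)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

key = [{"A", "B", "C", "D"}]
response = [{"A", "B"}, {"C", "D"}]
p, r, f1 = muc_scores(key, response)   # p = 1.0, r = 2/3
```

A singleton chain adds 0 to both the numerator and the denominator, which is why MUC cannot reward or penalise singletons.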

###### What is the B^3 measure?

For each mention in the reference (gold) chains, we compute a precision and a recall. We then take a weighted sum over all the mentions in the document to compute the overall precision and recall. A mention counts as correct for mention i if it appears both in the hypothesis (system) chain containing i and in the reference chain containing i. The precision of mention i is therefore the number of correct mentions in the hypothesis chain containing i divided by the total number of mentions in that hypothesis chain. The recall of mention i is the number of correct mentions in the hypothesis chain containing i divided by the number of mentions in the reference chain containing i. The precision and recall of mention i are weighted by a per-mention weight w_i, commonly set uniformly to 1/N for N mentions.
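A short sketch of B^3 with system (hypothesis) chains compared against gold (reference) chains. It assumes uniform weights of 1/N and that both clusterings cover the same mentions (the gold-mention setting). The function name is hypothetical.

```python
def b_cubed(key, response):
    """Return (precision, recall) for B^3 with uniform mention weights."""
    key_of = {m: chain for chain in key for m in chain}         # gold chain of each mention
    resp_of = {m: chain for chain in response for m in chain}   # system chain of each mention
    n = len(key_of)
    precision = recall = 0.0
    for m in key_of:
        # Correct mentions: those shared by the system and gold chains of m.
        correct = len(key_of[m] & resp_of[m])
        precision += correct / len(resp_of[m]) / n
        recall += correct / len(key_of[m]) / n
    return precision, recall

key = [{"a", "b", "c"}, {"d"}]
response = [{"a", "b"}, {"c", "d"}]
bp, br = b_cubed(key, response)   # bp = 0.75, br = 2/3
```

Unlike MUC, the singleton `{"d"}` contributes to both scores here, because every mention gets its own precision and recall.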
