In summarisation, researchers mainly use ROUGE scores to evaluate the quality of summaries generated by machine learning models. ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation, and it evaluates the quality of a summarisation model by measuring the number of overlapping text units (i.e. n-grams) between model-generated and ground-truth summaries. ROUGE-1 (unigram overlap), ROUGE-2 (bigram overlap), and ROUGE-L (longest common subsequence) are the most popular ROUGE variants used to evaluate models. ROUGE-1 and ROUGE-2 are used as a proxy for measuring the informativeness of a generated summary, and ROUGE-L as a proxy for measuring its fluency. ROUGE scores can be computed using the pyrouge toolkit, which reports recall, precision, and F1-score for each ROUGE variant.
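To make the overlap computation concrete, below is a minimal sketch of the recall, precision, and F1 calculation behind ROUGE-N. This is an illustration only, not the official pyrouge toolkit, which additionally handles stemming, multiple references, and ROUGE-L.

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """Simplified ROUGE-N: returns (recall, precision, F1) over n-gram overlap.

    candidate and reference are lists of tokens. Illustrative sketch only;
    the official toolkit adds stemming and multi-reference support.
    """
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())          # clipped n-gram matches
    recall = overlap / max(sum(ref.values()), 1)   # matches / reference n-grams
    precision = overlap / max(sum(cand.values()), 1)  # matches / candidate n-grams
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return recall, precision, f1
```

Note that because only n-gram multisets are compared, a paraphrase using different words scores zero overlap even when it conveys the same content, which is precisely the limitation discussed next.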
However, there are major limitations to using the ROUGE score when evaluating abstractive summarisation models. As stated by See et al. (2017), ROUGE scores may not be the best evaluation metric for abstractive summarisation because abstractive summaries tend to be subjective: a well-paraphrased summary can still score poorly on ROUGE despite successfully capturing the salient information from the source. ROUGE scores also do not assess the readability or fluency of generated summaries. Below is an example.
The figure above compares the summaries generated by different deep learning models against the ground-truth summary. It shows that the models are very good at capturing the salient information of the source article, yet they would still score poorly against the ground-truth summary. This highlights one of the downfalls of the CNN/Daily Mail dataset and a limitation of using ROUGE scores as an evaluation metric: the ground-truth summaries created by humans are subjective. For example, in the figure above, the ground-truth summary uses the phrase "the Brazil captain" to refer to Neymar, whereas our generated summaries use the name "Neymar" directly. The ground-truth summary and our summaries refer to the same entity in different words, and this difference is penalised by ROUGE.
Other common evaluation metrics include METEOR (Lavie and Agarwal, 2007) and BLEU (Papineni et al., 2002). METEOR is similar to ROUGE but adds stemming and synonym matching, in an attempt to counter the subjectivity of human paraphrasing. BLEU is precision-oriented: it measures how many words (or n-grams) in the generated summary appear in the reference summary, clipping each n-gram count so that repetition cannot inflate the score.
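The core of BLEU, modified n-gram precision, can be sketched as follows. This is a simplified single-reference illustration; full BLEU combines the precisions for n = 1 to 4 geometrically and applies a brevity penalty to discourage overly short outputs.

```python
from collections import Counter

def modified_precision(candidate, reference, n=1):
    """Clipped n-gram precision, the building block of BLEU.

    candidate and reference are lists of tokens. Single-reference sketch;
    full BLEU averages over n = 1..4 and adds a brevity penalty.
    """
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    # Clip each candidate n-gram count by its count in the reference, so
    # repeating a reference word many times cannot inflate the score.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0
```

For instance, the degenerate candidate "the the the" against the reference "the cat" receives a clipped unigram precision of 1/3 rather than 1, which is the motivation for the clipping step in the original BLEU paper.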