Objective and Contribution

Explored different mechanisms on top of the transformer model and assessed their impact on abstractive summarisation. The mechanisms, which include n-gram blocking, coverage loss, and the pointer-generator (PG) network, attempt to alleviate the repetition and factual-inconsistency problems. Results show that ROUGE scores improved as a result of these mechanisms.


Our baseline model is the transformer. We explored two techniques to counter the repetition problem:

  1. N-gram blocking

  2. Coverage loss

What is N-gram blocking?

N-gram blocking is added to the decoder. The decoder uses beam search to construct summaries, and as it selects the next word, n-gram blocking eliminates any candidate word that would create an n-gram already present in the beam.
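As a minimal sketch of this check (the function name and list-based interface are hypothetical, not from the paper), a candidate word can be tested against the n-grams already in a beam hypothesis like so:

```python
def violates_ngram_block(tokens, candidate, n=3):
    """Return True if appending `candidate` to the hypothesis `tokens`
    would create an n-gram that already occurs in the sequence.
    Hypothetical helper illustrating trigram blocking (n=3)."""
    if len(tokens) < n - 1:
        return False
    # The n-gram that appending `candidate` would complete.
    new_ngram = tuple(tokens[-(n - 1):]) + (candidate,)
    # All n-grams already present in the hypothesis.
    existing = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return new_ngram in existing


hyp = ["the", "cat", "sat", "on", "the", "cat"]
violates_ngram_block(hyp, "sat")  # True: "the cat sat" already occurred
violates_ngram_block(hyp, "ran")  # False: "the cat ran" is new
```

During beam expansion, any candidate for which this returns True is dropped (or, equivalently, assigned a log-probability of negative infinity).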

What is coverage loss?

The coverage loss involves maintaining a coverage vector, the sum of the attention distributions over all previous decoder time steps. At each step, the coverage loss is the sum of the element-wise minimum of the current attention vector and the coverage vector, which penalises the model for attending repeatedly to the same source positions.
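A minimal sketch of this computation for a single example, using plain Python lists (in practice this would run over batched tensors):

```python
def coverage_loss(attention_steps):
    """Coverage loss in the style of See et al. (2017): at each decoder
    step t, add sum_i min(a_t[i], c_t[i]), where c_t is the running sum
    of attention distributions from steps 0..t-1 over source positions i."""
    src_len = len(attention_steps[0])
    coverage = [0.0] * src_len   # c_0 is the zero vector
    loss = 0.0
    for attn in attention_steps:
        # Penalise overlap between current attention and past coverage.
        loss += sum(min(a, c) for a, c in zip(attn, coverage))
        # Update coverage with this step's attention.
        coverage = [c + a for c, a in zip(coverage, attn)]
    return loss


coverage_loss([[1.0, 0.0], [1.0, 0.0]])  # attending twice to position 0 is penalised
coverage_loss([[1.0, 0.0], [0.0, 1.0]])  # spreading attention incurs no loss
```

The loss is zero when each source position is attended to at most once in total, and grows as attention piles up on the same positions, which is exactly the repetition pattern it is meant to discourage.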

PG Network Transformer

With the PG network, our transformer gains the ability to either copy words from the source or generate words from the vocabulary at each time step. The generation probability is computed from the decoder hidden state, the context vector, and the decoder input.
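A sketch of the generation-probability computation, assuming learned weight vectors `w_c`, `w_s`, `w_x` and bias `b` (the names and plain-list interface are illustrative, not the paper's implementation):

```python
import math

def generation_probability(context, state, x, w_c, w_s, w_x, b):
    """p_gen = sigmoid(w_c . context + w_s . state + w_x . x + b),
    where `context` is the context vector, `state` the decoder hidden
    state, and `x` the decoder input at this time step. The weight
    vectors and bias would be learned parameters in a real model."""
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    z = dot(w_c, context) + dot(w_s, state) + dot(w_x, x) + b
    return 1.0 / (1.0 + math.exp(-z))
```

The final word distribution then mixes the two modes: P(w) = p_gen * P_vocab(w) + (1 - p_gen) * (attention mass on source occurrences of w), so out-of-vocabulary source words can still be emitted through the copy term.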

Experiments and Results

The evaluation dataset is CNN/DailyMail and the evaluation metric is ROUGE. We ran different variants of our PG-Transformer:

  1. PG-Transformer

  2. PG-Transformer + Coverage

  3. PG-Transformer + N-gram blocking


The results are displayed below. Note that they come from only 24 hours of training, significantly less than the PG network (4+ days) and Sanjabi's transformer (2+ days). Comparing the transformer baseline with our PG-Transformer, the PG network improved the model's performance. Adding the coverage and n-gram blocking mechanisms improved it further, with n-gram blocking yielding a much larger performance increase than coverage loss.

The baseline transformer generated summaries with heavy repetition and could not handle out-of-vocabulary (OOV) words, as shown in the qualitative analysis below. Our PG-Transformer reduced the OOV mishandling but still suffered from factual inconsistency. With coverage added, the summaries no longer repeat individual words, although repetition starts to appear at the phrase and idea level. N-gram blocking eliminates all repetition.

Conclusion and Future Work

Future work should involve hyper-parameter tuning and extended training time to make the results more comparable to the state of the art (SOTA). We found that n-gram blocking was more effective at reducing repetition than coverage loss, and although the PG network alleviated the OOV problem, the models still produce factually inconsistent summaries.


