Objective and Contribution

Proposed BLEURT, an evaluation metric that uses BERT to model human judgements. We use a novel pre-training method to generalise our model on millions of synthetic examples before fine-tuning it on human ratings. BLEURT achieves SOTA results on the WMT Metrics shared task and the WebNLG Competition dataset. Lastly, BLEURT is robust to both limited training data and out-of-distribution data.

What’s wrong with BLEU and ROUGE?

They both rely on N-gram overlap and, as such, are very sensitive to lexical variation and do not capture semantic or syntactic variations of a given reference.
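To make the failure mode concrete, here is a toy unigram-precision scorer, a crude stand-in for N-gram-overlap metrics like BLEU (the function and example sentences are illustrative, not from the paper). A paraphrase with the same meaning scores low, while a reordering that changes the meaning scores perfectly:

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """Fraction of candidate tokens found in the reference
    (a crude stand-in for N-gram overlap metrics like BLEU)."""
    cand = candidate.lower().split()
    ref = Counter(reference.lower().split())
    matches = 0
    for tok in cand:
        if ref[tok] > 0:      # clipped matching, as in BLEU
            ref[tok] -= 1
            matches += 1
    return matches / len(cand)

reference = "the cat sat on the mat"
paraphrase = "a feline rested upon the rug"   # same meaning, little overlap
scrambled = "the mat sat on the cat"          # different meaning, full overlap

print(unigram_precision(paraphrase, reference))  # low despite same meaning
print(unigram_precision(scrambled, reference))   # 1.0 despite changed meaning
```

This is exactly the blind spot a learned metric like BLEURT is meant to close.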

What’s key to a robust learned metric?
  1. Domain drifts. The ability of the metric to handle data outside the domain the model was trained on

  2. Quality drifts. The best-performing model in 2015 might not make it to the top in the 2019 evaluation. A robust metric should be able to adapt to these distribution drifts

BLEURT metric

Our training data consists of the source document, the sentences generated from the source document, and a human rating indicating how well the generated sentences convey the source document. Our goal is to train a model that can predict this human rating.

BLEURT is trained in three steps:

  1. Vanilla BERT pre-training

  2. Pre-training on synthetic data

  3. Fine-tune on task-specific ratings

Fine-tuning BERT for Quality Evaluation

Here, we use BERT to learn contextualised embeddings of the source document and the generated sentences. We take the representation of the first special [CLS] token and feed it into a linear layer to predict the human rating. We require thousands of examples to train BERT and the linear layer. This simple approach gives us SOTA results on the WMT Metrics shared task. However, it requires a fair amount of data to fine-tune BERT, which isn’t ideal.
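As a sketch of the rating head, the BERT encoder is faked below with random vectors; only the linear-layer-on-[CLS] idea is from the paper. Fitting the head by least squares mirrors the regression loss used during fine-tuning:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake [CLS] embeddings for N (reference, candidate) pairs, plus a
# synthetic "human rating" for each; in BLEURT these vectors come from BERT.
N, HIDDEN = 200, 32
cls_vectors = rng.standard_normal((N, HIDDEN))
true_w = rng.standard_normal(HIDDEN)
ratings = cls_vectors @ true_w + 0.01 * rng.standard_normal(N)

# The rating head is a single linear layer on the [CLS] vector.
X = np.hstack([cls_vectors, np.ones((N, 1))])        # add a bias column
w, *_ = np.linalg.lstsq(X, ratings, rcond=None)

def predict_rating(cls_vec):
    """Predicted human rating for one [CLS] embedding."""
    return float(np.append(cls_vec, 1.0) @ w)
```

In practice the encoder and the head are trained jointly with gradient descent, but the head itself really is this simple.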

Pre-training on Synthetic Data

This is the key component of our BLEURT metric. We generate a large volume of synthetic reference-candidate pairs to expose our BERT model to diverse lexical and semantic signals. We aim to generalise BERT as follows:

  1. Use a large and diverse set of reference sentences

  2. The sentence pairs should include different lexical, syntactic, and semantic dissimilarities

  3. Our pre-training objectives should capture the dissimilarities in step 2 so that BLEURT can learn them effectively

Generating sentence pairs

Here, we generate our synthetic sentence pairs by randomly perturbing 1.8 million segments from Wikipedia. This process involves mask-filling with BERT, backtranslation, and randomly dropping words. Our final dataset has 6.5 million data points.

We randomly mask different positions in the Wikipedia sentences and fill them using the language model, allowing us to introduce lexical variations while keeping the sentence fluent. We utilised two masking strategies: masking at random positions and masking contiguous sequences of tokens. Backtranslation involves translating a sentence into another language and then translating it back into the original language. We use backtranslation to generate paraphrases and perturbations, which gives us different variations of sentences with similar semantics. Lastly, we randomly drop words to create more examples, training BLEURT to recognise sentence truncation and empty or incomplete predictions.
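A minimal sketch of two of the perturbations, word dropping and contiguous-span masking (the sentence, span length, and drop probability are illustrative choices; the real pipeline fills the masks with BERT and adds backtranslation):

```python
import random

def drop_words(tokens, drop_prob=0.15, rng=None):
    """Randomly delete tokens to mimic truncated or incomplete output."""
    rng = rng or random.Random(0)
    kept = [t for t in tokens if rng.random() >= drop_prob]
    return kept or tokens[:1]            # never return an empty sentence

def mask_spans(tokens, span_len=2, rng=None):
    """Replace a contiguous span with [MASK] tokens for BERT to fill in."""
    rng = rng or random.Random(0)
    start = rng.randrange(max(1, len(tokens) - span_len))
    return tokens[:start] + ["[MASK]"] * span_len + tokens[start + span_len:]

sentence = "bleurt learns robust metrics from synthetic sentence pairs".split()
print(drop_words(sentence, drop_prob=0.4))
print(mask_spans(sentence))
```

Each perturbed sentence is then paired with its original to form a synthetic reference-candidate example.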

Pre-training signals

Here, we create a pre-training target vector for each sentence pair; each pre-training task has its own target. We have a total of 9 pre-training tasks, as outlined in the figure below. Firstly, we have three automatic signals computed using BLEU, ROUGE, and BERTscore. Secondly, we have the backtranslation likelihood, which measures semantic similarity: given a perturbation of z, what is the probability that the perturbation is a backtranslation of z? Thirdly, we have textual entailment, where a BERT classifier outputs the probabilities that z entails, contradicts, or is neutral with respect to the perturbation of z. Lastly, we have a backtranslation flag that tells us whether the perturbation was generated using backtranslation or mask-filling.
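Putting these signals together, each pair gets a multi-task target vector along these lines. This is only a sketch: the scoring functions are injected as hypothetical stand-ins for the real BLEU/ROUGE/BERTscore/NLI models, and the field names are mine, not the paper's:

```python
def pretraining_targets(z, z_tilde, from_backtranslation,
                        bleu, rouge, bertscore, bt_likelihood, entail_probs):
    """Assemble the multi-task target vector for one (z, z~) pair.

    The scorers are passed in so this sketch stays independent of any
    particular BLEU/ROUGE/BERTscore/NLI implementation.
    """
    return {
        # regression targets (automatic metrics + semantic similarity)
        "bleu": bleu(z, z_tilde),
        "rouge": rouge(z, z_tilde),
        "bertscore": bertscore(z, z_tilde),
        "backtrans_likelihood": bt_likelihood(z, z_tilde),
        # classification targets
        "entailment": entail_probs(z, z_tilde),  # (entail, contradict, neutral)
        "backtrans_flag": 1.0 if from_backtranslation else 0.0,
    }
```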


Our pre-training tasks are either regression or classification tasks, and we take a weighted sum of the task-level losses. The figure above shows which pre-training tasks are regression and which are classification.
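A sketch of the weighted multi-task loss, assuming squared error for the regression signals and cross-entropy for the classification signals (the task names and weights here are illustrative):

```python
import math

REGRESSION_TASKS = {"bleu", "rouge", "bertscore", "backtrans_likelihood"}

def squared_error(pred, target):
    return (pred - target) ** 2

def cross_entropy(pred_probs, target_probs):
    eps = 1e-12                      # guard against log(0)
    return -sum(t * math.log(p + eps)
                for p, t in zip(pred_probs, target_probs))

def total_loss(preds, targets, weights):
    """Weighted sum of per-task losses over regression and classification."""
    loss = 0.0
    for task, weight in weights.items():
        if task in REGRESSION_TASKS:
            loss += weight * squared_error(preds[task], targets[task])
        else:
            loss += weight * cross_entropy(preds[task], targets[task])
    return loss
```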


Our evaluation datasets are the WMT Metrics shared task and WebNLG. We apply BLEURT to translation and data-to-text tasks and benchmark it against other metrics using the WMT Metrics shared task dataset. We evaluate the robustness of our metric by applying BLEURT to WMT17 and WebNLG to assess how well it handles quality and domain drifts.

WMT Metrics Shared Task

For each year, we evaluate the agreement between automatic metrics and human ratings using Kendall’s Tau and the official WMT metric (either Pearson’s correlation or DARR). The results are displayed in tables 2, 3, and 4 below, showcasing performance across the different years. In 2017 and 2018, BLEURT outperformed the other baseline metrics in most language pairs, and it remained competitive in 2019. We observed that BLEURT outperformed BLEURTbase in most cases, as expected. The key takeaway is that pre-training brings consistent improvements and BLEURT achieves SOTA results for every year of the WMT task.
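For reference, Kendall’s Tau over (metric score, human rating) pairs can be computed directly from its definition (this is the standard formula, not code from the paper):

```python
from itertools import combinations

def kendall_tau(metric_scores, human_scores):
    """Kendall's Tau: (concordant - discordant) / comparable pairs."""
    concordant = discordant = 0
    pairs = list(zip(metric_scores, human_scores))
    for (m1, h1), (m2, h2) in combinations(pairs, 2):
        sign = (m1 - m2) * (h1 - h2)
        if sign > 0:                 # metric and humans agree on the order
            concordant += 1
        elif sign < 0:               # they disagree
            discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0

# A metric that ranks systems exactly like humans do scores 1.0;
# a perfectly inverted metric scores -1.0.
print(kendall_tau([0.1, 0.4, 0.9], [2.0, 3.5, 4.8]))   # 1.0
```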

Robustness to quality drift

Here, we want to show that pre-training improves the robustness of BLEURT to quality drifts. We constructed several challenging datasets where the training data consists of low-rated translations while the test data contains high-rated translations. The proportion is controlled by a skew factor, as illustrated in figure 1 below. As the skew factor increases, the amount of training data decreases. Figure 2 below showcases BLEURT’s performance as the skew factor varies. We observe that a) all metrics degrade as we increase the test skew and b) the training skew heavily affects the performance of BLEURT without pre-training. We expected the first observation: as skew increases and ratings get closer together, it becomes harder to distinguish between good and bad systems. The second observation supports our hypothesis that pre-training makes BLEURT more robust to quality drifts.
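A hypothetical reading of the skew construction: sort examples by rating, train on a low-rated slice, test on a high-rated slice, and shrink both slices as the skew factor grows (the paper’s exact thresholds differ; this is only a sketch):

```python
def skewed_split(records, skew):
    """Train on low-rated translations, test on high-rated ones.

    `skew` in [0, 1): a higher skew keeps smaller, more extreme slices,
    so the train/test rating gap widens as the training set shrinks.
    """
    ordered = sorted(records, key=lambda r: r["rating"])
    slice_size = max(1, int(len(ordered) * 0.5 * (1 - skew)))
    return ordered[:slice_size], ordered[-slice_size:]

ratings = [{"rating": r} for r in range(10)]
train, test = skewed_split(ratings, skew=0.6)
print(len(train), len(test))   # 2 2
```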


Robustness to domain drift

The goal of evaluating BLEURT on WebNLG is to assess its ability to adapt to new tasks with limited training data. We applied BLEURT to three data-to-text tasks, where the goal is essentially to produce descriptions of entities. Each description is evaluated on three aspects: semantics, grammar, and fluency. BLEURT was first pre-trained on synthetic data, then on WMT data, and finally fine-tuned on WebNLG data. The results are displayed below, showcasing the correlation between the different metrics and human judgement. The key takeaway is that, with pre-training, BLEURT is able to quickly adapt to new tasks.

Ablation Study

The figure below showcases the ablation study we performed on the WMT17 dataset, highlighting the relative importance of each pre-training task. The left bar chart compares pre-trained BLEURT to BLEURT without pre-training. The right bar chart compares full BLEURT to BLEURT pre-trained on all tasks except one. Overall, pre-training on high-quality signals like BERTscore, entailment, and backtranslation yields improvements in BLEURT.

Conclusion and Future Work

Future work involves multilingual NLG evaluation, applying the metric to other NLG tasks, and combining human and classifier evaluation into a single method.


