What are T5 and C4?
Transfer learning has been driving progress in NLP since 2017. Google has introduced the Text-To-Text Transfer Transformer (T5), which is pre-trained on the Colossal Clean Crawled Corpus (C4). C4 is a new open-source pre-training dataset created by Google, and the paper “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer” shows that T5 pre-trained on C4 achieves state-of-the-art (SOTA) results on many NLP tasks.
T5 casts every NLP task into a unified text-to-text format: the input is always text and the output is always text. This allows us to use the same model, loss function, and hyperparameters on any NLP task simply by changing the prefix! For example, to use T5 for summarization, you add the prefix “summarize: ” to the source article. To use T5 for translation, you add the prefix “translate English to German: ” to the input text.
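The prefix idea can be sketched in a few lines of Python. This is a minimal illustration, not T5's actual preprocessing code: the helper name to_text_to_text and the task keys are hypothetical, and only the two prefix strings are taken from the text above.

```python
# Hypothetical helper for illustration; only the two prefix strings
# come from the description above.
TASK_PREFIXES = {
    "summarize": "summarize: ",
    "translate_en_de": "translate English to German: ",
}


def to_text_to_text(task: str, text: str) -> str:
    """Prepend the task prefix so that every task is plain text in, text out."""
    return TASK_PREFIXES[task] + text.strip()


# The same function serves both tasks; only the prefix differs.
print(to_text_to_text("translate_en_de", "That is good."))
print(to_text_to_text("summarize", "A long source article goes here."))
```

Because the task is encoded in the input string itself, the model and its loss function do not need any task-specific heads or switches.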
A good pre-training dataset should be a) large, b) high quality, and c) diverse. Many pre-training datasets fail to meet all three criteria: Wikipedia is high quality but uniform in style and relatively small, while Common Crawl is enormous and diverse but fairly low quality. That’s why Google created C4, a cleaned version of Common Crawl that is two orders of magnitude larger than Wikipedia. The cleaning process involves discarding incomplete sentences, removing offensive or noisy content, and deduplication. This cleaning led to better results on downstream tasks, and the dataset’s size and diversity allow bigger models to be trained without overfitting.
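To make the three cleaning steps concrete, here is a simplified sketch. It is not the actual C4 pipeline: the function name clean_pages, the placeholder blocklist, and the choice to deduplicate at the level of single lines are all assumptions made for illustration.

```python
import re

# Assumption: a line counts as a complete sentence if it ends in one of these.
TERMINAL_PUNCTUATION = (".", "!", "?", '"')
# Assumption: placeholder blocklist; a real pipeline would use a curated word list.
BLOCKLIST = {"badword"}


def clean_pages(pages):
    """Apply three simplified heuristics to a list of page strings."""
    seen_lines = set()
    cleaned = []
    for page in pages:
        # 1) Drop the whole page if it contains a blocklisted word.
        words = set(re.findall(r"[a-z']+", page.lower()))
        if words & BLOCKLIST:
            continue
        kept = []
        for line in page.splitlines():
            line = line.strip()
            # 2) Discard incomplete sentences and menu-like noise.
            if not line.endswith(TERMINAL_PUNCTUATION):
                continue
            # 3) Deduplicate: keep only the first occurrence of each line.
            if line in seen_lines:
                continue
            seen_lines.add(line)
            kept.append(line)
        if kept:
            cleaned.append("\n".join(kept))
    return cleaned


pages = [
    "Hello world.\nMenu | Login\nHello world.",
    "Hello world.\nA new sentence!",
    "This page has a badword in it.",
]
print(clean_pages(pages))
```

On the three toy pages, the navigation line and the repeated sentence are dropped, and the third page is removed entirely.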
What are the main findings from systematically reviewing recent ideas for NLP transfer learning?
Encoder-decoder models generally outperformed decoder-only language models
Fill-in-the-blank-style denoising objectives, where the model is trained to recover missing words in the input, worked best, and the most important factor was computational cost
Pre-training on in-domain data can be beneficial, but pre-training on smaller datasets can lead to detrimental overfitting
Multitask learning can be competitive with the pre-train-then-fine-tune approach, but it requires carefully choosing how often the model is trained on each task
Scaling up the model size, the training time, and the number of ensembled models were compared to determine how to make the best use of a fixed compute budget
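The fill-in-the-blank denoising objective mentioned in the findings above can be sketched as follows. This is a simplified illustration under stated assumptions: the function name corrupt_spans and the sentinel names are placeholders, tokens are whitespace-separated words rather than subword pieces, and the spans to mask are passed in explicitly, whereas real pre-training samples them at random.

```python
def corrupt_spans(tokens, spans):
    """Replace each span with a sentinel and build the matching target.

    tokens: list of string tokens.
    spans: sorted, non-overlapping (start, length) pairs to mask out.
    Returns (input_text, target_text).
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, length) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"  # illustrative sentinel naming
        inputs.extend(tokens[cursor:start])  # keep text before the span
        inputs.append(sentinel)  # the blank the model must fill
        targets.append(sentinel)
        targets.extend(tokens[start:start + length])  # the dropped words
        cursor = start + length
    inputs.extend(tokens[cursor:])  # keep text after the last span
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return " ".join(inputs), " ".join(targets)


tokens = "Thank you for inviting me to your party last week.".split()
source, target = corrupt_spans(tokens, [(2, 2), (8, 1)])
print(source)  # Thank you <extra_id_0> me to your party <extra_id_1> week.
print(target)  # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

The model sees the corrupted sentence as input and is trained to produce only the missing words, which keeps the targets short and fits the same text-to-text format as every downstream task.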
The authors applied the best methods from these findings and scaled T5 to 11 billion parameters, achieving SOTA results on the GLUE, SuperGLUE, SQuAD, and CNN/Daily Mail benchmarks. T5 achieved a near-human score on SuperGLUE, which was designed to be difficult for ML models but easy for humans.