What’s the limitation of the popular pre-train-then-fine-tune approach to language models?
It’s good that language models can be fine-tuned directly for specific NLP tasks, since this makes the architecture task-agnostic. However, fine-tuning still needs a large supervised dataset for each task in order to achieve strong performance.
What are the reasons for removing the limitation above?
It’s difficult to acquire a large supervised dataset, especially for new or less popular NLP tasks, so removing this limitation lets us tackle a whole new spectrum of NLP tasks.
Fine-tuning on a large supervised dataset after pre-training can reduce generalisation, so the model may perform poorly on data outside the training distribution.
Humans don’t require large supervised datasets to learn new NLP tasks. Our goal is to have an NLP system that can switch between NLP tasks seamlessly
How can we address this limitation?
One possible method is meta-learning, which refers to a model learning a wide range of skills and pattern recognition abilities during training and then using those abilities at inference time to adapt to the desired task. An example of this would be a language model conditioned on a natural language instruction. T5, for instance, lets you perform many different NLP tasks by simply adding different prefixes to the input data. However, meta-learning doesn’t yet achieve results close to those of the fine-tuning approach.
Another method is increasing the size of language models. Studies have shown that log loss, which correlates well with performance on many downstream tasks, improves smoothly as language models are scaled up. Since in-context learning involves absorbing many skills and pattern recognition abilities into the model’s parameters, it’s possible that in-context learning abilities also improve as language models are scaled up.
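The prefix idea mentioned for T5 can be sketched in a few lines. This is a minimal illustration only: the TASK_PREFIXES table and the add_task_prefix helper are hypothetical names, and the prefix strings follow the text-to-text style but are assumptions about what any particular checkpoint expects.

```python
# Hypothetical prefix table in the text-to-text style. The exact strings a
# given model was trained with are an assumption here.
TASK_PREFIXES = {
    "translation": "translate English to German: ",
    "summarization": "summarize: ",
    "acceptability": "cola sentence: ",
}


def add_task_prefix(task: str, text: str) -> str:
    """Build the model input for `task` by prepending its prefix to `text`."""
    if task not in TASK_PREFIXES:
        raise ValueError(f"unknown task: {task}")
    return TASK_PREFIXES[task] + text


# The same sentence becomes a different task purely through its prefix.
translation_input = add_task_prefix("translation", "That is good.")
summary_input = add_task_prefix("summarization", "That is good.")
```

The point of the sketch is that the model and its weights stay the same across tasks, and only the input string changes.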
What is GPT-3?
GPT-3 is a 175 billion parameter language model that’s used to measure in-context learning abilities. It’s evaluated on over two dozen NLP datasets under 3 conditions: zero-shot (only a natural language task description, with no examples), one-shot (a single example), and few-shot (as many examples as fit in the model’s context window).
The figure below showcases an example of the 3 conditions on a simple NLP task. Model performance improves significantly with a prompt (a natural language task description) and with the number of examples in the model’s context. In addition, few-shot performance improves more steeply as the language model gets larger.
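To make the 3 conditions concrete, here is a minimal sketch of how the prompts differ only in how many examples are placed in the context. The build_prompt helper, the "=>" separator, and the example translation pairs are all illustrative assumptions, not the exact format used in the evaluation.

```python
from typing import List, Tuple


def build_prompt(task_description: str,
                 demonstrations: List[Tuple[str, str]],
                 query: str,
                 num_shots: int) -> str:
    """Assemble a prompt with `num_shots` solved examples before the query.

    num_shots = 0 gives a zero-shot prompt, 1 gives one-shot, and larger
    values give few-shot. No model weights are updated in any of these cases.
    """
    if num_shots < 0 or num_shots > len(demonstrations):
        raise ValueError("num_shots must be between 0 and len(demonstrations)")
    lines = [task_description]
    for source, target in demonstrations[:num_shots]:
        lines.append(f"{source} => {target}")
    # The query is left open so the model completes the answer.
    lines.append(f"{query} =>")
    return "\n".join(lines)


# Illustrative demonstrations (assumed, not taken from a dataset).
DEMOS = [
    ("sea otter", "loutre de mer"),
    ("plush giraffe", "girafe en peluche"),
    ("cat", "chat"),
]

zero_shot = build_prompt("Translate English to French:", DEMOS, "cheese", 0)
one_shot = build_prompt("Translate English to French:", DEMOS, "cheese", 1)
few_shot = build_prompt("Translate English to French:", DEMOS, "cheese", 3)
```

Printing few_shot shows the task description, three solved examples, and the open query on the last line; the model is expected to continue the text after the final "=>".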
What’s so special about GPT-3?
It’s able to achieve strong results in zero-shot and one-shot settings on most NLP tasks, indicating that it is less reliant on large supervised datasets. In the few-shot setting, GPT-3 was able to generate synthetic news articles that human evaluators found difficult to distinguish from human-written ones. The figure below showcases the aggregate results of GPT-3 across 42 benchmarks.