What are the four levels of adapting pre-trained models to task-specific datasets?
Fine-tuning is the most common approach. It uses a large supervised dataset to update the weights of a pre-trained model for the desired task. This usually requires thousands to hundreds of thousands of labelled examples, and it tends to lead to strong performance on many benchmarks. However, the downsides are that a large dataset is difficult to acquire for every new task, that the model may generalise poorly outside the training distribution, and that it may exploit spurious features of the training data, which makes comparisons with human performance unfair.
Few-shot learning is where the model is given a few demonstrations of the task at inference time, but no weights are updated. We feed the pre-trained model K examples of input and output, followed by one final input, and expect the model to produce the final output. K usually ranges from 10 to 100. The advantage of few-shot learning is that it reduces the need for large task-specific datasets and also reduces the risk of learning an overly narrow distribution. The downside is that results from few-shot learning have so far been well below state-of-the-art (SOTA) fine-tuned results.
One-shot learning is the same as few-shot learning except that K is set to one: a single demonstration is given, in addition to the natural language description of the task.
Zero-shot learning is when K = 0, so only the task description is provided. Zero-shot learning offers the most convenience and robustness, as it requires no task-specific data, but it is also the most challenging setting.
The figure below showcases all four settings with examples:
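The three in-context settings differ only in how many demonstrations are placed in the prompt. The following is a minimal sketch of that idea. The `build_prompt` helper and the `Input:`/`Output:` layout are hypothetical choices made for illustration, not the exact format used for GPT-3.

```python
from typing import List, Tuple


def build_prompt(task_description: str,
                 demonstrations: List[Tuple[str, str]],
                 query: str,
                 k: int) -> str:
    """Build a K-shot prompt: description, K solved examples, one open query.

    k = 0 gives zero-shot, k = 1 one-shot, k > 1 few-shot. No weights are
    updated; the examples only condition the model at inference time.
    """
    if k < 0 or k > len(demonstrations):
        raise ValueError("k must be between 0 and the number of demonstrations")
    lines = [task_description]
    for source, target in demonstrations[:k]:
        lines.append(f"Input: {source}\nOutput: {target}")
    # The final example has no output; the model is expected to complete it.
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)


# Illustrative data only (assumed, not taken from a real benchmark).
demos = [("sea otter", "loutre de mer"), ("cheese", "fromage")]
zero_shot = build_prompt("Translate English to French.", demos, "peppermint", k=0)
one_shot = build_prompt("Translate English to French.", demos, "peppermint", k=1)
few_shot = build_prompt("Translate English to French.", demos, "peppermint", k=2)
```

The same function covers all three settings, which mirrors the point that zero-shot, one-shot and few-shot are one procedure with different values of K.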
What’s the training process?
To assess how performance depends on model size, we trained models of 8 different sizes, as shown in the table below.
In terms of training data, we took 3 data processing steps to improve the quality of Common Crawl:
Filter Common Crawl based on similarity to a range of high-quality reference corpora
Perform fuzzy deduplication at the document level to prevent redundancy and to keep our validation set clean and accurate (that is, free of samples that also appear in the training set)
Add high-quality reference corpora such as WebText, Wikipedia, and two internet-based books corpora to augment Common Crawl and increase its diversity. All the datasets used to train GPT-3 are displayed in the table below.
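To make the fuzzy deduplication step concrete, here is a small sketch that drops near-duplicate documents using Jaccard similarity over word 3-grams. This is a simplified stand-in and not the pipeline used for GPT-3: the shingle size, the 0.8 threshold, and the function names are all assumptions, and pairwise comparison would not scale to web-sized data, where approximate schemes such as MinHash are typically used instead.

```python
from typing import List, Set, Tuple


def shingles(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams (shingles) in a document."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: Set[Tuple[str, ...]], b: Set[Tuple[str, ...]]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def fuzzy_dedup(docs: List[str], threshold: float = 0.8) -> List[str]:
    """Keep a document only if it is not a near-duplicate of one already kept.

    Uses O(n^2) pairwise comparison, which is fine for illustration only.
    """
    kept: List[str] = []
    kept_shingles: List[Set[Tuple[str, ...]]] = []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


# Illustrative documents (assumed): the second is a near-copy of the first.
docs = [
    "the quick brown fox jumps over the lazy dog near the river bank",
    "the quick brown fox jumps over the lazy dog near the river bank today",
    "large language models are trained on filtered web text",
]
unique_docs = fuzzy_dedup(docs)
```

Running the same check between the training corpus and a held-out validation set is one way to keep the validation set free of training samples.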
In terms of the training process, we use a larger batch size and a smaller learning rate as models get larger, as previous work has shown this to be effective.
What’s the main concern for pretraining models with a large volume of internet data?
The main concern is that the validation and test sets of downstream tasks may have been seen during pre-training. We therefore need an accurate measure of the level of contamination, as well as methods to reduce it.
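One simple way to estimate contamination is to flag any test example that shares a word n-gram with the training data. The sketch below assumes that approach. The n-gram length of 5, the "at least one shared n-gram" criterion, and the function names are illustrative assumptions, not a description of the exact procedure used for GPT-3.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams; empty if the text has fewer than n words."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contamination_rate(train_docs: Iterable[str],
                       test_examples: List[str],
                       n: int = 5) -> float:
    """Fraction of test examples sharing at least one n-gram with training data."""
    train_ngrams: Set[Tuple[str, ...]] = set()
    for doc in train_docs:
        train_ngrams |= ngrams(doc, n)
    if not test_examples:
        return 0.0
    flagged = sum(1 for ex in test_examples if ngrams(ex, n) & train_ngrams)
    return flagged / len(test_examples)


# Illustrative data (assumed): the first test item overlaps with the training text.
train = ["the capital of france is paris and it lies on the seine"]
test = [
    "question: the capital of france is paris true or false",
    "what is the boiling point of water at sea level",
]
rate = contamination_rate(train, test)
```

Flagged examples can then be removed to produce a "clean" version of the benchmark, and scores on the clean and full versions compared.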
How do we evaluate GPT-3?
Whenever possible, K demonstration examples are drawn from the task's training or validation set, and the model is evaluated on the test set. K can range from 0 to 100. Different NLP tasks call for different ways of feeding in the context. For example, on multiple-choice tasks, which involve selecting the correct answer from several options, we provide K examples of context plus the correct completion, followed by one example of context only. On binary classification tasks, we use “True” or “False” as the correct completion instead of 0 and 1.
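The binary classification setup can be sketched as follows: draw K demonstrations from the training set, render their labels as "True" or "False" rather than 0 and 1, and append one test item with the completion left open. The helper names, the `Answer:` template, and the fixed random seed are assumptions made for this illustration.

```python
import random
from typing import List, Optional, Tuple

# Labels are rendered as words rather than 0 and 1.
LABEL_WORDS = {0: "False", 1: "True"}


def format_example(context: str, label: Optional[int] = None) -> str:
    """Render one example; the completion is left blank when label is None."""
    completion = "" if label is None else " " + LABEL_WORDS[label]
    return f"{context}\nAnswer:{completion}"


def build_eval_prompt(train_set: List[Tuple[str, int]],
                      test_context: str,
                      k: int,
                      seed: int = 0) -> str:
    """Draw K demonstrations from the training set and append one open test item."""
    rng = random.Random(seed)
    demos = rng.sample(train_set, k)  # raises ValueError if k > len(train_set)
    parts = [format_example(ctx, label) for ctx, label in demos]
    parts.append(format_example(test_context))
    return "\n\n".join(parts)


# Illustrative training data (assumed).
train_set = [
    ("Paris is in France.", 1),
    ("Two plus two is five.", 0),
    ("Water is wet.", 1),
]
prompt = build_eval_prompt(train_set, "The moon is made of cheese.", k=2)
```

Because the demonstrations come only from the training set, the test item itself is never shown to the model with its answer.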