What are the limitations of GPT-3?
GPT-3 still has major weaknesses in text synthesis and several NLP tasks. We found that GPT-3-generated text often repeats itself semantically, loses coherence over long passages, contradicts itself, and sometimes contains non sequitur sentences. We also found that GPT-3 lacks common-sense reasoning, despite performing well on datasets designed to assess this domain.
Other limitations are structural and algorithmic. For example, we did not include any bidirectional architectures or other training objectives that previous work has identified as effective for certain NLP tasks, such as fill-in-the-blank tasks. We made this choice because an autoregressive model makes it easy to sample and to compute likelihoods, which suited our exploration of in-context learning. Scaling a bidirectional model to GPT-3's size and effectively combining it with few-shot learning is a promising direction for future research.
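To make the "easy to sample and compute likelihood" point concrete, here is a minimal sketch of why an autoregressive model has this property: the chain rule factorises the sequence probability into left-to-right conditionals, so an exact log-likelihood is just a sum. The bigram table below is purely hypothetical toy data, not anything from GPT-3.

```python
import math

# Hypothetical toy bigram model: P(token | previous token).
# A real autoregressive LM conditions on the full prefix, but the
# chain-rule structure is the same.
bigram_probs = {
    ("<s>", "the"): 0.3,
    ("the", "cat"): 0.2,
    ("cat", "sat"): 0.5,
}

def log_likelihood(tokens):
    """Exact sequence log-likelihood: sum of log P(token | prev),
    scanned left to right. No marginalisation needed, which is what
    makes autoregressive models easy to score and sample from."""
    total = 0.0
    prev = "<s>"
    for tok in tokens:
        total += math.log(bigram_probs[(prev, tok)])
        prev = tok
    return total

print(log_likelihood(["the", "cat", "sat"]))
```

A bidirectional (fill-in-the-blank) model has no such factorisation, which is why sampling and likelihood computation are harder there.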
Scaling language models ever larger may eventually run into the limits of the pretraining objective itself. The current objective weights every token equally, whereas focusing on the tokens that matter most to predict could be more useful. In addition, language models are trained with self-supervised prediction objectives, but a useful language system should ultimately be goal-directed, choosing the best actions rather than just making predictions. Lastly, the current pretraining objective only considers the text modality. Grounding language models in other modalities, such as images or video, would allow them to model the world better.
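The "weights every token equally" point can be sketched in a few lines: the standard language-modelling loss is the mean negative log-likelihood, so each token contributes with weight 1/N no matter how important it is. The per-token probabilities below are made-up illustrative numbers.

```python
import math

def lm_loss(token_log_probs):
    """Standard LM pretraining objective: mean negative log-likelihood.
    Every token is weighted identically (1/N), regardless of whether it
    is a trivial function word or a hard, important content word."""
    n = len(token_log_probs)
    return -sum(token_log_probs) / n

# Hypothetical per-token probabilities: two easy tokens and one hard
# one. The objective averages them uniformly; it has no notion of
# which prediction actually matters.
print(lm_loss([math.log(0.9), math.log(0.9), math.log(0.01)]))
```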
Another limitation concerns sample efficiency during pre-training. Training a good language model still requires an extremely large pre-training dataset. Improving sample efficiency is extremely important, and information from other modalities might help achieve it.
One uncertainty regarding GPT-3 is whether few-shot learning actually learns new tasks from scratch at inference time or simply recognises tasks it already learned during training. The degree of uncertainty differs from task to task, and understanding how few-shot learning works is extremely important.
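For readers unfamiliar with the mechanics, few-shot in-context learning means conditioning the frozen model on a handful of demonstrations in the prompt, with no weight updates. A minimal sketch of assembling such a prompt (the task, separator format, and examples here are illustrative, not GPT-3's exact format):

```python
def build_few_shot_prompt(task_description, examples, query):
    """Assemble a few-shot prompt: the model conditions on K
    demonstrations at inference time and must complete the final
    line. No gradient updates occur; 'learning' happens purely
    through the forward pass over this context."""
    lines = [task_description]
    for source, target in examples:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")  # left open for the model to complete
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Translate English to French:",          # illustrative task
    [("cheese", "fromage"), ("dog", "chien")],  # K = 2 demonstrations
    "cat",
)
print(prompt)
```

The open question in the text is whether completing such a prompt reflects a genuinely new skill acquired from the demonstrations, or retrieval of a mapping already absorbed during pretraining.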
With a large model comes expensive and slow inference. GPT-3's size may make it practically infeasible for certain industrial applications. One potential solution is distillation: compressing GPT-3 into a smaller model for a specific task. Aggressive distillation may be possible, since most of GPT-3's learned knowledge is probably not needed for any single task.
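As a sketch of what distillation optimises, here is the classic soft-target objective (in the style of Hinton et al.'s knowledge distillation): the student is trained to match the teacher's temperature-softened output distribution via a KL divergence. The logits below are hypothetical; a real setup would use framework tensors and combine this with a hard-label loss.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions: zero when the
    student's logits match the teacher's, positive otherwise."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return sum(ti * math.log(ti / si) for ti, si in zip(t, s))

# A perfectly matching student incurs zero loss; a mismatched one
# incurs a positive penalty pulling it toward the teacher.
print(distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1]))
print(distillation_loss([2.0, 1.0, 0.1], [0.1, 1.0, 2.0]))
```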
Lastly, GPT-3 shares limitations common to most deep learning systems. It remains a black box whose rationale behind predictions is not easily interpretable, and it suffers from biases in its training data, which is a major concern in ethical AI.