The quote below is a summary generated by Microsoft T-NLG:

Turing Natural Language Generation (T-NLG) is a 17 billion parameter language model by Microsoft that outperforms the state of the art on many downstream NLP tasks. We present a demo of the model, including its freeform generation, question answering, and summarization capabilities, to academics for feedback and research purposes. <|endoftext|>

Massive language models (LMs) have become very popular because they have improved the state of the art (SOTA) on nearly every downstream NLP task. Microsoft Project Turing introduced Turing Natural Language Generation (T-NLG), a 17 billion parameter LM that outperforms the SOTA on various LM benchmarks.


T-NLG is a Transformer-based generative LM. Generative LMs are important for NLP tasks such as QA and summarisation because they can generate answers and summaries that are accurate and fluent. The key observation here is that:

The bigger the model and the more diverse and comprehensive the pretraining data, the better the LM generalises to multiple downstream tasks. It might therefore be more efficient to train one large, centralised multi-task model and share its capabilities across numerous tasks than to train a new model for every task individually.

T-NLG has 78 Transformer layers with a hidden state of 4256 and 28 attention heads!
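As a rough sanity check on the 17 billion figure, the layer count and hidden size above can be plugged into a common back-of-the-envelope rule for Transformer stacks: about 12 × layers × hidden² weights in the attention and feed-forward blocks. The sketch below assumes a feed-forward width of four times the hidden size and ignores embeddings, biases and layer norms. Neither assumption comes from the T-NLG announcement, and the function names are illustrative.

```python
# Rough parameter estimate for a Transformer stack.
# Assumptions (not from the T-NLG announcement): the feed-forward width is
# 4x the hidden size; embeddings, biases and layer norms are ignored.

def approx_transformer_params(num_layers: int, hidden_size: int) -> int:
    attention = 4 * hidden_size * hidden_size     # Q, K, V and output projections
    feed_forward = 8 * hidden_size * hidden_size  # two hidden x (4 * hidden) matrices
    return num_layers * (attention + feed_forward)


def head_dim(hidden_size: int, num_heads: int) -> int:
    # Each attention head works on an equal slice of the hidden state.
    if hidden_size % num_heads != 0:
        raise ValueError("hidden size must be divisible by the number of heads")
    return hidden_size // num_heads


params = approx_transformer_params(num_layers=78, hidden_size=4256)
print(f"{params / 1e9:.2f}B parameters")   # 16.95B parameters
print(head_dim(4256, 28), "dims per head")  # 152 dims per head
```

The estimate lands just under 17 billion, which is consistent with the headline number once embeddings are added on top.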

Hardware and Software Breakthroughs

Any model with more than 1.3 billion parameters cannot fit on a single GPU (even one with 32GB of memory)! Large LMs therefore need to be parallelised, that is, split into parts across multiple GPUs.
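To see why the limit is so low, here is a hedged estimate of the memory needed just to hold the model state during training. It assumes mixed-precision training with the Adam optimiser, which the ZeRO paper puts at roughly 16 bytes per parameter; activations and temporary buffers come on top of that. The constants and the function name are illustrative, not taken from the T-NLG announcement.

```python
# Memory for model state during mixed-precision Adam training.
# Assumption: ~16 bytes per parameter, following the accounting in the ZeRO paper.
BYTES_FP16_WEIGHTS = 2
BYTES_FP16_GRADS = 2
BYTES_FP32_OPTIMIZER = 12  # fp32 master weights + Adam momentum + Adam variance


def training_state_gb(num_params: float) -> float:
    bytes_per_param = BYTES_FP16_WEIGHTS + BYTES_FP16_GRADS + BYTES_FP32_OPTIMIZER
    return num_params * bytes_per_param / 1e9


for name, n in [("1.3B model", 1.3e9), ("17B model", 17e9)]:
    print(f"{name}: {training_state_gb(n):.1f} GB of model state")
```

Under these assumptions a 1.3 billion parameter model already needs about 20.8 GB before any activations are stored, which leaves little headroom on a 32GB card, while a 17 billion parameter model needs about 272 GB.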

Here are a few hardware and software breakthroughs used to train T-NLG:

  1. Leverage an NVIDIA DGX-2 hardware setup with InfiniBand connections, so that communication between GPUs is faster than previously achieved!

  2. Apply tensor slicing to shard the model across four NVIDIA V100 GPUs

  3. Use DeepSpeed with ZeRO to reduce the model-parallelism degree, increase the batch size per node fourfold, and reduce training time threefold
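The tensor slicing step in the list above can be illustrated with a toy example. The sketch below splits one weight matrix column-wise across four pretend devices, lets each compute its slice of the output, and concatenates the slices. Plain Python lists stand in for GPU tensors and all function names are hypothetical; this is not how Megatron-LM or DeepSpeed are actually implemented, only the underlying idea.

```python
# Toy tensor slicing: shard a weight matrix column-wise across 4 "devices".
# Plain lists stand in for GPU tensors; names are illustrative only.

def matmul(x, w):
    """Multiply a vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def shard_columns(w, num_shards):
    """Split w into num_shards matrices, each holding a contiguous block of columns."""
    cols = len(w[0])
    if cols % num_shards != 0:
        raise ValueError("column count must divide evenly across shards")
    width = cols // num_shards
    return [[row[s * width:(s + 1) * width] for row in w] for s in range(num_shards)]


def sliced_matmul(x, w, num_shards=4):
    """Each shard computes its own output slice; concatenating them stands in for an all-gather."""
    out = []
    for shard in shard_columns(w, num_shards):
        out.extend(matmul(x, shard))
    return out


x = [1, 2]
w = [[1, 2, 3, 4, 5, 6, 7, 8],
     [8, 7, 6, 5, 4, 3, 2, 1]]

# The sharded result matches the unsharded one.
assert sliced_matmul(x, w) == matmul(x, w)
print(sliced_matmul(x, w))  # [17, 16, 15, 14, 13, 12, 11, 10]
```

Each shard only ever holds a quarter of the weights, which is what lets a layer too large for one GPU be spread over four.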

Direct question answering & zero-shot question answering capabilities

Most web search users are used to seeing an answer card at the top of the results page when they ask a question. Most of those answers are a span of text from the original passage, with the answer highlighted. T-NLG goes one step further by answering the question directly with a complete sentence. This is very useful for powering AI assistants!

Zero-shot question answering means the model can answer a question without being given a context passage! Below are two examples. In this situation, the model relies on knowledge gained during pretraining.

Overall, Microsoft found that the larger the pretrained model, the fewer instances it needs to learn downstream tasks. After only a few thousand training instances, the model had already outperformed the LSTM baseline (all else being equal).

Abstractive summarisation with less supervision

Ultimately, we want a model that can write human-like (abstractive) summaries for different types of text documents. This was the goal for T-NLG. One of the main challenges is the lack of supervised training data for all the different types of documents, but this is countered by the fact that T-NLG doesn’t require much supervision thanks to its large pretraining phase. To fine-tune T-NLG to summarise different types of text, Microsoft trained it on nearly all publicly available summarisation datasets! Below are a few output summaries:


