What is the process of training a language model and fine-tuning it for downstream tasks?

  1. Download a large text corpus (Wikipedia articles, news articles, etc.)

  2. Clean and preprocess the data

  3. Create and train the language model using the fast.ai or Hugging Face Python libraries

  4. Once you are done training, you will have two files: the trained weights and the vocabulary

  5. Fine-tune your trained language model on the downstream dataset by training the last layer first, then unfreezing the remaining layers and training the whole language model. By the end of this process, you will have two files: the new weights and the fine-tuned encoder

  6. Put a classifier head on top of this fine-tuned encoder and train it on the downstream task (a fast.ai sketch of the whole workflow follows this list)
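
As a rough illustration of steps 3–6, here is a minimal fast.ai (v2) sketch of this workflow using the IMDB data mentioned later in the post; the epoch counts, learning rates and encoder file name are placeholders, not values from the course.

```python
from fastai.text.all import *

path = untar_data(URLs.IMDB)  # downloads the IMDB reviews dataset

# Steps 3-5: fine-tune a pretrained AWD_LSTM language model on the corpus
dls_lm = TextDataLoaders.from_folder(path, is_lm=True, valid_pct=0.1)
learn_lm = language_model_learner(dls_lm, AWD_LSTM, metrics=[accuracy, Perplexity()])
learn_lm.fit_one_cycle(1, 2e-2)             # train the new last layer first
learn_lm.unfreeze()
learn_lm.fit_one_cycle(3, 2e-3)             # then train the whole language model
learn_lm.save_encoder('finetuned_encoder')  # the fine-tuned encoder from step 5

# Step 6: put a classifier on top of the fine-tuned encoder
dls_clas = TextDataLoaders.from_folder(path, valid='test', text_vocab=dls_lm.vocab)
learn_clas = text_classifier_learner(dls_clas, AWD_LSTM, metrics=accuracy)
learn_clas.load_encoder('finetuned_encoder')
learn_clas.fit_one_cycle(1, 2e-2)           # train the classifier on the downstream task
```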

What is a backwards model?

It is an NLP trick that can improve almost any model. A backwards model is one where you train the language model on REVERSED input. By training on reversed input, the language model learns to predict the previous word instead of the next word. Once you have trained the backward language model, you can ensemble it with the normal (forward) language model for your downstream tasks, which is likely to improve the results.
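
A minimal sketch of the idea in plain Python: reverse each training sequence for the backwards model, then average the two classifiers' predictions at inference time (the probability arrays below are hypothetical outputs, not real model predictions).

```python
import numpy as np

def reverse_tokens(tokens):
    # A backwards LM sees every training sequence reversed, so it learns to
    # predict the previous word instead of the next one.
    return tokens[::-1]

review = ["this", "movie", "was", "surprisingly", "good"]
print(reverse_tokens(review))  # ['good', 'surprisingly', 'was', 'movie', 'this']

# Ensemble the forward and backward classifiers by averaging their
# predicted class probabilities (hypothetical [negative, positive] scores).
fwd_probs = np.array([0.20, 0.80])
bwd_probs = np.array([0.30, 0.70])
ensemble = (fwd_probs + bwd_probs) / 2
print(ensemble)  # [0.25 0.75]
```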

[PRACTICAL] What is the parallel command line tool?

It allows you to run a processing program across multiple processors at once, speeding up your processing time.
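
This is not the parallel tool itself, but as a rough Python analogue of the same idea, multiprocessing.Pool spreads a preprocessing function over all available CPU cores; the clean_text function here is a hypothetical stand-in for whatever cleaning step you run.

```python
from multiprocessing import Pool

def clean_text(doc):
    # Hypothetical preprocessing step; replace with your own cleaning logic.
    return doc.strip().lower()

if __name__ == "__main__":
    docs = ["  First review  ", "  SECOND review  ", "  Third REVIEW  "]
    with Pool() as pool:  # one worker per CPU core by default
        cleaned = pool.map(clean_text, docs)
    print(cleaned)  # ['first review', 'second review', 'third review']
```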

What are SentencePiece and byte pair encoding?

Byte pair encoding (BPE) segments words into subword units. A subword unit is a sequence of characters that appears frequently in a text corpus, so a BPE tokeniser splits the input into the character sequences that are common in your dataset. What is powerful about BPE is that it sits between character and word tokenisation, in the sense that BPE builds the words in your dictionary from the ground up. For example, it might identify that ‘th’ and ‘e’ are frequent sequences of characters in your corpus, so each becomes a token in your vocabulary after the first passes. Once you have those tokens, BPE then notices that ‘th’ and ‘e’ commonly appear next to each other in the word ‘the’, so ‘the’ joins your vocabulary as well. This means your vocabulary ends up containing the most common words as well as the sequences of characters that are common in your corpus.
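
A minimal sketch of the BPE merge procedure on a toy corpus, assuming words are pre-split into characters with their frequencies (the corpus and merge count are made up to mirror the ‘th’ and ‘the’ example above).

```python
import re
from collections import Counter

def pair_counts(vocab):
    # Count adjacent symbol pairs across the corpus, weighted by word frequency.
    counts = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(pair, vocab):
    # Replace every standalone occurrence of the pair with one merged symbol.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words as space-separated characters, with their frequencies.
vocab = {"t h e": 10, "t h i s": 4, "t h a t": 5, "o t h e r": 2}

for _ in range(2):
    counts = pair_counts(vocab)
    best = max(counts, key=counts.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(best, vocab)
# First merge: ('t', 'h') -> 'th'; second merge: ('th', 'e') -> 'the'.
```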

SentencePiece is an unsupervised tokeniser that works similarly to BPE, except that it treats the input as a raw stream of characters (including whitespace) rather than pre-split words, and learns the subword units that best capture the sequences of characters that are most common in the dataset.
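
For concreteness, here is a minimal sketch of training and using a SentencePiece model with the sentencepiece Python package; the corpus file, model prefix and vocabulary size are placeholders, and the printed subword split will depend on your training data.

```python
import sentencepiece as spm

# Train an unsupervised subword model on a raw text file (one sentence per line).
# "corpus.txt", "spm_demo" and vocab_size=8000 are placeholder values.
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="spm_demo", vocab_size=8000, model_type="bpe"
)

# Load the trained model and tokenise a sentence into subword units.
sp = spm.SentencePieceProcessor(model_file="spm_demo.model")
print(sp.encode("The patient was prescribed acetaminophen.", out_type=str))
# Example output (corpus-dependent): rare words like the drug name are broken
# into several subword pieces, while common words stay whole.
```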

When would subword units be most useful?

Subword units are most useful for languages that need a different kind of processing, such as Chinese, which is written without spaces between words, or Turkish, which builds long words by chaining suffixes together. In addition, subword units are also useful for building a vocabulary for a specific domain like healthcare, for example long, unique medicine names.

How does ULMFit make use of semi-supervised learning for IMDB movie reviews?

It uses the large pool of unlabelled movie reviews to fine-tune the language model, and then uses the smaller set of labelled reviews to train the classifier on the downstream task.
