Why do we freeze the weights in all layers except the final layer when fine-tuning on new datasets?

With the fast.ai library, you can load popular models like ResNet34 that are already pre-trained. When you feed in your data, fast.ai automatically freezes the weights in all layers except the final layer and fine-tunes the weights of that last layer on the new dataset. The reason we only fine-tune the final layer to begin with is that the last layer of a deep learning architecture is usually a linear layer that maps the encoded hidden vectors to task-specific labels. This means that when you apply a pre-trained model to a new dataset with different labels, the last layer becomes irrelevant, so fast.ai (under the hood) initialises the last layer of the pre-trained model randomly and fine-tunes it first while keeping the weights of the other, already useful layers frozen (transfer learning).

Once you have fine-tuned and updated the weights of the final layer, you can save the model, unfreeze the weights in the other layers, and train the model again to update the weights in all layers. This should push the performance of the model even higher and should give you your final model.
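The freeze-then-unfreeze workflow can be sketched in plain PyTorch (this is an illustration of the idea, not fast.ai's actual internals; the model here is a made-up stand-in):

```python
# Minimal PyTorch sketch of the freeze -> train head -> unfreeze -> train all
# workflow. The "pre-trained" body is hypothetical; only the mechanics matter.
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 64),   # pre-trained layer
    nn.ReLU(),
    nn.Linear(64, 64),    # pre-trained layer
    nn.ReLU(),
    nn.Linear(64, 10),    # new head, randomly initialised for the new labels
)

def freeze_all_but_head(model):
    # Stop gradients for every layer except the last one
    for layer in list(model.children())[:-1]:
        for p in layer.parameters():
            p.requires_grad = False

def unfreeze(model):
    # Make every parameter trainable again for the second stage
    for p in model.parameters():
        p.requires_grad = True

freeze_all_but_head(model)
# At this point only the head's weight and bias would receive updates
trainable = [p for p in model.parameters() if p.requires_grad]

unfreeze(model)  # second stage: train every layer
```

fast.ai wraps exactly this pattern for you, so you rarely need to toggle `requires_grad` by hand.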

For transfer learning, how would you set a discriminative learning rate?

A discriminative learning rate is when you set different learning rates for different layers, usually lower learning rates for early layers and higher learning rates for later layers.

For a convolutional neural network, we know that the layers at the beginning are good at learning general features, whereas the layers towards the end of the architecture learn task-specific features. When we unfreeze all the layers and are ready to update the weights, we can set the learning rate differently for different layers. Specifically, we want the first layer to have the lowest learning rate and the final layer the highest; the layers in between should have learning rates ranging between the two.

The reason for the different learning rates is that layers at the beginning require very little fine-tuning, since they mainly capture general features. With fast.ai you can set this range of learning rates as follows:

learn.fit_one_cycle(2, max_lr=slice(1e-6, 1e-4))
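Here `slice(1e-6, 1e-4)` spreads the learning rates across the layer groups, from lowest for the earliest group to highest for the last. A small self-contained sketch of that spreading, assuming a multiplicatively even (log-scale) spacing between the two endpoints (the exact spacing fast.ai uses may differ by version):

```python
# Hedged sketch: spread learning rates multiplicatively evenly between
# a low rate (earliest layer group) and a high rate (final layer group).
def discriminative_lrs(lo, hi, n_groups):
    if n_groups == 1:
        return [hi]
    # Constant multiplicative step between consecutive groups
    ratio = (hi / lo) ** (1 / (n_groups - 1))
    return [lo * ratio**i for i in range(n_groups)]

lrs = discriminative_lrs(1e-6, 1e-4, 3)
# First group trains gently, the freshly initialised head trains fastest
```

With three layer groups and `slice(1e-6, 1e-4)`, this gives rates of 1e-6, 1e-5, and 1e-4.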

How to have a more efficient deep learning workflow?

One efficient trick is to use 1% of your data to do 99% of your deep learning work. What this means is that you should use only a small sample of your data while building your deep learning code, so you avoid waiting a long time for every step to finish. Once you are happy with your code, you can increase the training data size.

How is transfer learning in NLP different from CV?

The main idea is that in NLP we fine-tune the language model before fine-tuning on the downstream task, whereas in CV we just fine-tune for the downstream task.

In NLP, we use a language model for transfer learning, and we can fine-tune the language model on our target dataset before fine-tuning it further for a downstream task such as sentiment analysis. This creates a better “transfer” because your language model is now tailored to the domain language of your target dataset. For example, you could download a pre-trained language model that is good at predicting the next word of Wikipedia-style text, but if your target dataset is movie reviews, you can fine-tune your language model so that it becomes good at predicting the next word of movie-review-style text instead.
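The two-stage pipeline above can be sketched as fast.ai-style pseudocode (names like `language_model_learner` and `text_classifier_learner` follow the fastai text API, but exact signatures and arguments vary by version, so treat this as an outline, not a runnable recipe):

```
# Stage 1: fine-tune the pre-trained language model on your domain text
learn_lm = language_model_learner(dls_lm, AWD_LSTM)
learn_lm.fit_one_cycle(1)
learn_lm.save_encoder('ft_enc')        # keep the fine-tuned encoder

# Stage 2: reuse that encoder for the downstream classifier
learn_clas = text_classifier_learner(dls_clas, AWD_LSTM)
learn_clas.load_encoder('ft_enc')      # transfer the domain-adapted weights
learn_clas.fit_one_cycle(1)
```

The key step is saving the encoder after stage 1 and loading it into the classifier, so the downstream task starts from a model that already speaks your domain's language.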

[PRACTICAL] How to dig through code where you are required to chain functions from multiple source files?

You can use VS Code to perform this:

  1. CMD+Shift+P to open the command palette

  2. Search for “#[whatever you want to search for – this could be a function name, etc.]”

  3. Read the function code; if it refers to a function in another file, CMD+click on that function and it will bring you to its definition

  4. Once you are done, you can navigate back to the original function


