Day 200!! Last post on filling the gaps with interview questions. I saved the best for last. I went through Pratick Bhavsar’s NLP interview questions, and below is what I consolidated 🙂

What is perplexity?

Perplexity is a metric for evaluating language models. It measures how well a probability model predicts a sample: it is the exponential of the average negative log-likelihood per token. A good language model assigns high probability to the correct tokens and therefore has a low perplexity score.
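
To make this concrete, here is a minimal Python sketch. It assumes we already have the probability the model assigned to each correct token; the function name and the example probabilities are made up for illustration.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to the actual tokens."""
    n = len(token_probs)
    # Average negative log-likelihood per token
    avg_nll = -sum(math.log(p) for p in token_probs) / n
    # Perplexity is the exponential of that average
    return math.exp(avg_nll)

# Hypothetical per-token probabilities from two models on the same text
confident_model = perplexity([0.9, 0.8, 0.95])
uncertain_model = perplexity([0.2, 0.1, 0.25])
print(confident_model, uncertain_model)
```

The model that puts more probability on the right tokens gets the lower score. A model that spreads its probability uniformly over 4 choices at every step has a perplexity of 4.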

What is the problem with ReLU?
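  1. Exploding gradient. ReLU is unbounded for positive inputs, so activations and gradients can grow very large. This can be mitigated with gradient clipping.

  2. Dying ReLU. When a unit's input is negative, its output and its gradient are both 0, so the unit stops learning. Use parametric ReLU instead. Parametric ReLU is a type of leaky ReLU where the slope for negative inputs is learned during training instead of being fixed.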
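
A small scalar sketch of the second point, in plain Python. The function names and the slope value 0.25 are assumptions for illustration, not taken from any particular library.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # Gradient is 0 for non-positive inputs: nothing flows back, so no learning
    return 1.0 if x > 0 else 0.0

def prelu(x, slope):
    # Same as ReLU for positive inputs, a small linear slope for negative ones
    return x if x > 0 else slope * x

def prelu_grad_wrt_input(x, slope):
    return 1.0 if x > 0 else slope

def prelu_grad_wrt_slope(x):
    # The slope itself receives a gradient, which is why it can be trained
    return 0.0 if x > 0 else x

x = -2.0
print(relu(x), relu_grad(x))                          # ReLU is flat here
print(prelu(x, 0.25), prelu_grad_wrt_input(x, 0.25))  # PReLU still passes a gradient
```

With a negative input, ReLU has zero gradient, whereas parametric ReLU still passes a gradient to both the input and the slope parameter.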

What’s the difference in time complexity between LSTM and Transformer?

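An LSTM layer has a time complexity of length_of_sequence x hidden_state^2, whereas a self-attention layer in the Transformer has a time complexity of length_of_sequence^2 x hidden_state. The Transformer is generally faster than LSTM and CNN because most often the hidden size is larger than the sequence length, and because self-attention processes all positions in parallel whereas an LSTM has to step through the sequence one token at a time.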
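
Plugging numbers into the two formulas shows where the crossover is. This is only a rough sketch: it ignores constant factors and parallelism, and the sizes (hidden size 512, sequence lengths 50 and 4096) are hypothetical.

```python
def recurrent_cost(seq_len, hidden):
    # length_of_sequence x hidden_state^2
    return seq_len * hidden ** 2

def self_attention_cost(seq_len, hidden):
    # length_of_sequence^2 x hidden_state
    return seq_len ** 2 * hidden

hidden = 512
for seq_len in (50, 4096):
    rec = recurrent_cost(seq_len, hidden)
    att = self_attention_cost(seq_len, hidden)
    cheaper = "self-attention" if att < rec else "recurrent"
    print(f"seq_len={seq_len}: recurrent={rec}, self-attention={att}, cheaper={cheaper}")
```

The two costs are equal when the sequence length equals the hidden size. Below that, self-attention is cheaper per layer; above it, the recurrent layer is.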

When is self-attention not faster than recurrent layers?

When the sequence length is greater than the representation dimension. This is rare for sentence-level inputs, but it can happen with long documents.

What is the benefit of learning rate warm-up?

Learning rate warm-up is a learning rate schedule that starts training with a low (or lower) learning rate to avoid divergence caused by unreliable gradients early on. As training becomes more stable, the learning rate is increased to its target value to speed up convergence.
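
One common form is a linear warm-up. Here is a minimal sketch, assuming a target learning rate of 1e-3 and 1000 warm-up steps. Both values and the function name are made up for illustration, and real schedules often add a decay phase afterwards.

```python
def warmup_lr(step, base_lr=1e-3, warmup_steps=1000):
    """Linear warm-up: ramp up to base_lr over warmup_steps, then hold it."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

for step in (0, 250, 500, 999, 5000):
    print(step, warmup_lr(step))
```

The learning rate starts at a small fraction of the target and reaches the full value at the end of the warm-up period.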

What’s the difference between hard and soft parameter sharing in multi-task learning?

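In hard parameter sharing, all tasks share the same hidden layers and each task has its own output layer, so the shared weights are updated using the losses of all the tasks. In soft parameter sharing, each task has its own model with its own parameters, and the distance between the parameters of the models is regularised to keep them similar.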

What’s the difference between BatchNorm and LayerNorm?

BatchNorm computes the mean and variance of each feature across the samples in a minibatch, whereas LayerNorm computes the mean and variance across the features of each sample independently, so it does not depend on the batch size. Batch normalisation allows you to set higher learning rates, which speeds up training, as it reduces the sensitivity to the initial weights.
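
The difference comes down to which axis the statistics are computed over. Below is a plain Python sketch on a tiny made-up batch (rows are samples, columns are features). It leaves out the learnable scale and shift and the running statistics that real implementations use.

```python
def normalize(values, eps=1e-5):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]

def batch_norm(batch):
    # Statistics per feature (column), computed across the samples in the batch
    columns = [normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]

def layer_norm(batch):
    # Statistics per sample (row), computed across that sample's features
    return [normalize(row) for row in batch]

batch = [[1.0, 10.0, 100.0],
         [2.0, 20.0, 200.0],
         [3.0, 30.0, 300.0]]

bn = batch_norm(batch)
ln = layer_norm(batch)
print(bn)
print(ln)
```

After BatchNorm every column has a mean of about zero, and after LayerNorm every row does. LayerNorm gives the same result for a sample no matter what else is in the batch.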

What are the main differences between GPT and BERT?

GPT is unidirectional: it is trained with a left-to-right language modelling objective and doesn’t use a masked language model (MLM). BERT is bidirectional: it is pretrained with MLM as well as a next sentence prediction task.
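
One way to picture "unidirectional" is through the attention mask. The sketch below is only an illustration with hypothetical helper names: 1 means position i (row) may attend to position j (column). It is not how either model is actually implemented.

```python
def causal_mask(n):
    # GPT-style: each position sees itself and earlier positions only
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    # BERT-style: each position sees the whole sequence
    return [[1] * n for _ in range(n)]

for row in causal_mask(4):
    print(row)
for row in bidirectional_mask(4):
    print(row)
```

Because BERT can see both the left and the right context, it cannot be trained to predict the next token. Instead, it hides some input tokens and predicts them, which is the MLM objective.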


