7.4 Training Neural Nets
What’s the purpose of training the neural nets?
The goal is to learn optimal values for the weight matrices and bias vectors such that the output generated by the neural net is as close to the true value as possible.
What is loss function, gradient descent, and error backpropagation?
The distance between the predicted output and the ground-truth output is captured by the loss function. The algorithm for optimising the parameters to minimise the loss function is known as gradient descent. Error backpropagation is how we propagate the gradient backwards through the network, computing the partial derivatives of the loss function w.r.t. the parameters in earlier layers.
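The idea can be sketched in a few lines. This is a minimal, illustrative example (not from the text): a squared-error loss for a single training pair, minimised by repeatedly stepping against its gradient.

```python
# Minimise L(w) = (w*x - y)^2 for one training pair (x, y) with
# gradient descent: repeatedly step in the direction that lowers the loss.

def grad_descent(x, y, w=0.0, lr=0.1, steps=100):
    for _ in range(steps):
        grad = 2 * (w * x - y) * x  # dL/dw via the chain rule
        w -= lr * grad              # gradient-descent update
    return w

w = grad_descent(x=1.0, y=3.0)
# w converges towards 3.0, where the loss (w*1 - 3)^2 is zero
```

The same update rule, applied to every weight and bias with gradients supplied by backpropagation, is what trains a full network.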
What is a computation graph and forward and backward pass?
A computation graph is a representation of the process of computing a mathematical expression, where each node represents a separate operation. See below the computation graph for L(a, b, c) = c(a + 2b). With this graph, we can do both a forward and a backward pass! The forward pass is simple: we feed in the inputs and compute the output node by node.
The backward pass is where we compute all the derivatives we need to update our weights: the derivatives of the output (loss) w.r.t. all the trainable parameters. These derivatives tell us how much a change in each parameter affects the output. The figure below showcases the backward pass.
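Both passes over the graph for L(a, b, c) = c(a + 2b) can be written out directly, one line per node, with the backward pass applying the chain rule at each step:

```python
# Forward pass: compute each node of L(a, b, c) = c * (a + 2b) in order.
def forward(a, b, c):
    d = 2 * b   # node: multiply
    e = a + d   # node: add
    L = c * e   # node: multiply -> output
    return d, e, L

# Backward pass: work from the output back to the inputs via the chain rule.
def backward(a, b, c):
    d, e, L = forward(a, b, c)
    dL_dc = e           # L = c * e, so dL/dc = e
    dL_de = c           # and dL/de = c
    dL_da = dL_de * 1   # e = a + d, so de/da = 1
    dL_dd = dL_de * 1   # de/dd = 1
    dL_db = dL_dd * 2   # d = 2b, so dd/db = 2
    return dL_da, dL_db, dL_dc

# Example: a=3, b=1, c=-2 gives L = -2 * (3 + 2) = -10,
# with gradients dL/da = -2, dL/db = -4, dL/dc = 5.
```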
What is overfitting and how can we prevent it?
Overfitting is where your model is overtrained on your training data, making it less accurate on the test set and future unseen data. We can use different types of regularisation to prevent overfitting. Dropout is one of the common techniques, where you randomly drop some computational units and their connections during training. Hyperparameter tuning can also help prevent overfitting by choosing an appropriate learning rate, mini-batch size, number of hidden layers, number of hidden units, etc.
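A common way to implement dropout is the "inverted" variant sketched below. The implementation details (keeping each unit with probability `p_keep` and rescaling by `1/p_keep`) are an assumption about one standard formulation, not something the notes specify:

```python
import random

# Inverted-dropout sketch (assumed formulation): during training each unit
# is kept with probability p_keep and scaled by 1/p_keep, so the expected
# activation is unchanged and no rescaling is needed at test time.
def dropout(units, p_keep=0.5, training=True, rng=random.random):
    if not training:
        return list(units)  # test time: use all units unchanged
    return [u / p_keep if rng() < p_keep else 0.0 for u in units]
```

At test time the full network is used; during training each forward pass sees a different randomly thinned network, which discourages units from co-adapting.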
7.5 Neural Language Models
What is language modelling?
Predicting upcoming words from prior word context.
What’s the advantage of neural language models over n-gram language models?
Neural language models can handle longer context and can generalise over contexts with similar words. The use of embeddings to represent prior context is a better approach than n-grams, as it allows the language model to generalise to unseen data! However, this comes at the cost of significantly longer training time.
What’s the process of parsing the inputs into a feed-forward neural language model?
The process is broken into 4 steps as shown below:
Given the window of 3 previous words, select the respective embeddings by multiplying each word's one-hot vector with the embedding matrix. This gets us to the projection layer, where we concatenate the three embeddings together
Multiply the concatenated embeddings by the weight matrix (and add the bias vector) and feed the result into a ReLU hidden layer
Multiply the output of the hidden layer by the U matrix
Apply softmax to get the probability distribution over the vocabulary and output the most likely word
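The four steps above can be sketched as a toy forward pass. The vocabulary size, dimensions, and random parameter values are illustrative assumptions; embedding lookup by index stands in for the one-hot multiplication, since the two are equivalent:

```python
import math
import random

random.seed(0)
V, d, N, dh = 5, 3, 3, 4   # vocab size, embedding dim, window size, hidden dim

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

E = rand_matrix(V, d)       # embedding matrix (one row per vocabulary word)
W = rand_matrix(dh, N * d)  # hidden-layer weight matrix
b = [0.0] * dh              # hidden-layer bias vector
U = rand_matrix(V, dh)      # output weight matrix

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def forward(context):  # context: ids of the N previous words
    # 1. Look up each word's embedding (same as one-hot x embedding matrix)
    #    and concatenate them into the projection layer.
    x = [val for w_id in context for val in E[w_id]]
    # 2. Weight matrix plus bias, fed through a ReLU hidden layer.
    h = [max(0.0, z + bias) for z, bias in zip(matvec(W, x), b)]
    # 3. Multiply the hidden layer's output by U.
    z = matvec(U, h)
    # 4. Softmax over the vocabulary (max-shifted for numerical stability).
    exps = [math.exp(zi - max(z)) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = forward([0, 2, 4])  # a distribution over the 5-word vocabulary
```

The predicted word is then `argmax` over `probs`; during training, the probability assigned to the actual next word drives the loss instead.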