Describe BPE.

BPE was first introduced as a data compression technique: the most frequent pair of consecutive bytes is repeatedly replaced with a single byte that doesn’t appear in the data. For example, “aaabdaaabac” → “ZabdZabac”, where Z = aa. Using BPE for subword tokenisation means that frequent subword pairs are iteratively merged. Subword tokenisation lets us encode rare or unseen words at inference time and sits in between character-level and word-level representations.
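To make the merge loop concrete, here is a minimal Python sketch of BPE training on a toy character-level vocabulary. The corpus, the "</w>" end-of-word marker, and the number of merges are illustrative choices, not part of any particular library.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of the chosen symbol pair into a single new symbol."""
    new_vocab = {}
    for word, freq in vocab.items():
        symbols = word.split()
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        new_vocab[" ".join(merged)] = freq
    return new_vocab

# Toy corpus: words split into characters, with an end-of-word marker "</w>".
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for step in range(5):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)   # BPE: greedily pick the most frequent pair
    vocab = merge_pair(best, vocab)
    print(f"merge {step + 1}: {best}")
```

Each merge adds one new symbol to the subword vocabulary; running more merges trades shorter token sequences for a larger vocabulary.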

What’s the difference between BPE and WordPiece?

BPE greedily merges the most frequent symbol pair to build the subword vocabulary, whereas WordPiece uses a language model and chooses the merge that most increases the likelihood of the training data. SentencePiece is a library that lets you train a tokeniser (e.g. BPE or unigram) directly on your own raw text and handles both encoding and decoding.
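As a rough illustration of the SentencePiece workflow, the sketch below trains a BPE model on a plain-text file and round-trips a sentence. The file name, vocabulary size, and model prefix are placeholder values, not recommendations.

```python
import sentencepiece as spm

# Train a BPE tokeniser on a plain-text corpus (one sentence per line).
# "corpus.txt" and the hyperparameters are illustrative placeholders.
spm.SentencePieceTrainer.Train(
    "--input=corpus.txt --model_prefix=bpe --vocab_size=8000 --model_type=bpe"
)

sp = spm.SentencePieceProcessor()
sp.Load("bpe.model")

text = "subword tokenisation handles unseen words"
print(sp.EncodeAsPieces(text))                 # subword pieces
print(sp.EncodeAsIds(text))                    # integer ids for the model
print(sp.DecodeIds(sp.EncodeAsIds(text)))      # lossless round-trip back to text
```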

What’s the difference between Word2Vec, GloVe, and FastText?

Word2Vec has two variants: Skip-gram and CBOW. Word2Vec has a few limitations. One is that training time grows with corpus size, because Word2Vec slides a local context window over every token to capture word co-occurrence. GloVe builds on similar ideas but combines global matrix factorisation with the local context window: factorising the global co-occurrence matrix reduces its rank and size, making training faster and more scalable to large corpora. Neither Word2Vec nor GloVe can represent out-of-vocabulary (OOV) words, which brings us to FastText. FastText represents each word as a bag of character n-grams, so a vector for an unseen word can still be built from its n-grams.
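Here is a small sketch of the FastText idea, using a made-up hash-bucketed n-gram table in place of a trained model, to show how an OOV word can still receive a vector from its character n-grams. The bucket count, dimensionality, and n-gram range are illustrative.

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """FastText-style character n-grams, with '<' and '>' marking word boundaries."""
    token = f"<{word}>"
    return [token[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(token) - n + 1)]

# Hypothetical stand-in for a trained model: a hash-bucketed n-gram embedding table.
rng = np.random.default_rng(0)
num_buckets, dim = 1000, 50
ngram_vectors = rng.normal(size=(num_buckets, dim))

def word_vector(word):
    """Average the vectors of a word's n-grams; works even for words never seen in training."""
    grams = char_ngrams(word)
    idx = [hash(g) % num_buckets for g in grams]   # toy hashing; real FastText uses its own hash
    return ngram_vectors[idx].mean(axis=0)

print(char_ngrams("where", 3, 3))   # ['<wh', 'whe', 'her', 'ere', 're>']
vec = word_vector("tokenisably")    # an OOV word still gets a vector
print(vec.shape)                    # (50,)
```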

Why do we perform normalisation?

Normalisation allows us to decrease a model’s training time because it:

  1. Puts each feature on a comparable scale, so we can assess the contribution of every feature

  2. Reduces internal covariate shift, i.e. the change in the distribution of network activations caused by changes in the parameters

  3. Makes optimisation faster, as it prevents weights from exploding

  4. Also provides a slight regularisation effect

The difference between batch and layer normalisation is that batch normalisation normalises each feature across the examples in a batch, whereas layer normalisation normalises across the feature dimensions within each individual example, as in the sketch below.
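A minimal NumPy sketch of that axis difference, assuming activations of shape (batch, features) and omitting the learned scale and shift parameters:

```python
import numpy as np

# Inference-style normalisation only: activations of shape (batch_size, num_features).
x = np.random.randn(4, 3)
eps = 1e-5

# Batch normalisation: statistics per feature, computed across the batch axis.
batch_norm = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

# Layer normalisation: statistics per example, computed across the feature axis.
layer_norm = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + eps)

print(batch_norm.mean(axis=0))  # ~0 for each feature column
print(layer_norm.mean(axis=1))  # ~0 for each example row
```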
