  • One-hot Representation
    • Sparse, and the dimensionality grows with the vocabulary size
    • Can’t represent relationships among words
  • Distributed Word Representation
    • Addresses the issues of one-hot encoding by mapping words to continuous, low-dimensional vectors
    • Closely related words are close to each other in this vector space
  • Popular techniques: Word2Vec, GloVe, fastText
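
To make the contrast above concrete, here is a minimal sketch in plain Python. The three-word vocabulary and the dense vectors are hand-picked toy values (assumptions for illustration, not the output of Word2Vec, GloVe, or fastText). It shows that any two distinct one-hot vectors have zero cosine similarity, while dense vectors can place related words close together.

```python
import math

def one_hot(word, vocab):
    """Return a one-hot vector whose length equals the vocabulary size."""
    return [1.0 if w == word else 0.0 for w in vocab]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy vocabulary (assumption): a real vocabulary has tens of thousands of entries,
# so each one-hot vector would be that long and almost entirely zeros.
vocab = ["king", "queen", "apple"]

# Hand-picked dense vectors (assumption), standing in for learned embeddings.
dense = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "apple": [0.1, 0.05, 0.95],
}

# Distinct one-hot vectors are always orthogonal: no notion of relatedness.
sim_one_hot = cosine(one_hot("king", vocab), one_hot("queen", vocab))

# Dense vectors can encode that "king" is closer to "queen" than to "apple".
sim_dense_related = cosine(dense["king"], dense["queen"])
sim_dense_unrelated = cosine(dense["king"], dense["apple"])
```

With these toy values, `sim_one_hot` is zero and `sim_dense_related` is larger than `sim_dense_unrelated`, which is the property the distributed representation is meant to provide.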


  • However, distributed word representations still can’t accurately capture all the contextual information. Word vectors in Word2Vec (and similar methods) stay constant regardless of the context
  • Contextualised word embeddings let the embedding of the same word differ depending on the context in which it appears
  • Examples: CoVe, ELMo, GPT, BERT
  • CoVe
    • Train LSTM encoders on a large-scale English-to-German translation dataset
    • In the machine translation (MT) task, the encoder encodes each word in its context, so the encoder output can be treated as context vectors
    • For MRC, these context vectors are concatenated with the GloVe embeddings of the context and question and fed into a deep neural network, which improved performance
  • ELMo
    • Pre-train a bidirectional language model (biLM) on a large text corpus
    • ELMo collapses the outputs of all biLM layers into a single vector using task-specific weights
    • It has been shown that different levels of LSTM states capture different kinds of syntactic and semantic information
  • GPT
    • Semi-supervised approach: unsupervised pre-training followed by supervised fine-tuning
    • The basic component of GPT is a multi-layer Transformer decoder, trained as a language model (unsupervised)
    • It is then fine-tuned on specific downstream tasks (supervised)
  • BERT
    • Bidirectional Encoder Representations from Transformers
    • Using the masked LM and next-sentence prediction tasks, BERT pre-trains contextualised representations with a bidirectional Transformer
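
The ELMo step above (collapsing all biLM layer outputs into one vector with task-specific weights) can be sketched as a softmax-normalised weighted sum scaled by a scalar. The function names, the `gamma` parameter name, and the toy layer vectors below are assumptions for illustration. In a real model the raw weights and the scale are learned with the downstream task rather than set by hand.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def elmo_combine(layer_vectors, raw_weights, gamma=1.0):
    """Collapse the per-layer vectors of one token into a single vector.

    layer_vectors: one vector per biLM layer, all of the same dimension.
    raw_weights:   one unnormalised scalar per layer (task-specific).
    gamma:         task-specific scale applied to the weighted sum.
    """
    s = softmax(raw_weights)
    dim = len(layer_vectors[0])
    return [
        gamma * sum(s[j] * layer_vectors[j][d] for j in range(len(layer_vectors)))
        for d in range(dim)
    ]

# Toy outputs of three layers for a single token (hand-picked values).
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Equal raw weights give a plain average of the layers.
uniform = elmo_combine(layers, [0.0, 0.0, 0.0])

# A task that relies mostly on the top layer would learn a large weight for it.
top_heavy = elmo_combine(layers, [0.0, 0.0, 50.0])
```

Because the weights are task-specific, a syntax-heavy task and a semantics-heavy task can draw on different layers of the same pre-trained biLM.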

Multiple Granularity

  • Character Embeddings
    • Can alleviate the out-of-vocabulary problem and model sub-word morphology
  • Part-of-Speech (PoS) Tags
  • Named-Entity Tags
  • Binary Feature of Exact Match
    • Measures whether a context word appears in the question
    • Later research has extended this to partial matching, which measures the correlation between context words and question words
  • Query-Category
    • The question type (what, where, who, when, why, how) usually gives the model a useful hint about what kind of answer to look for
    • The query-category embeddings are often added to the query word embeddings
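
Two of the features above are simple enough to sketch directly. The snippet below is a minimal illustration, not a reference implementation. Whitespace tokenisation, lowercasing before comparison, and the fallback category `"other"` are assumptions, and the helper names are hypothetical.

```python
QUESTION_TYPES = ("what", "where", "who", "when", "why", "how")

def exact_match_features(context_tokens, question_tokens):
    """Return one binary flag per context token: 1 if it occurs in the question."""
    question_set = {t.lower() for t in question_tokens}
    return [1 if t.lower() in question_set else 0 for t in context_tokens]

def query_category(question_tokens):
    """Return the first question word found, or "other" if there is none."""
    for t in question_tokens:
        if t.lower() in QUESTION_TYPES:
            return t.lower()
    return "other"

context = "Paris is the capital of France".split()
question = "What is the capital of France ?".split()

em_flags = exact_match_features(context, question)
category = query_category(question)
```

Here `em_flags` marks every context token except "Paris" as matched, and `category` is "what". In a model, each flag would typically be appended to the corresponding context word's feature vector, and the category would be looked up in a small embedding table.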

