Topic modelling is an unsupervised problem. With topic modelling, you have to choose in advance how many clusters (or topics) you would like to group the documents into.

Two matrix decomposition techniques are commonly used for topic modelling:

  • Singular Value Decomposition (SVD)
  • Non-negative Matrix Factorisation (NMF)

What does decomposing a matrix mean?

It means we represent a matrix as the product of several (usually smaller or simpler) matrices. Applying SVD in this way to a term-document matrix is sometimes known as Latent Semantic Analysis (LSA).

Data processing

  • Stop words
  • Stemming
  • Lemmatisation
  • Sub-word units
    • For example, byte-pair-encoding (BPE)
  • CountVectorizer
    • Word counts
  • TfidfVectorizer
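As a rough illustration of the last two items, here is a minimal sketch that turns a few documents into a word-count matrix and a TF-IDF matrix with scikit-learn. The three example documents are made up for this sketch, and the built-in English stop-word list is just one possible choice.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, invented purely for illustration
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell as markets closed",
]

# Raw word counts, with scikit-learn's built-in English stop words removed
count_vec = CountVectorizer(stop_words="english")
counts = count_vec.fit_transform(docs)  # sparse matrix: documents x terms

# Recover the vocabulary in column order (works across scikit-learn versions)
vocab = sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get)

# TF-IDF weights: counts rescaled so that terms found in many documents count less
tfidf_vec = TfidfVectorizer(stop_words="english")
tfidf = tfidf_vec.fit_transform(docs)

print(vocab)
print(counts.toarray())
print(np.round(tfidf.toarray(), 2))
```

Either matrix can serve as the input to the decompositions discussed in these notes. Both are non-negative, which matters for NMF.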

spaCy

spaCy ships with its own stop-word list and lemmatiser.

Singular Value Decomposition (SVD)

The SVD algorithm factorises a matrix into one matrix with orthogonal columns and one with orthogonal rows, along with a diagonal matrix of singular values that gives the relative importance of each factor. In topic modelling, this means the topics are expected to be orthogonal to each other. SVD is an exact decomposition: multiplying the three matrices back together recovers the original matrix.
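The properties described above can be checked numerically. The sketch below uses NumPy on a small random matrix, which stands in for a real term-document matrix (its values are made up), and verifies the orthogonality of the two outer factors and the exact reconstruction.

```python
import numpy as np

# Hypothetical 6 x 4 matrix standing in for a term-document matrix
rng = np.random.default_rng(0)
A = rng.random((6, 4))

# Reduced SVD: U is 6 x 4, s holds the 4 singular values, Vt is 4 x 4
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# U has orthonormal columns and Vt has orthonormal rows
cols_orthonormal = np.allclose(U.T @ U, np.eye(U.shape[1]))
rows_orthonormal = np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0]))

# Multiplying the three factors back together recovers A (up to rounding)
A_rebuilt = U @ np.diag(s) @ Vt
exact = np.allclose(A, A_rebuilt)

print(cols_orthonormal, rows_orthonormal, exact)
print(s)  # singular values, returned in descending order
```

The singular values in `s` are the "relative importance" of each factor: the larger the value, the more of the matrix that factor explains.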

SVD is widely used in linear algebra. Some of its applications in data science include:

  • Semantic analysis
  • Collaborative filtering (recommendation systems)
  • Data compression
  • PCA

Truncated SVD is similar to NMF in that we keep only a small number of components: the vectors corresponding to the largest singular values.
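A minimal sketch of truncation, assuming a small random matrix and an arbitrary choice of k = 2 components: keep the first k singular values and their vectors, and the result is a rank-k approximation rather than an exact reconstruction.

```python
import numpy as np

# Hypothetical 8 x 5 matrix; values are made up for illustration
rng = np.random.default_rng(1)
A = rng.random((8, 5))
k = 2  # number of components (topics) to keep, chosen arbitrarily here

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values and their vectors
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The Frobenius-norm error comes entirely from the discarded singular values
err = np.linalg.norm(A - A_k)
expected = np.sqrt(np.sum(s[k:] ** 2))

print(err, expected)
```

For large sparse matrices, scikit-learn's `sklearn.decomposition.TruncatedSVD` computes only the leading components instead of the full decomposition.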

Non-negative Matrix Factorisation (NMF)

NMF involves factorising a non-negative dataset V into non-negative matrices W and H, so that V ≈ WH. NMF is a non-exact factorisation and is non-unique: different initialisations can give different W and H. NMF can be a lot faster than a full SVD because we only compute the number of components (topics) that we are interested in. scikit-learn provides an implementation in sklearn.decomposition.NMF.
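Here is a minimal sketch of NMF topic modelling with scikit-learn. The four documents, the choice of two topics, and the `nndsvd` initialisation are all assumptions made for this example rather than recommendations.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented purely for illustration
docs = [
    "the cat sat on the mat with another cat",
    "dogs and cats make good pets",
    "stock markets fell as investors sold shares",
    "investors bought shares when markets rose",
]

# V: non-negative documents x terms matrix
vectorizer = TfidfVectorizer(stop_words="english")
V = vectorizer.fit_transform(docs)
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# The number of topics must be chosen in advance; 2 is arbitrary here
model = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500)
W = model.fit_transform(V)  # documents x topics
H = model.components_       # topics x terms

# Show the highest-weighted terms for each topic
for topic_idx, row in enumerate(H):
    top_terms = [vocab[i] for i in np.argsort(row)[::-1][:3]]
    print(topic_idx, top_terms)

# The factorisation is approximate, so the error is generally non-zero
print(np.linalg.norm(V.toarray() - W @ H))
```

Each row of W says how strongly a document belongs to each topic, and each row of H says which terms make up a topic.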

Applications of NMF include:

  • Face decompositions
  • Collaborative filtering
  • Audio source separation
  • Bioinformatics and gene expression
  • Topic Modelling
Ryan


Data Scientist
