What are the three popular algorithms for topic modelling?
LDA – Latent Dirichlet Allocation
LSA – Latent Semantic Analysis
NMF – Non-negative Matrix Factorisation
All three algorithms require you to pre-select the number of topics. They all take the document-word matrix as input and output the two matrices below; the idea is that multiplying these two output matrices approximately reconstructs the document-word matrix. From the topic-document matrix, you can read off the proportions of each topic that make up a document.
Word-topic matrix
Topic-document matrix
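A minimal sketch of this shared input/output shape, using scikit-learn's NMF (one of the three algorithms listed) on a toy document-word count matrix; the matrix values and topic count here are illustrative, not from any real corpus.

```python
# Sketch: factor a toy document-word matrix into two smaller matrices
# whose product approximately reconstructs the input (here via NMF).
import numpy as np
from sklearn.decomposition import NMF

# Toy document-word matrix: 4 documents x 6 words (raw counts).
X = np.array([
    [2, 1, 0, 0, 0, 1],
    [1, 2, 0, 0, 1, 0],
    [0, 0, 2, 1, 0, 1],
    [0, 0, 1, 2, 1, 0],
])

n_topics = 2  # the number of topics is pre-selected, as noted above
model = NMF(n_components=n_topics, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(X)   # document-topic matrix: (4 docs, 2 topics)
H = model.components_        # topic-word matrix:     (2 topics, 6 words)

# W @ H approximately reconstructs the original document-word matrix.
print(W.shape, H.shape)      # (4, 2) (2, 6)
```

Each row of `W` shows what portions of the two topics a document is made of, which is exactly the reading described above.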
Describe Latent Semantic Analysis (LSA).
LSA is based on the distributional hypothesis, which states that the meaning of a word can be determined from the contexts in which it appears. Two words are therefore considered similar if they tend to appear in similar contexts.
LSA computes word frequencies in individual documents and across the whole corpus, assuming that similar documents will have similar word distributions. Syntactic information (word order) and semantic information are therefore ignored: each document is treated as just a bag of words (BoW).
The common way to compute these frequencies is TF-IDF, which weights a word by both how frequently it appears in the given document and how frequently it appears across the whole corpus. Each word in each document gets its own TF-IDF score. Once we have computed the TF-IDF scores for all the words, we can build the document-term matrix from those scores. This is shown in the figure below.
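A hand-rolled TF-IDF sketch in plain Python, assuming the textbook `tf * log(N/df)` weighting (real libraries such as scikit-learn use smoothed variants); the tiny corpus is made up for illustration.

```python
# Sketch: TF-IDF combines in-document frequency (tf) with corpus rarity (idf).
import math

docs = [
    ["cat", "sat", "mat"],
    ["dog", "sat", "log"],
    ["cat", "cat", "dog"],
]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)        # how frequent in this document
    df = sum(term in d for d in docs)      # how many documents contain it
    idf = math.log(len(docs) / df)         # rarer across the corpus -> higher
    return tf * idf

# "cat" is frequent in docs[2] but appears in 2 of 3 docs, so its IDF is modest.
print(round(tf_idf("cat", docs[2], docs), 3))  # → 0.27
```

Scoring every (document, word) pair this way fills in the document-term matrix that LSA decomposes.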
This document-term matrix is then decomposed into the product of three matrices using singular value decomposition (SVD): a document-topic matrix, a diagonal matrix of singular values (topic strengths), and a word-topic matrix.
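The SVD step can be sketched with NumPy on a toy TF-IDF-style matrix (the values below are invented to give a roughly two-topic structure):

```python
# Sketch: LSA's SVD step. A factors as U (document-topic) x S (topic
# strengths) x Vt (topic-word); truncating to k topics gives the LSA space.
import numpy as np

# Toy TF-IDF-style document-term matrix: 4 documents x 5 terms.
A = np.array([
    [0.9, 0.8, 0.0, 0.0, 0.1],
    [0.8, 0.9, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.9, 0.8, 0.0],
    [0.1, 0.0, 0.8, 0.9, 0.1],
])

U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Keep the k strongest topics; the rank-k product approximates A.
k = 2
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
print(U[:, :k].shape, Vt[:k, :].shape)   # (4, 2) (2, 5)
```

Rows of the truncated `U` are the documents expressed in topic space, which is what downstream similarity comparisons use.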
Describe Latent Dirichlet Allocation (LDA).
The purpose of LDA is to map each document in our dataset to a set of topics that covers most of the words in the document. It does so by assigning sets of words to different topics. Like LSA, LDA ignores syntactic information and treats each document as a bag of words.
The main difference between LSA and LDA is that LDA assumes the topic and word distributions are Dirichlet distributions. There are two key hyperparameters for LDA: alpha and beta. Alpha controls how many topics are expected per document (lower alpha means fewer topics per document), whereas beta controls how many words are expected per topic (lower beta means fewer words per topic).
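A minimal LDA sketch with scikit-learn, where `doc_topic_prior` corresponds to alpha and `topic_word_prior` to beta as described above; the five-sentence corpus and prior values are illustrative choices, not tuned settings.

```python
# Sketch: LDA on a tiny corpus, with the alpha/beta priors set explicitly.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "cats purr and cats sleep",
    "dogs bark and dogs run",
    "cats and dogs play",
    "the markets rose and stocks rallied",
    "investors sold stocks as markets fell",
]

X = CountVectorizer().fit_transform(docs)  # document-word count matrix (BoW)

lda = LatentDirichletAllocation(
    n_components=2,        # number of topics, chosen in advance
    doc_topic_prior=0.5,   # alpha: lower -> fewer topics per document
    topic_word_prior=0.1,  # beta: lower -> fewer words per topic
    random_state=0,
)
doc_topics = lda.fit_transform(X)  # document-topic proportions
print(doc_topics.shape)            # (5, 2)
```

Each row of `doc_topics` is a distribution over the two topics, so the rows sum to one; lowering alpha pushes each row toward a single dominant topic.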