Topic modelling is an unsupervised problem: you must choose in advance how many clusters (or topics) to group the documents into.

### Two matrix decomposition techniques

• Singular Value Decomposition (SVD)
• Non-negative Matrix Factorisation (NMF)

### What does decomposing a matrix mean?

It means representing a matrix as a product of several matrices. This approach is sometimes known as Latent Semantic Analysis (LSA), which uses SVD.

### Data processing

• Stop words
• Stemming
• Lemmatisation
• Sub-word units
  • For example, byte-pair encoding (BPE)
• CountVectorizer
  • Word counts
• TfidfVectorizer

### spaCy

spaCy has its own list of stop words and its own lemmatiser.
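For example, spaCy's English stop-word list is available even from a blank pipeline (the lemmatiser, by contrast, normally needs a downloaded model such as en_core_web_sm, so only the stop words are shown here):

```python
import spacy

# spacy.blank("en") builds an empty English pipeline;
# the language defaults include the built-in stop-word list
nlp = spacy.blank("en")
stop_words = nlp.Defaults.stop_words

print("the" in stop_words)   # a stop word
print("cat" in stop_words)   # a content word
```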

### Singular Value Decomposition (SVD)

The SVD algorithm factorises a matrix into one matrix with orthogonal columns and one with orthogonal rows, along with a diagonal matrix that contains the relative importance of each factor (the singular values). [insert image]. Topics are expected to be orthogonal to each other. SVD is an exact decomposition!
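A quick numpy sketch of this exactness, on a small made-up matrix standing in for a document-term matrix:

```python
import numpy as np

# A small stand-in for a document-term matrix
A = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0],
              [4.0, 1.0, 0.0],
              [2.0, 2.0, 2.0]])

# U has orthonormal columns, Vt has orthonormal rows,
# s holds the singular values (relative importance of each factor)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# SVD is exact: the product reconstructs A to machine precision
A_rebuilt = U @ np.diag(s) @ Vt
print(np.allclose(A, A_rebuilt))  # True
```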

SVD is widely used in linear algebra. Some of its applications in data science include:

• Semantic analysis
• Collaborative filtering (recommendation systems)
• Data compression
• PCA

Truncated SVD behaves similarly to NMF in that we keep only the vectors corresponding to the largest singular values.
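A minimal sketch with sklearn's TruncatedSVD, using random data in place of a real document-term matrix:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
X = rng.random((20, 10))  # stand-in for a document-term matrix

# Keep only the components for the 3 largest singular values
svd = TruncatedSVD(n_components=3, random_state=0)
X_topics = svd.fit_transform(X)

print(X_topics.shape)  # (20, 3): each document expressed over 3 "topics"
```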

### Non-negative Matrix Factorisation (NMF)

NMF involves factorising a non-negative dataset V into non-negative matrices W and H (V ≈ WH). NMF is a non-exact factorisation and is non-unique! NMF can be much faster than a full SVD because we only calculate the subset of factors we are interested in. You can use sklearn to implement NMF.
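A minimal sklearn sketch, again with a random non-negative matrix standing in for V:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
V = rng.random((20, 10))  # non-negative stand-in for a document-term matrix

# Factorise V ≈ W @ H with 3 topics; both factors stay non-negative
model = NMF(n_components=3, init="random", random_state=0, max_iter=500)
W = model.fit_transform(V)  # document-topic weights
H = model.components_       # topic-term weights

print(W.shape, H.shape)     # (20, 3) (3, 10)
print((W >= 0).all() and (H >= 0).all())  # True: non-negativity holds
```

Note that `W @ H` only approximates V, unlike the exact reconstruction SVD gives.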

Applications of NMF include:

• Face decompositions
• Collaborative filtering
• Audio source separation
• Bioinformatics and gene expression
• Topic Modelling
