Topic modelling is an unsupervised learning problem: you have to choose, in advance, how many clusters (or topics) to group the documents into.

### Two matrix decomposition techniques

- Singular Value Decomposition (SVD)
- Non-negative Matrix Factorisation (NMF)

### What does decomposing a matrix mean?

It means we represent a matrix as a product of multiple matrices. Applied to a term-document matrix, this is known as Latent Semantic Analysis (LSA), which uses SVD.
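As a minimal sketch of the idea (the matrices here are made up for illustration), a 4×3 matrix can be written as the product of a 4×2 and a 2×3 matrix, so every row of the big matrix is a combination of a small number of shared "factor" rows:

```python
import numpy as np

# A 4x3 matrix represented as the product of a 4x2 and a 2x3 matrix.
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [2.0, 0.0]])
H = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

A = W @ H  # shape (4, 3): each row of A is a combination of the rows of H
print(A.shape)
```

In topic modelling, the rows of `H` play the role of topics and the rows of `W` say how much of each topic a document contains.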

### Data processing

- Stop words
- Stemming
- Lemmatisation
- Sub-word units
  - For example, byte-pair encoding (BPE)
- CountVectorizer
  - Word counts
- TfidfVectorizer
  - TF-IDF weights

### spaCy

spaCy has its own list of stop words and its own lemmatiser.
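A minimal sketch of spaCy's stop-word list in action. Note the assumption: a blank English pipeline is enough for tokenisation and the `is_stop` flag, but lemmatisation requires a trained pipeline (e.g. `en_core_web_sm`), which is not used here:

```python
import spacy

# spacy.blank("en") gives a tokenizer with English defaults, including
# the built-in stop-word list (exposed per token as token.is_stop).
nlp = spacy.blank("en")

doc = nlp("the cats are sitting on the mat")
content_words = [t.text for t in doc if not t.is_stop]
print(content_words)
```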

### Singular Value Decomposition (SVD)

The SVD algorithm factorises a matrix into one matrix with orthogonal columns and one with orthogonal rows, along with a diagonal matrix whose entries (the singular values) give the relative importance of each factor. [insert image]. Topics are expected to be orthogonal to each other. SVD is an exact decomposition!
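A minimal sketch with numpy, using a small made-up matrix, showing both the orthogonality of the factors and the exactness of the reconstruction:

```python
import numpy as np

# A small term-document-style matrix (rows: documents, columns: terms).
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 0.0],
              [1.0, 1.0, 1.0],
              [0.0, 0.0, 2.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# U has orthonormal columns, Vt has orthonormal rows, and s holds the
# singular values in decreasing order. The product reconstructs A exactly
# (up to floating-point precision).
A_rebuilt = U @ np.diag(s) @ Vt
print(np.allclose(A, A_rebuilt))
```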

SVD is widely used in linear algebra. Some of its applications in data science include:

- Semantic analysis
- Collaborative filtering (recommendation systems)
- Data compression
- PCA

Truncated SVD behaves similarly to NMF in that we keep only the vectors corresponding to the largest singular values.
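A minimal sketch of truncated SVD as LSA via sklearn's `TruncatedSVD` (the toy documents are invented for illustration):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "machine learning models need data",
    "deep learning is machine learning",
    "the chef cooked pasta and pizza",
    "pasta recipes need fresh ingredients",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Keep only the components for the 2 largest singular values: classic LSA.
svd = TruncatedSVD(n_components=2, random_state=0)
doc_topics = svd.fit_transform(tfidf)  # shape: (4 documents, 2 topics)
print(doc_topics.shape)
```

`TruncatedSVD` works directly on the sparse TF-IDF matrix, which is why it is the usual choice for LSA rather than a dense full SVD.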

### Non-negative Matrix Factorisation (NMF)

NMF involves factorising a non-negative matrix V into non-negative matrices W and H. Unlike SVD, NMF is an inexact factorisation, and the result is not unique! NMF is also a lot faster than a full SVD because we only compute the subset of components we are interested in. You can use sklearn to implement NMF.
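A minimal sketch of NMF for topic modelling with sklearn, reusing a toy TF-IDF matrix (documents invented for illustration):

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "machine learning models need data",
    "deep learning is machine learning",
    "the chef cooked pasta and pizza",
    "pasta recipes need fresh ingredients",
]

V = TfidfVectorizer(stop_words="english").fit_transform(docs)

# V (docs x terms) is approximated by W (docs x topics) @ H (topics x terms),
# with every entry of W and H constrained to be non-negative.
nmf = NMF(n_components=2, init="nndsvd", random_state=0)
W = nmf.fit_transform(V)
H = nmf.components_
print(W.shape, H.shape)
```

The non-negativity makes the factors easy to interpret: each row of H is a topic (a weighting over terms), and each row of W says how strongly a document expresses each topic.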

Applications of NMF include:

- Face decompositions
- Collaborative filtering
- Audio source separation
- Bioinformatics and gene expression
- Topic Modelling