Objectives and Contributions
We applied topic modelling to companies' quarterly earnings call transcripts. These transcripts are relatively unstructured and consist of Q&A sessions. The objective is to discover a set number of topics per transcript, together with the segments of the transcript that correspond to each topic. The task has a few major steps, as shown in the figure below:
Data Gathering & Preprocessing
We scraped 3,800 earnings call transcripts from seekingalpha.com. We preprocessed the transcripts by removing HTML markup, tokenising, removing stop words, and stemming.
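The preprocessing steps can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code used in the project: the stop-word list is a tiny hypothetical sample, crude_stem is a stand-in for a real stemmer such as Porter's, and all function names are assumptions made for the example.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stop-word list (assumption); a real pipeline would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "we", "our", "is", "are", "for", "this"}


class _TextExtractor(HTMLParser):
    """Collects text nodes and discards the surrounding tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(raw_html):
    """Return the text content of an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return " ".join(parser.parts)


def crude_stem(token):
    """Placeholder for a real stemmer: strips a few common English suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(raw_html):
    """HTML removal, lowercasing, tokenisation, stop-word removal, stemming."""
    text = strip_html(raw_html).lower()
    tokens = re.findall(r"[a-z]+", text)  # keep alphabetic tokens only
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]
```

For example, preprocess("<p>We are <b>growing</b> revenues in the quarter.</p>") yields the tokens grow, revenu and quarter.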
Vector space model
We represent each document with a unigram bag-of-words model, that is, by the counts of its individual words, ignoring word order.
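As a small illustration of a unigram bag-of-words representation, the sketch below maps tokenised documents to count vectors over a shared vocabulary. The helper names and the two toy documents are assumptions made for the example, not part of the original pipeline.

```python
from collections import Counter


def build_vocabulary(tokenised_docs):
    """Map each distinct token to a column index (alphabetical order)."""
    vocab = sorted({tok for doc in tokenised_docs for tok in doc})
    return {tok: i for i, tok in enumerate(vocab)}


def to_count_vector(tokens, vocab):
    """Return the per-token counts of one document as a dense list."""
    vec = [0] * len(vocab)
    for tok, n in Counter(tokens).items():
        if tok in vocab:  # tokens outside the vocabulary are ignored
            vec[vocab[tok]] = n
    return vec


# Toy, already-preprocessed documents (hypothetical data).
docs = [["revenu", "grow", "revenu"], ["margin", "grow"]]
vocab = build_vocabulary(docs)
matrix = [to_count_vector(d, vocab) for d in docs]
```

Each row of matrix is one document and each column one vocabulary term, so word order is discarded and only frequencies remain.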
Clustering using LDA
We chose latent Dirichlet allocation (LDA) because, unlike K-means, it lets us represent each document as a mixture of multiple topics. However, LDA initially suffered from very common financial words that are not stop words. We therefore removed words that are too common or too rare across documents: those appearing in more than 50% of documents or in fewer than 2% of documents, respectively.
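The document-frequency filter can be sketched as below. This is an assumed, standard-library-only illustration: the function name is hypothetical, and treating both thresholds as inclusive bounds is a choice made for the example rather than something the write-up specifies.

```python
from collections import Counter


def filter_by_document_frequency(tokenised_docs, min_df=0.02, max_df=0.5):
    """Drop tokens whose share of documents falls outside [min_df, max_df]."""
    n_docs = len(tokenised_docs)
    doc_freq = Counter()
    for doc in tokenised_docs:
        doc_freq.update(set(doc))  # count each token at most once per document
    keep = {tok for tok, df in doc_freq.items() if min_df <= df / n_docs <= max_df}
    return [[tok for tok in doc if tok in keep] for doc in tokenised_docs]
```

With the defaults, a word such as "quarter" that occurs in every transcript is removed, as is a word that occurs in fewer than 2% of transcripts. The filtered token lists would then be vectorised and passed to the LDA step.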
We followed the approach of Tagarelli and Karypis: we split each document into paragraphs and cluster the paragraphs into segments within the document. This allows us to capture companies that are involved in multiple industries. Because call transcripts contain many paragraphs, some of which are “noise”, we removed any paragraph shorter than 100 characters.
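The paragraph split and length filter might look like the following sketch. It assumes paragraphs are separated by blank lines, which may not match the scraped transcripts exactly, and the function and constant names are hypothetical.

```python
import re

MIN_PARAGRAPH_CHARS = 100  # paragraphs shorter than this are treated as noise


def split_into_paragraphs(transcript, min_chars=MIN_PARAGRAPH_CHARS):
    """Split on blank lines (assumed delimiter) and drop short paragraphs."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", transcript)]
    return [p for p in paragraphs if len(p) >= min_chars]
```

Short turns such as "Thank you, operator." are dropped by the filter, while the longer paragraphs are kept as candidate units for the intra-document clustering.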
The intra-document clustering did not work as expected because many common financial terms remained that we had failed to exclude.
Clustering Document Segments
The whole clustering process is shown in the figure below. Once we have multiple clusters within each document, we perform a further clustering with K = 5 and K = 10. We observed no significant changes with this methodology, since most companies operate primarily in a single industry.