Topic Modelling & Sentiment Analysis

The goal here is to a) identify the topics within news articles and b) identify the sentiment of each topic. To achieve this, our approach is as follows:

  1. Create the topic modelling class – TopicModel()
  2. Load and process the data (we only use 10,000 articles; otherwise it takes too long)
  3. Create the dictionary, BoW corpus, and topic model
  4. Topic analysis –> Find the dominant topic for each document AND the topic distribution across our 10K articles
  5. Compute topic model’s coherence score
  6. Use the coherence score to determine the optimal number of topics (30 in our case)
  7. Document-level sentiment analysis
  8. Sentence-level topic modelling and sentiment analysis
  9. Visualisations –> Plot all the topics and respective sentiments within a document AND plot the change in topic sentiment across article datetime
  10. Similarity matrix to measure how similar new documents are to our existing documents; if a new document is too similar, we flag it as duplicate content

The outputs of this notebook are the dictionary, BoW corpus, trained topic model, and similarity matrix, as well as the pipeline to extract topics and sentiments at the document and sentence level!
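For reference, these artefacts are saved to disk as the notebook runs, so they can be reloaded later without re-training; a minimal sketch using the file names from the cells below:

import pickle
from gensim import corpora, models

# Reload the saved artefacts in a later session (no re-training required)
dictionary = corpora.Dictionary.load('dictionary.gensim')
bow_corpus = pickle.load(open('bow_corpus.pkl', 'rb'))
topic_model = models.ldamodel.LdaModel.load('topic_model.gensim')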

Import Dependencies

In [64]:
from nltk.tokenize import sent_tokenize

# Others
import pickle
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
from nltk.stem import WordNetLemmatizer
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Gensim
import gensim
from gensim import corpora, models
from gensim.models import CoherenceModel
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

Topic Modelling Class

In [65]:
class TopicModel():

    def __init__(self):
        self.no_topics = 30
        self.no_below = 15
        self.no_above = 0.5
        self.keep_top_n = 100000

        self.dictionary = None
        self.bow_corpus = None
        self.ldamodel = None

    def create_dictionary(self, processed_data):
        self.dictionary = corpora.Dictionary(processed_data)

        # Preview the first few (token_id, token) pairs as a sanity check
        for count, (k, v) in enumerate(self.dictionary.items()):
            print(k, v)
            if count >= 10:
                break

        self.filter_dictionary(self.no_below, self.no_above, self.keep_top_n)

    def filter_dictionary(self, no_below, no_above, keep_top_n):
        """
        Filter out tokens that appear in:
            less than no_below documents, 
            more than no_above (fraction form of total corpus size),
            and keep the first keep_top_n most frequent tokens
        """
        self.dictionary.filter_extremes(no_below=no_below, no_above=no_above, keep_n=keep_top_n)

        print("Saved dictionary as dictionary.gensim")
        self.dictionary.save('dictionary.gensim')

    def create_bow(self, processed_data):
        self.bow_corpus = [self.dictionary.doc2bow(doc) for doc in processed_data]
        print("Saved corpus as bow_corpus.pkl")
        pickle.dump(self.bow_corpus, open('bow_corpus.pkl', 'wb'))

    def create_lda_model(self):
        self.ldamodel = models.ldamodel.LdaModel(self.bow_corpus, num_topics=self.no_topics, id2word=self.dictionary,
                                                 passes=15)
        print("Saved LDA model")
        self.ldamodel.save('topic_model.gensim')

        return self.ldamodel

    def predict(self, processed_new_doc):
        new_doc_bow = self.dictionary.doc2bow(processed_new_doc)
        
        return self.ldamodel.get_document_topics(new_doc_bow)
        
    def format_topics_sentences(self, ldamodel, corpus, texts):
        # Init output
        sent_topics_df = pd.DataFrame()
    
        # Get main topic in each document
        for i, row in enumerate(ldamodel[corpus]):
            row = sorted(row, key=lambda x: (x[1]), reverse=True)
            # Get the Dominant topic, Perc Contribution and Keywords for each document
            for j, (topic_num, prop_topic) in enumerate(row):
                if j == 0:  # => dominant topic
                    wp = ldamodel.show_topic(topic_num)
                    topic_keywords = ", ".join([word for word, prop in wp])
                    sent_topics_df = sent_topics_df.append(pd.Series([int(topic_num), round(prop_topic,4), topic_keywords]), ignore_index=True)
                else:
                    break
        sent_topics_df.columns = ['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords']
    
        # Add original text to the end of the output
        contents = pd.Series(texts)
        sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
        
        df_dominant_topic = sent_topics_df.reset_index()
        df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']
        
        return df_dominant_topic
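
A side note on format_topics_sentences: it relies on DataFrame.append, which is slow and has been removed in newer pandas releases. If it ever fails, an equivalent (untested) sketch that collects the rows first and builds the frame once, using the same ldamodel and corpus arguments:

# Equivalent to the loop in format_topics_sentences, without DataFrame.append
rows = []
for doc_topics in ldamodel[corpus]:
    topic_num, prop_topic = sorted(doc_topics, key=lambda x: x[1], reverse=True)[0]
    keywords = ", ".join(word for word, _ in ldamodel.show_topic(topic_num))
    rows.append((int(topic_num), round(prop_topic, 4), keywords))
sent_topics_df = pd.DataFrame(rows, columns=['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords'])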

Simple preprocessing functions

In [66]:
def lemmatize(text):
    return WordNetLemmatizer().lemmatize(text, pos='v')


def preprocess(text):
    clean_data = []

    for token in simple_preprocess(text, deacc=True):
        if token not in STOPWORDS and len(token) > 4:
            clean_data.append(lemmatize(token))

    return clean_data
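
As a quick, hypothetical illustration of what preprocess does (drops stopwords and tokens of four characters or fewer, then lemmatises the rest as verbs):

# Example on a made-up sentence (not from the dataset)
preprocess("Markets expected falling prices this quarter")
# roughly ['market', 'expect', 'fall', 'price', 'quarter']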

Loading the data (only 10,000 articles; otherwise it takes too long)

In [67]:
print("Loading data...")
data = pd.read_csv('news_data.csv')
data_text = data[:10000].copy()  # keep only the first 10,000 articles so the run stays manageable

data_text['index'] = data_text.index
documents = data_text

print("Length of documents: %i" % len(documents))
print("Top 5 entries:")
print(documents[:5])
Loading data...
Length of documents: 10000
Top 5 entries:
                                         id  \
0  4f2fec4a4d32d0f564e5da74188b51e5317e4826   
1  4a9f98e22b7a76a6db40aef6accb24b1688eecb3   
2  941e5abdc725739fed17da543d6c449d342b8506   
3  038916903c446a14b1096e673c86b74bfbf5cbcc   
4  4fe331b8aedfe7386bfa11c5768629ba068f6bcc   

                                               title  \
0  EMERGING MARKETS-Mexican peso seesaws over dol...   
1  Migrants must visit Nazi concentration camps, ...   
2     Euro zone businesses start 2018 on decade high   
3  Russia's Lavrov says 'unilateral actions' by U...   
4  Lawmakers to Justice Department: Keep online g...   

                                                 url  \
0  https://www.reuters.com/article/emerging-marke...   
1  https://www.reuters.com/article/us-germany-ant...   
2  https://www.reuters.com/video/2018/01/24/euro-...   
3  https://www.reuters.com/article/us-mideast-cri...   
4  https://www.cnbc.com/2018/01/12/the-associated...   

                                              social  \
0  {'gplus': {'shares': 0}, 'pinterest': {'shares...   
1  {'gplus': {'shares': 0}, 'pinterest': {'shares...   
2  {'gplus': {'shares': 0}, 'pinterest': {'shares...   
3  {'gplus': {'shares': 0}, 'pinterest': {'shares...   
4  {'gplus': {'shares': 0}, 'pinterest': {'shares...   

                                                text  \
0  (Updates prices, adds Trump comments) By Rodri...   
1  BERLIN (Reuters) - New migrants to Germany mus...   
2  Euro zone businesses start 2018 on decade high...   
3  MOSCOW (Reuters) - “Unilateral actions” by the...   
4  ATLANTIC CITY, N.J. (AP) — Federal lawmakers w...   

                                            entities  spam_score  \
0  {'persons': [{'name': 'argen', 'sentiment': 'n...       0.000   
1  {'persons': [{'name': 'josef schuster', 'senti...       0.005   
2  {'persons': [{'name': 'david pollard', 'sentim...       0.000   
3  {'persons': [{'name': 'lavrov', 'sentiment': '...       0.000   
4  {'persons': [{'name': 'cory booker', 'sentimen...       0.000   

                       published  index  
0  2018-01-26T01:01:00.000+02:00      0  
1  2018-01-10T21:52:00.000+02:00      1  
2  2018-01-24T19:14:00.000+02:00      2  
3  2018-01-21T20:31:00.000+02:00      3  
4  2018-01-12T16:55:00.000+02:00      4

Apply preprocessing to our 10K articles

In [68]:
print("Processing data for LDA...")
processed_docs = data_text['text'].map(preprocess)
print("Done")

print("\n\nOriginal document: ")
print(data_text['text'][0])

print("\n\nTokenised and lemmatised document: ")
print(processed_docs[0])
Processing data for LDA...
Done

Original document: 
(Updates prices, adds Trump comments) By Rodrigo Campos NEW YORK, Jan 25 (Reuters) - Mexico's peso seesawed against the dollar on Thursday as U.S. officials sent mixed signals on the greenback, while Argentina's Merval stock index broke the 35,000-point mark for the first time. Several emerging currencies hit multi-year highs against the greenback, with the dollar index languishing at more than three-year lows after U.S. Treasury Secretary Steven Mnuchin departed from traditional U.S. currency policy, saying "obviously a weaker dollar is good for us." The Mexican peso appreciated by more than 1 percent to 18.3025 earlier in the day before U.S. President Donald Trump said Mnuchin had been misinterpreted and that he ultimately wanted the dollar to be strong. Trump's comments helped the dollar to pare losses against major currencies, and the Mexican peso reversed its gains, closing down almost 0.6 percent against the greenback. Elsewhere, Colombia's peso added to Wednesday's 1.48 percent gain against the dollar to reach its strongest level since July 2015, while the Chilean peso closed under 600 per dollar for the first time since May 2015. Brazilian markets were closed for the Sao Paulo anniversary holiday but are expected to soon extend a rally that boosted the benchmark Bovespa stock index to an all-time high above 83,000 points on Wednesday. That advance came after an appeals court upheld a corruption conviction of former President Luiz Inacio Lula da Silva. Although the conviction could derail his plans to run again for the presidency, Lula, who is leading opinion polls for the October election, said on Thursday he would appeal the decision. Brazilian and Argentine shares have led a Latin American equities rally to start the year that has MSCI's gauge of the region's stocks set for its largest January gains since 2006. The Merval closed up 0.55 percent at 35,141.72 points. Key Latin American stock indexes and currencies at 2145 GMT: Stock indexes Latest Daily YTD pct pct change change MSCI Emerging Markets 1,263.45 0.37 9.06 MSCI LatAm 3,201.95 1.01 13.22 Mexico IPC 50,777.90 0.27 3.09 Chile IPSA 5,811.54 0.23 4.44 Chile IGPA 29,216.54 0.23 4.42 Argentina Merval 35,141.72 0.55 16.88 Colombia IGBC 12,307.20 -0.06 8.24 Currencies Latest Daily YTD pct pct change change Brazil real 3.1470 0.35 5.28 Mexico peso 18.6100 -0.57 5.85 Chile peso 598.60 0.7 2.68 Colombia peso 2,790 0.84 6.88 Peru sol 3.210 0.09 0.84 Argentina peso (interbank) 19.56 0.38 -4.91 Argentina peso (parallel) 19.91 0.35 -3.42 (Reporting by Rodrigo Campos; Editing by Bernadette Baum and Chris Reese)


Tokenised and lemmatised document: 
['update', 'price', 'trump', 'comment', 'rodrigo', 'campos', 'reuters', 'mexico', 'seesaw', 'dollar', 'thursday', 'officials', 'mix', 'signal', 'greenback', 'argentina', 'merval', 'stock', 'index', 'break', 'point', 'emerge', 'currencies', 'multi', 'highs', 'greenback', 'dollar', 'index', 'languish', 'treasury', 'secretary', 'steven', 'mnuchin', 'depart', 'traditional', 'currency', 'policy', 'say', 'obviously', 'weaker', 'dollar', 'mexican', 'appreciate', 'percent', 'earlier', 'president', 'donald', 'trump', 'mnuchin', 'misinterpret', 'ultimately', 'want', 'dollar', 'strong', 'trump', 'comment', 'help', 'dollar', 'losses', 'major', 'currencies', 'mexican', 'reverse', 'gain', 'close', 'percent', 'greenback', 'colombia', 'add', 'wednesday', 'percent', 'dollar', 'reach', 'strongest', 'level', 'chilean', 'close', 'dollar', 'brazilian', 'market', 'close', 'paulo', 'anniversary', 'holiday', 'expect', 'extend', 'rally', 'boost', 'benchmark', 'bovespa', 'stock', 'index', 'point', 'wednesday', 'advance', 'appeal', 'court', 'uphold', 'corruption', 'conviction', 'president', 'inacio', 'silva', 'conviction', 'derail', 'plan', 'presidency', 'lead', 'opinion', 'poll', 'october', 'election', 'thursday', 'appeal', 'decision', 'brazilian', 'argentine', 'share', 'latin', 'american', 'equities', 'rally', 'start', 'gauge', 'region', 'stock', 'largest', 'january', 'gain', 'merval', 'close', 'percent', 'point', 'latin', 'american', 'stock', 'index', 'currencies', 'stock', 'index', 'latest', 'daily', 'change', 'change', 'emerge', 'market', 'latam', 'mexico', 'chile', 'chile', 'argentina', 'merval', 'colombia', 'currencies', 'latest', 'daily', 'change', 'change', 'brazil', 'mexico', 'chile', 'colombia', 'argentina', 'interbank', 'argentina', 'parallel', 'report', 'rodrigo', 'campos', 'edit', 'bernadette', 'chris', 'reese']

Create the dictionary and BoW corpus, and build the topic model

In [69]:
lda = TopicModel()
lda.create_dictionary(processed_docs)
lda.create_bow(processed_docs)
ldaModel = lda.create_lda_model()
0 add
1 advance
2 american
3 anniversary
4 appeal
5 appreciate
6 argentina
7 argentine
8 benchmark
9 bernadette
10 boost
Saved dictionary as dictionary.gensim
Saved corpus as bow_corpus.pkl
Saved LDA model
In [70]:
corpus = pickle.load(open('bow_corpus.pkl', 'rb'))
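
Before diving into the topic analysis, it is worth eyeballing a few of the learned topics; an optional sanity check using gensim's print_topics:

# Show the top words of a few learned topics (optional sanity check)
for topic_id, topic in ldaModel.print_topics(num_topics=5, num_words=8):
    print(topic_id, topic)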

Topic Analysis

  • Dominant Topic per document
  • Topic contribution within our corpus (a quick bar-chart sketch follows the table below)
In [71]:
dominant_topic_breakdown = lda.format_topics_sentences(ldamodel=ldaModel, corpus=corpus, texts=data_text['text'])

# Count how many documents each topic dominates and its share of the corpus
topic_counts = (pd.DataFrame(dominant_topic_breakdown['Dominant_Topic'].value_counts())
                .reset_index()
                .rename(columns={'index': 'Dominant_Topic', 'Dominant_Topic': 'Topic_Counts'}))
topic_counts['Topic_Contribution'] = topic_counts['Topic_Counts'].apply(lambda x: round(x / topic_counts['Topic_Counts'].sum(), 4))
topic_num_keywords = dominant_topic_breakdown[['Dominant_Topic', 'Keywords']].drop_duplicates(subset='Dominant_Topic', keep='first')

dominant_topics = pd.merge(topic_counts, topic_num_keywords, on='Dominant_Topic')
dominant_topics = dominant_topics[['Dominant_Topic', 'Keywords', 'Topic_Counts', 'Topic_Contribution']]
In [72]:
dominant_topic_breakdown.head()
Out[72]:
Document_No Dominant_Topic Topic_Perc_Contrib Keywords Text
0 0 6.0 0.5652 percent, market, stock, price, growth, share, … (Updates prices, adds Trump comments) By Rodri…
1 1 14.0 0.4605 party, government, election, germany, minister… BERLIN (Reuters) – New migrants to Germany mus…
2 2 6.0 0.5076 percent, market, stock, price, growth, share, … Euro zone businesses start 2018 on decade high…
3 3 18.0 0.3735 north, korea, south, police, korean, state, sa… MOSCOW (Reuters) – “Unilateral actions” by the…
4 4 10.0 0.3566 media, apple, company, facebook, social, conte… ATLANTIC CITY, N.J. (AP) — Federal lawmakers w…
In [75]:
dominant_topic_breakdown['title'] = data_text['title']
In [77]:
dominant_topics
Out[77]:
Dominant_Topic Keywords Topic_Counts Topic_Contribution
0 7.0 company, source, million, coverage, eikon, sha… 1360 0.1360
1 6.0 percent, market, stock, price, growth, share, … 1236 0.1236
2 16.0 company, service, business, market, president,… 699 0.0699
3 29.0 trump, president, house, donald, state, white,… 542 0.0542
4 18.0 north, korea, south, police, korean, state, sa… 508 0.0508
5 9.0 match, january, round, australian, second, upd… 418 0.0418
6 14.0 party, government, election, germany, minister… 397 0.0397
7 21.0 china, energy, chinese, price, production, mil… 390 0.0390
8 10.0 media, apple, company, facebook, social, conte… 366 0.0366
9 19.0 world, think, davos, go, women, economic, foru… 334 0.0334
10 3.0 security, company, service, financial, review,… 323 0.0323
11 23.0 store, sales, company, brand, drive, market, r… 317 0.0317
12 1.0 state, military, force, government, turkey, at… 306 0.0306
13 17.0 people, money, years, company, business, schoo… 298 0.0298
14 11.0 statements, forward, look, company, result, in… 291 0.0291
15 26.0 quarter, million, income, share, operate, reve… 272 0.0272
16 2.0 conference, company, release, february, financ… 262 0.0262
17 24.0 britain, london, british, brexit, european, po… 224 0.0224
18 5.0 investment, capital, company, share, trust, sh… 183 0.0183
19 22.0 study, people, health, percent, children, univ… 165 0.0165
20 15.0 health, medical, healthcare, patients, clinica… 151 0.0151
21 28.0 court, claim, file, rule, legal, state, federa… 146 0.0146
22 8.0 trade, bitcoin, exchange, currency, dollar, cr… 141 0.0141
23 12.0 offer, share, securities, stock, note, prospec… 139 0.0139
24 27.0 billion, group, bank, finance, insurance, euro… 128 0.0128
25 20.0 point, game, score, second, sport, season, car… 110 0.0110
26 25.0 million, fund, raise, save, billion, base, ven… 90 0.0090
27 4.0 million, income, total, december, loan, quarte… 87 0.0087
28 0.0 canada, trade, mexico, unite, american, state,… 63 0.0063
29 13.0 israel, boeing, airbus, israeli, aircraft, jer… 54 0.0054
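
As a quick visual companion to the table above, the contribution column can be plotted directly (a small optional sketch; the full visualisations come later in the notebook):

# Optional: bar chart of how much each dominant topic contributes to the 10K articles
ax = dominant_topics.set_index('Dominant_Topic')['Topic_Contribution'].plot(kind='bar')
ax.set_ylabel('Fraction of documents')
ax.set_title('Dominant topic contribution across the corpus')
plt.show()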

Topic Model’s Coherence Score

To find the optimal number of topics k, we aim for the lowest perplexity or the highest coherence score. We can find k by building several LDA models with different numbers of topics and comparing their coherence scores.

In [126]:
# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=ldaModel, texts=processed_docs, dictionary=lda.dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
Coherence Score:  0.5708177278694123
In [128]:
coherence_values = []
model_list = []
for num_topics in range(10, 50, 10):
    print("Training topic model with %s topics" % num_topics)
    model = models.ldamodel.LdaModel(corpus = corpus, num_topics=num_topics, id2word=lda.dictionary, passes=15)
    model_list.append(model)
    coherencemodel = CoherenceModel(model=model, texts=processed_docs, dictionary=lda.dictionary, coherence='c_v')
    coherence_values.append(coherencemodel.get_coherence())
Training topic model with 10 topics
Training topic model with 20 topics
Training topic model with 30 topics
Training topic model with 40 topics
In [129]:
start, limit, step = 10, 50, 10
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(["coherence_values"], loc='best')  # labels must be passed as a sequence, not a bare string
plt.show()
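
Reading the best value of k off the plot can also be done programmatically; a small sketch using the model_list and coherence_values built above:

# Pick the number of topics with the highest coherence score
best_idx = int(np.argmax(coherence_values))
best_num_topics = x[best_idx]
best_model = model_list[best_idx]
print("Best number of topics: %s (coherence %.4f)" % (best_num_topics, coherence_values[best_idx]))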

Inference on new document

In [78]:
# Predict the topic distribution of a document
# new_doc = 'The britain government is under lots of pressure'  # e.g. an unseen headline
new_doc = dominant_topic_breakdown['Text'][0]  # here we reuse the first article's text
new_doc = preprocess(new_doc)
lda.predict(new_doc)
Out[78]:
[(0, 0.0894398),
 (6, 0.5651851),
 (7, 0.070502624),
 (8, 0.07264299),
 (9, 0.021440797),
 (14, 0.05765931),
 (18, 0.034007233),
 (28, 0.01271173),
 (29, 0.063332476)]
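
The prediction is a list of (topic_id, probability) pairs. To make the result readable, the ids can be mapped back to their keywords, for example:

# Show the keywords behind each predicted topic, sorted by probability
for topic_id, prob in sorted(lda.predict(new_doc), key=lambda t: t[1], reverse=True):
    words = ", ".join(word for word, _ in ldaModel.show_topic(topic_id, topn=5))
    print("Topic %d (%.3f): %s" % (topic_id, prob, words))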

Document-level sentiment analysis

In [79]:
analyzer = SentimentIntensityAnalyzer()
In [80]:
def sentiment_score(text):
    score = analyzer.polarity_scores(text)
    return score['pos'] - score['neg']
In [285]:
def sentiment_score_compound(text):
    score = analyzer.polarity_scores(text)
    return score['compound']
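
The two helpers differ slightly: sentiment_score returns the positive minus the negative proportion, while sentiment_score_compound returns VADER's normalised compound score in [-1, 1]. A quick illustrative comparison on a made-up headline (exact values depend on the VADER lexicon):

example = "Company reports record profits and strong growth"
print(sentiment_score(example))           # positive minus negative proportion
print(sentiment_score_compound(example))  # normalised compound score in [-1, 1]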
In [81]:
dominant_topic_breakdown['sentiment'] = dominant_topic_breakdown['Text'].map(sentiment_score)
In [82]:
dominant_topic_breakdown.head()
Out[82]:
Document_No Dominant_Topic Topic_Perc_Contrib Keywords Text title sentiment
0 0 6.0 0.5652 percent, market, stock, price, growth, share, … (Updates prices, adds Trump comments) By Rodri… EMERGING MARKETS-Mexican peso seesaws over dol… 0.032
1 1 14.0 0.4605 party, government, election, germany, minister… BERLIN (Reuters) – New migrants to Germany mus… Migrants must visit Nazi concentration camps, … 0.020
2 2 6.0 0.5076 percent, market, stock, price, growth, share, … Euro zone businesses start 2018 on decade high… Euro zone businesses start 2018 on decade high 0.000
3 3 18.0 0.3735 north, korea, south, police, korean, state, sa… MOSCOW (Reuters) – “Unilateral actions” by the… Russia’s Lavrov says ‘unilateral actions’ by U… -0.018
4 4 10.0 0.3566 media, apple, company, facebook, social, conte… ATLANTIC CITY, N.J. (AP) — Federal lawmakers w… Lawmakers to Justice Department: Keep online g… 0.070
In [134]:
ax = dominant_topic_breakdown.groupby(['Dominant_Topic'])['sentiment'].mean().plot(kind="bar")
ax.set_ylabel('Average Sentiment')
ax.set_title('Average Sentiment per Topic')
Out[134]:
Text(0.5, 1.0, 'Average Sentiment per Topic')
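
Step 8 of the plan (sentence-level topic modelling and sentiment analysis) combines the pieces built so far; a minimal sketch for a single article, using the sent_tokenize import from the top of the notebook:

# Sketch: per-sentence dominant topic and sentiment for one article
article = dominant_topic_breakdown['Text'][0]
for sentence in sent_tokenize(article):
    topics = lda.predict(preprocess(sentence))      # (topic_id, probability) pairs
    sentiment = sentiment_score(sentence)           # positive minus negative proportion
    print(round(sentiment, 3), sorted(topics, key=lambda t: t[1], reverse=True)[:1], sentence[:60])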