Topic Modelling & Sentiment Analysis

The goal here is to (a) identify the topics within news articles and (b) measure the sentiment attached to each topic. The approach is as follows:

  1. Sentence-level topic modelling and sentiment analysis
  2. Visualisations: plot all the topics and their respective sentiments within a document, and plot the change in topic sentiment across article publication dates
  3. A similarity matrix to measure how similar new documents are to our existing documents. If a new document is too similar to an existing one, it is flagged as duplicate content

The outputs of this notebook are the dictionary, the bag-of-words (BoW) corpus, the trained topic model and the similarity matrix, as well as the pipeline that extracts topics and sentiments at document and sentence level.

Sentence-level topic modelling and sentiment analysis

In [277]:
def sentence_level_topic_modelling(doc_id, doc_name, doc_text):
    # Sentence tokenisation: one (id, title, sentence) tuple per sentence
    sentences = sent_tokenize(doc_text)
    tuple_array = [(doc_id, doc_name, sentence) for sentence in sentences]
    
    # Initialise multiindex dataframe
    index = pd.MultiIndex.from_tuples(tuple_array, names = ['id', 'title', 'sentences'])
    df_news_sentences = pd.DataFrame(index = index)
    
    # Sentiment Analysis: positive minus negative score per sentence.
    # Values are collected in lists and assigned as whole columns, because
    # chained assignment (df[col][i] = value) is unreliable in pandas.
    sentiments = []
    for sentence in sentences:
        score = analyzer.polarity_scores(sentence)
        sentiments.append(score['pos'] - score['neg'])
    df_news_sentences['Sentiment'] = sentiments
    
    # Topic Modelling: keep the most probable topic and its probability
    topics, confidence_levels = [], []
    for sentence in sentences:
        topic_assignment = lda.predict(preprocess(sentence))
        highest_prob_topic = max(topic_assignment, key = lambda item: item[1])
        topics.append(highest_prob_topic[0])
        confidence_levels.append(str(round(highest_prob_topic[1] * 100, 2)) + '%')
    df_news_sentences['Topics'] = topics
    df_news_sentences['Topics_Confidence_Level'] = confidence_levels

    return df_news_sentences
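
The two per-sentence computations in the function above can be isolated into a small, dependency-free sketch. It assumes that analyzer.polarity_scores returns a dictionary with 'pos' and 'neg' keys and that lda.predict returns a list of (topic_id, probability) pairs, as the indexing in the function suggests. summarise_sentence and the example inputs are hypothetical and not part of the notebook.

```python
def summarise_sentence(score, topic_assignment):
    """Return (sentiment, topic_id, confidence_label) for one sentence.

    score: dict with 'pos' and 'neg' entries (assumed shape of the
           polarity_scores output).
    topic_assignment: list of (topic_id, probability) pairs (assumed shape
           of the lda.predict output).
    """
    # Net sentiment: positive share minus negative share
    sentiment = score['pos'] - score['neg']

    # Keep only the most probable topic and report its probability as a percentage
    topic_id, probability = max(topic_assignment, key=lambda item: item[1])
    confidence_label = str(round(probability * 100, 2)) + '%'

    return sentiment, topic_id, confidence_label


# Made-up inputs, for illustration only
example_score = {'pos': 0.25, 'neg': 0.5, 'neu': 0.25}
example_topics = [(6, 0.625), (8, 0.25), (29, 0.125)]

result = summarise_sentence(example_score, example_topics)
print(result)  # (-0.25, 6, '62.5%')
```

Because only the top topic is kept, a sentence whose probability mass is spread over several topics still receives a single label, which is why the confidence level is stored next to it.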

News split by sentences

In [88]:
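def split_sentence(df, text_column):
    # Build one (title, sentence) tuple per sentence of every article.
    # Iterating over the columns directly avoids relying on the index labels being 0..n-1.
    tuple_array = []
    for title, text in zip(df['title'], df[text_column]):
        for sentence in sent_tokenize(text):
            tuple_array.append((title, sentence))
    
    return tuple_array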
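
To make the flattening step concrete without the notebook's data, here is a minimal sketch that turns (title, text) pairs into (title, sentence) tuples. naive_sent_tokenize is a hypothetical, deliberately crude regex stand-in for sent_tokenize, and the article texts are made up.

```python
import re


def naive_sent_tokenize(text):
    """Crude stand-in for a real sentence tokenizer.

    Splits after '.', '!' or '?' when followed by whitespace. It will
    wrongly split abbreviations such as "U.S. officials", which is one
    reason to use a proper tokenizer on real news text.
    """
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [part for part in parts if part]


def split_articles(articles):
    """Flatten (title, text) pairs into one (title, sentence) tuple per sentence."""
    pairs = []
    for title, text in articles:
        for sentence in naive_sent_tokenize(text):
            pairs.append((title, sentence))
    return pairs


# Made-up articles, for illustration only
articles = [
    ('Article A', 'Stocks rose today. Bonds fell!'),
    ('Article B', 'Is the rally over? Analysts disagree.'),
]
pairs = split_articles(articles)
for pair in pairs:
    print(pair)
```

Each tuple becomes one row of the multiindex dataframe, so the number of rows equals the total number of sentences across all articles.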
In [201]:
c = split_sentence(dominant_topic_breakdown[:10000], 'Text')
In [202]:
index = pd.MultiIndex.from_tuples(c, names = ['Title', 'Sentences'])
df_news_sentences = pd.DataFrame(index = index)
df_news_sentences['Sentiment'] = "na"
df_news_sentences['Topics'] = "na"
df_news_sentences['Topics_Confidence_Level'] = "na"
In [203]:
df_news_sentences
Out[203]:
Sentiment Topics Topics_Confidence_Level
Title Sentences
EMERGING MARKETS-Mexican peso seesaws over dollar; Argentina stocks hit record (Updates prices, adds Trump comments) By Rodrigo Campos NEW YORK, Jan 25 (Reuters) – Mexico’s peso seesawed against the dollar on Thursday as U.S. officials sent mixed signals on the greenback, while Argentina’s Merval stock index broke the 35,000-point mark for the first time. na na na
Several emerging currencies hit multi-year highs against the greenback, with the dollar index languishing at more than three-year lows after U.S. Treasury Secretary Steven Mnuchin departed from traditional U.S. currency policy, saying “obviously a weaker dollar is good for us.” na na na
The Mexican peso appreciated by more than 1 percent to 18.3025 earlier in the day before U.S. President Donald Trump said Mnuchin had been misinterpreted and that he ultimately wanted the dollar to be strong. na na na
Trump’s comments helped the dollar to pare losses against major currencies, and the Mexican peso reversed its gains, closing down almost 0.6 percent against the greenback. na na na
Elsewhere, Colombia’s peso added to Wednesday’s 1.48 percent gain against the dollar to reach its strongest level since July 2015, while the Chilean peso closed under 600 per dollar for the first time since May 2015. na na na
Electronic Arts to outperform in 2018, investors ‘overreacted’ to gamer outcry: Analyst BMO Capital Markets upgraded EA’s stock to outperform Monday, arguing that investors overreacted to gamer outcry over the company’s attempts to monetize in-game content in its new “Star Wars Battlefront II” game. na na na
“Consumer pushback to EA’s Star Wars in-game monetization strategy has undermined some investor confidence and has driven the stock lower,” wrote BMO analyst Gerrick Johnson. na na na
“After further consideration, we believe the reaction may have been overdone, providing a buying opportunity for what is, otherwise, a solid long-term story.” na na na
The gaming community inundated social media and Reddit in November with thousands of critical posts saying that EA was unfairly compelling consumers to spend more money through in-game transactions to unlock new characters and other content. na na na
Following the fierce criticism, the video game company na na na

129319 rows × 3 columns

In [204]:
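# Sentiment per sentence: positive minus negative score, assigned as a whole column
sentiments = []
for sentence in df_news_sentences.index.get_level_values('Sentences'):
    score = analyzer.polarity_scores(sentence)
    sentiments.append(score['pos'] - score['neg'])
df_news_sentences['Sentiment'] = sentiments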
In [205]:
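# Most probable topic per sentence and its probability, assigned as whole columns
topics, confidence_levels = [], []
for sentence in df_news_sentences.index.get_level_values('Sentences'):
    topic_assignment = lda.predict(preprocess(sentence))
    highest_prob_topic = max(topic_assignment, key = lambda item: item[1])
    topics.append(highest_prob_topic[0])
    confidence_levels.append(str(round(highest_prob_topic[1] * 100, 2)) + '%')
df_news_sentences['Topics'] = topics
df_news_sentences['Topics_Confidence_Level'] = confidence_levels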
In [206]:
df_news_sentences
Out[206]:
Sentiment Topics Topics_Confidence_Level
Title Sentences
EMERGING MARKETS-Mexican peso seesaws over dollar; Argentina stocks hit record (Updates prices, adds Trump comments) By Rodrigo Campos NEW YORK, Jan 25 (Reuters) – Mexico’s peso seesawed against the dollar on Thursday as U.S. officials sent mixed signals on the greenback, while Argentina’s Merval stock index broke the 35,000-point mark for the first time. -0.061 6 66.13%
Several emerging currencies hit multi-year highs against the greenback, with the dollar index languishing at more than three-year lows after U.S. Treasury Secretary Steven Mnuchin departed from traditional U.S. currency policy, saying “obviously a weaker dollar is good for us.” -0.006 8 43.94%
The Mexican peso appreciated by more than 1 percent to 18.3025 earlier in the day before U.S. President Donald Trump said Mnuchin had been misinterpreted and that he ultimately wanted the dollar to be strong. 0.105 29 33.31%
Trump’s comments helped the dollar to pare losses against major currencies, and the Mexican peso reversed its gains, closing down almost 0.6 percent against the greenback. -0.011 6 72.79%
Elsewhere, Colombia’s peso added to Wednesday’s 1.48 percent gain against the dollar to reach its strongest level since July 2015, while the Chilean peso closed under 600 per dollar for the first time since May 2015. 0.183 6 73.33%
Electronic Arts to outperform in 2018, investors ‘overreacted’ to gamer outcry: Analyst BMO Capital Markets upgraded EA’s stock to outperform Monday, arguing that investors overreacted to gamer outcry over the company’s attempts to monetize in-game content in its new “Star Wars Battlefront II” game. -0.354 6 47.1%
“Consumer pushback to EA’s Star Wars in-game monetization strategy has undermined some investor confidence and has driven the stock lower,” wrote BMO analyst Gerrick Johnson. -0.154 6 68.0%
“After further consideration, we believe the reaction may have been overdone, providing a buying opportunity for what is, otherwise, a solid long-term story.” 0.173 17 43.49%
The gaming community inundated social media and Reddit in November with thousands of critical posts saying that EA was unfairly compelling consumers to spend more money through in-game transactions to unlock new characters and other content. -0.01 10 61.61%
Following the fierce criticism, the video game company -0.293 10 60.26%

129319 rows × 3 columns

Visualisations

Show topic and sentiment breakdown of two articles

In [223]:
overall_data = []
for i in df_news_sentences.index.get_level_values('Title').unique()[:100]:
    tmp_df = df_news_sentences[df_news_sentences.index.get_level_values('Title') == i]
    
    data = []
    for name, group in tmp_df.groupby('Topics'):
        data.append((i, name, group['Sentiment'].mean()))
    overall_data.append(data)
    
for data in overall_data[:2]:
    df = pd.DataFrame(data)
    
    df.plot.bar(figsize = (18, 6), x = 1, y = 2, rot = 0, legend=False)
    # data[0][0] is this article's title; the loop variable i from above is stale here
    plt.title(data[0][0])
    plt.xlabel('Topics')
    plt.ylabel('Average Sentiment')
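
The aggregation behind these bar charts is a mean of sentence sentiments per topic within each article. A minimal pure-Python sketch of that grouping, with made-up rows and a hypothetical helper name, might look like this:

```python
from collections import defaultdict


def average_sentiment_by_topic(rows):
    """Average sentence sentiment per topic within each article.

    rows: iterable of (title, topic_id, sentiment) tuples.
    Returns {title: {topic_id: mean sentiment}}.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for title, topic_id, sentiment in rows:
        sums[title][topic_id] += sentiment
        counts[title][topic_id] += 1

    return {
        title: {
            topic_id: sums[title][topic_id] / counts[title][topic_id]
            for topic_id in topic_sums
        }
        for title, topic_sums in sums.items()
    }


# Made-up sentence-level rows, for illustration only
rows = [
    ('Article A', 6, 0.5),
    ('Article A', 6, -0.25),
    ('Article A', 8, 0.125),
    ('Article B', 6, 0.0),
]
averages = average_sentiment_by_topic(rows)
print(averages)
# {'Article A': {6: 0.125, 8: 0.125}, 'Article B': {6: 0.0}}
```

A topic backed by a single sentence gets that sentence's score as its average, so bars for rarely assigned topics are noisier than bars for topics covering many sentences.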
In [224]:
all_dataframes = []
for i in range(len(overall_data)):
    index = pd.MultiIndex.from_tuples(overall_data[i], names = ['Filename', 'Topics', 'Average Sentiment'])
    all_dataframes.append(pd.DataFrame(index = index))
final_results = pd.concat(all_dataframes)
final_results.reset_index(level=['Average Sentiment'], inplace = True)
final_results['Title'] = "na"
final_results['Article date'] = "na"
In [225]:
final_results
Out[225]:
Average Sentiment Title Article date
Filename Topics
EMERGING MARKETS-Mexican peso seesaws over dollar; Argentina stocks hit record 6 0.062143 na na
8 -0.006000 na na
14 -0.034000 na na
29 0.105000 na na
Migrants must visit Nazi concentration camps, Germany’s Jewish council says 6 0.088000 na na
Delta Air to tighten onboard emotional support animal requirements 15 0.193000 na na
18 0.307000 na na
22 0.215000 na na
JP Morgan upgrades Kohl’s to overweight 6 0.058000 na na
BRIEF-Osisko Announces Record 2017 Gold Equivalent Ounces 7 0.000000 na na

441 rows × 3 columns

In [226]:
# Copy each article's title out of the index and look up its publication date.
# This assumes the articles appear in the same order as the rows of data_text.
titles = final_results.index.get_level_values('Filename')
final_results['Title'] = list(titles)

article_dates = []
article_idx = 0
prev_filename = titles[0]
for filename in titles:
    # Move to the next row of data_text whenever the title changes
    if filename != prev_filename:
        article_idx += 1
    article_dates.append(data_text['published'][article_idx])
    prev_filename = filename
final_results['Article date'] = article_dates
In [227]:
final_results
Out[227]:
Average Sentiment Title Article date
Filename Topics
EMERGING MARKETS-Mexican peso seesaws over dollar; Argentina stocks hit record 6 0.062143 EMERGING MARKETS-Mexican peso seesaws over dol… 2018-01-26T01:01:00.000+02:00
8 -0.006000 EMERGING MARKETS-Mexican peso seesaws over dol… 2018-01-26T01:01:00.000+02:00
14 -0.034000 EMERGING MARKETS-Mexican peso seesaws over dol… 2018-01-26T01:01:00.000+02:00
29 0.105000 EMERGING MARKETS-Mexican peso seesaws over dol… 2018-01-26T01:01:00.000+02:00
Migrants must visit Nazi concentration camps, Germany’s Jewish council says 6 0.088000 Migrants must visit Nazi concentration camps, … 2018-01-10T21:52:00.000+02:00
Delta Air to tighten onboard emotional support animal requirements 15 0.193000 Delta Air to tighten onboard emotional support… 2018-01-19T21:00:00.000+02:00
18 0.307000 Delta Air to tighten onboard emotional support… 2018-01-19T21:00:00.000+02:00
22 0.215000 Delta Air to tighten onboard emotional support… 2018-01-19T21:00:00.000+02:00
JP Morgan upgrades Kohl’s to overweight 6 0.058000 JP Morgan upgrades Kohl’s to overweight 2018-01-12T19:28:00.000+02:00
BRIEF-Osisko Announces Record 2017 Gold Equivalent Ounces 7 0.000000 BRIEF-Osisko Announces Record 2017 Gold Equiva… 2018-01-18T16:57:00.000+02:00

441 rows × 3 columns

In [231]:
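# Select the numeric column explicitly: averaging the text column 'Title' raises an error in recent pandas versions
ax = final_results.groupby(['Article date'])['Average Sentiment'].mean().plot(figsize = (18, 10))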
ax.tick_params(axis='x', rotation=45)
ax.set_title("Average sentiment per article date")
Out[231]:
Text(0.5, 1.0, 'Average sentiment per article date')
In [280]:
ax = final_results.groupby(['Article date', 'Topics'])['Average Sentiment'].mean().unstack().plot(figsize = (18, 10), marker = 'o')
ax.tick_params(axis='x', rotation=90)
ax.legend(loc='center left',bbox_to_anchor=(1.0, 0.5))
ax.set_title("Change in topic sentiment per article date")
Out[280]:
Text(0.5, 1.0, 'Change in topic sentiment per article date')
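
The plots above group on the full timestamp string, so every distinct publication time becomes its own point on the x-axis. One possible refinement, sketched here in plain Python under the assumption that the timestamps keep the ISO 8601 form shown in the table (for example 2018-01-26T01:01:00.000+02:00), is to bucket by calendar day before averaging. The helper name and the sentiment values are made up.

```python
from collections import defaultdict
from datetime import datetime


def sentiment_by_day_and_topic(records):
    """Average sentiment per (calendar day, topic).

    records: iterable of (iso_timestamp, topic_id, avg_sentiment) tuples.
    Returns {date: {topic_id: mean sentiment}} with dates in chronological order.
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for timestamp, topic_id, sentiment in records:
        # Keep only the calendar date (in the timestamp's own UTC offset)
        day = datetime.fromisoformat(timestamp).date()
        buckets[day][topic_id].append(sentiment)

    return {
        day: {
            topic_id: sum(values) / len(values)
            for topic_id, values in sorted(buckets[day].items())
        }
        for day in sorted(buckets)
    }


# Timestamps follow the format in the table above; sentiment values are made up
records = [
    ('2018-01-26T01:01:00.000+02:00', 6, 0.5),
    ('2018-01-26T09:30:00.000+02:00', 6, 0.25),
    ('2018-01-26T01:01:00.000+02:00', 8, -0.125),
    ('2018-01-10T21:52:00.000+02:00', 6, 0.125),
]
daily = sentiment_by_day_and_topic(records)
for day, topics in daily.items():
    print(day.isoformat(), topics)
```

Each inner dictionary corresponds to one row of the unstacked table that feeds the line plot, with one column per topic.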

Similarity check

In [259]:
from gensim import similarities

lda_index = similarities.MatrixSimilarity(ldaModel[corpus])
lda_index.save("simIndex.index")
In [260]:
test_doc = dominant_topic_breakdown['Text'][0]
In [261]:
def get_similar_doc(new_doc):
    processed_doc = preprocess(new_doc)
    processed_doc_bow = lda.dictionary.doc2bow(processed_doc)
    
    # Query the index with the new document's own bag-of-words,
    # then rank existing documents by descending similarity
    sims = lda_index[ldaModel[processed_doc_bow]]
    sims = sorted(enumerate(sims), key=lambda item: -item[1])
    
    return sims
In [262]:
def is_duplicate_content(similarities, threshold):
    # similarities is sorted in descending order, so the first entry holds the highest similarity
    return similarities[0][1] > threshold
In [263]:
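# Use a distinct variable name so the gensim `similarities` module imported above is not overwritten
similar_docs = get_similar_doc(dominant_topic_breakdown['Text'][0])
In [264]:
is_duplicate_content(similar_docs, 1)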
Out[264]:
True
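
gensim's MatrixSimilarity scores documents by cosine similarity, so the duplicate check boils down to comparing topic-distribution vectors. The following dependency-free sketch illustrates the idea with made-up three-topic vectors. The helper names and the 0.98 threshold are assumptions for illustration, not values taken from the notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(new_vec, existing_vecs):
    """Return (index, similarity) pairs sorted by descending similarity."""
    scored = [(i, cosine_similarity(new_vec, vec)) for i, vec in enumerate(existing_vecs)]
    return sorted(scored, key=lambda item: -item[1])


def is_duplicate(ranked, threshold=0.98):
    """Flag a duplicate when the best match reaches the (assumed) threshold."""
    return bool(ranked) and ranked[0][1] >= threshold


# Made-up topic distributions over three topics
existing = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]

ranked_same = rank_by_similarity([0.7, 0.2, 0.1], existing)   # copy of document 0
ranked_other = rank_by_similarity([0.0, 1.0, 0.0], existing)  # mostly a different topic

print(ranked_same[0][0], is_duplicate(ranked_same))
print(ranked_other[0][0], is_duplicate(ranked_other))
```

A document compared with itself scores approximately 1.0, so a strict "greater than 1" test depends on floating-point rounding. A threshold slightly below 1 with a "greater than or equal" comparison is likely to behave more predictably.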
Ryan


Data Scientist
