The next step in the process is text processing: analysing the support messages to work out how to clean them. The text processing function below transforms our support messages as follows:

  1. Remove rows with a missing message body from our dataset.

  2. Remove Spotify handles, URLs, punctuation, numbers, and any special symbols. We will also lowercase all the text.

  3. Remove any duplicated messages (these may be spam generated by bots).

  4. Remove stopwords.

  5. Lemmatise each token.

  6. Remove messages that are too short (fewer than 5 tokens) or too long (25 tokens or more).

  7. Join all the tokens into a single string.

Note: Firstly, lemmatisation is our biggest speed cost, as it requires POS-tagging each word before lemmatising it. Secondly, we flag messages that are too short or too long as anomalies using the 25th and 95th percentiles of token counts.
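To make step 2 concrete, here is a minimal sketch of the regex clean-up applied to a made-up support message (the message text below is invented for illustration):

```python
import re

# A hypothetical raw support tweet
msg = "@SpotifyCares My playlist won't load!!! See https://example.com/help 123"

# Strip the handle, the URL, and all non-letter characters, then lowercase
cleaned = ' '.join(
    re.sub(r"(@[A-Za-z0-9]+)|(\w+:\/*\S+)|([^a-zA-Z\s])", " ", msg).split()
).lower()

print(cleaned)  # "my playlist won t load see"
```

One side effect worth noting: contractions such as "won't" lose their apostrophe and split into two tokens, which stopword removal later has to cope with.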

Importing dependencies and setting up environment

import re
import pandas as pd

import nltk
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer
from nltk.stem import WordNetLemmatizer

# Read datafile & load text processing tools (stopwords, tokeniser, and lemmatiser)
# Download the NLTK resources used below (a one-off step)
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')  # used for POS tagging during lemmatisation
sorted_data = pd.read_csv('sorted_data.csv')
english_stopwords = stopwords.words('english')
tokeniser = TweetTokenizer()
lemmatizer = WordNetLemmatizer()

Text processing function

# Text processing
def text_processing(dataframe, text_column):
    # Remove rows with a missing message_body (a total of 14 rows)
    # Work on a copy so the caller's dataframe is left untouched
    dataframe = dataframe.dropna(subset = [text_column], axis = 0).copy()
    cleaned_text_column = text_column + '_processed'
    # Remove Spotify handles, URLs, punctuation, numbers, and special symbols. Lowercase the text.
    dataframe[cleaned_text_column] = dataframe[text_column].apply(lambda x : ' '.join(re.sub(r"(@[A-Za-z0-9]+)|(\w+:\/*\S+)|([^a-zA-Z\s])", " ", x).split()).lower())

    # Drop duplicated processed messages (likely bot-generated spam)
    dataframe.drop_duplicates(subset = [cleaned_text_column], keep = 'first', inplace = True)

    # Remove stopwords
    dataframe[cleaned_text_column] = dataframe[cleaned_text_column].apply(lambda x : [word for word in tokeniser.tokenize(x) if word not in english_stopwords])

    # Lemmatisation - the main speed overhead, as every token is POS-tagged first
    def get_wordnet_pos(word):
        tag = nltk.pos_tag([word])[0][1][0].upper()
        tag_dict = {"J": wordnet.ADJ,
                    "N": wordnet.NOUN,
                    "V": wordnet.VERB,
                    "R": wordnet.ADV}
        return tag_dict.get(tag, wordnet.NOUN)

    dataframe[cleaned_text_column] = dataframe[cleaned_text_column].apply(lambda x : [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in x]) 

    # Only keep messages with at least 5 tokens and fewer than 25 (thresholds hardcoded from the quantiles below)
    # print(dataframe[cleaned_text_column].map(len).quantile([0.25, 0.5, 0.75, 0.95])) # 5, 8, 12, 18
    cleaned_data = dataframe[(dataframe[cleaned_text_column].map(len) >= 5) & (dataframe[cleaned_text_column].map(len) < 25)].copy()
    cleaned_data.reset_index(drop = True, inplace = True)
    # Join the tokens into a string again
    cleaned_data[cleaned_text_column] = cleaned_data[cleaned_text_column].apply(lambda x : ' '.join(x))

    return cleaned_data
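As a quick illustration of the length filter in step 6, the same boolean mask can be checked on a toy frame (the token lists here are invented):

```python
import pandas as pd

# Toy token lists: one too short, one in range, one too long
df = pd.DataFrame({'tokens': [['thanks'],
                              ['playlist', 'not', 'loading', 'on', 'android', 'app'],
                              ['word'] * 30]})

lengths = df['tokens'].map(len)
kept = df[(lengths >= 5) & (lengths < 25)].reset_index(drop = True)

print(len(kept))  # 1 - only the 6-token message survives
```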

Apply our function to the Spotify dataset

cleaned_dataset = text_processing(sorted_data, 'message_body')
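If you want to re-derive the length cut-offs for a different dataset rather than reuse the hardcoded ones, the quantile inspection from the function can be run on its own; a self-contained sketch with invented token counts:

```python
import pandas as pd

# Hypothetical per-message token counts
lengths = pd.Series([3, 4, 5, 6, 8, 8, 10, 12, 12, 14, 18, 22, 25, 40])

# Inspect the spread before hardcoding the cut-offs
print(lengths.quantile([0.25, 0.5, 0.75, 0.95]))
```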

Output of Text Processing


