Exploratory Data Analysis (EDA)
Before we jump into any data processing or machine learning, we first need to understand the data we are dealing with. This step is known as exploratory data analysis (EDA). There are 15,000 samples and 7 columns in this customer support dataset. Below is a snapshot of the first 10 rows of the dataset and all the column names:
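A first look like this can be sketched with pandas. This is a minimal sketch: in the real project the frame would come from something like `pd.read_csv(...)`, but here a tiny toy frame with the five columns discussed below stands in, and all values are invented for illustration.

```python
import pandas as pd

# Toy frame mirroring the dataset's schema (column names from the post,
# values invented). The real dataset has 15,000 rows and 7 columns.
df = pd.DataFrame({
    "message_type": ["email", "chat", "ticket"],
    "author_id": ["u001", "u002", "u001"],
    "severity": ["low", "urgent", "medium"],
    "message_body": ["Refund request", "App keeps crashing", "Cannot log in"],
    "created_at": ["2023-01-05 09:15:00", "2023-01-06 14:02:00", "2023-01-04 18:30:00"],
})

print(df.shape)             # (number of rows, number of columns)
print(df.columns.tolist())  # all column names
print(df.head(10))          # first 10 rows (here only 3 exist)
```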
Connecting the dots
When you first look at the dataset, you should ask yourself: which column(s) would be useful given the problem context? Out of the 7 columns, I believe the message_type, author_id, severity, message_body, and created_at columns could be very useful. Here’s why:
- message_type: The channel your customers use to contact the business. Knowing which channel is used most frequently is an important insight, as it tells the business where to focus when developing its customer support service.
- author_id: Customer segmentation is important in B2C businesses, and the connection between author_id and message_body allows us to group together customers who raise similar support messages. It also allows us to track the number of support messages raised by each author.
- severity: Tells us how urgent the message is. This is particularly useful when paired with topic modelling results, giving the business a clear idea of which topics or areas need the most urgent attention.
- message_body: The main text of the support message.
- created_at: The time at which the support message was created by the customer.
Our EDA’s Results
There are 10,766 unique author ids in the dataset, which tells us that some users have raised more than one support message. There are 3 different message types: email, chat, and ticket. All three types have roughly equal representation in the dataset (around 3,700 per type). Similarly, there are 3 different severity levels: low, medium, and urgent, and all three levels have roughly equal representation (around 5,000 per level). A further EDA question could be: how many urgent messages come from each of email, chat, and ticket?
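That last question can be answered with a cross-tabulation of channel against severity. The sketch below uses a toy sample standing in for the real 15,000-row dataset (column names from the post, values invented):

```python
import pandas as pd

# Toy sample; values invented for illustration.
df = pd.DataFrame({
    "message_type": ["email", "chat", "ticket", "email", "chat"],
    "severity":     ["urgent", "urgent", "low", "medium", "urgent"],
})

# Count unique authors, then cross-tabulate channel against severity
# to see which channels the urgent messages come from.
print(df["message_type"].nunique())
counts = pd.crosstab(df["message_type"], df["severity"])
print(counts)
print(counts["urgent"])  # urgent messages per channel
```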
There is missing data in the dataset. Out of the 7 columns, 3 have missing values:
- 3,752 missing values
- 5,056 missing values
- 14 missing values
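A missing-value check like this is typically done with `isna().sum()`. The sketch below uses a toy frame with deliberately missing entries; the real dataset has the 3,752 / 5,056 / 14 gaps listed above.

```python
import pandas as pd

# Toy frame with deliberate gaps (None) to illustrate the check;
# values invented for illustration.
df = pd.DataFrame({
    "message_type": ["email", None, "chat"],
    "severity": ["low", "medium", None],
    "message_body": ["Refund request", "App keeps crashing", "Cannot log in"],
})

missing = df.isna().sum()    # missing count per column
print(missing[missing > 0])  # only the columns that actually have gaps
```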
I converted the created_at column to a datetime object and created new columns: year, month, day, and hour. This will allow me to cluster these messages by time. I also sorted the dataset by date.
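The conversion and feature extraction can be sketched with pandas' `to_datetime` and the `.dt` accessor (toy timestamps; the real values come from the support dataset):

```python
import pandas as pd

# Toy created_at values, invented for illustration.
df = pd.DataFrame({
    "created_at": ["2023-03-15 09:15:00", "2023-01-04 18:30:00"],
})

# Convert to datetime, derive calendar features, then sort chronologically.
df["created_at"] = pd.to_datetime(df["created_at"])
df["year"] = df["created_at"].dt.year
df["month"] = df["created_at"].dt.month
df["day"] = df["created_at"].dt.day
df["hour"] = df["created_at"].dt.hour
df = df.sort_values("created_at").reset_index(drop=True)
print(df)
```

Sorting before resetting the index keeps the earliest message in row 0, which makes later time-based grouping straightforward.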
Output of EDA