Kaggle: Tweets Sentiment Extraction

I want to familiarize myself with the HuggingFace and PyTorch / PyTorch-Lightning libraries, and after reading plenty of documentation, it’s time to learn by using them in practice. Tweets Sentiment Extraction is a recently completed Kaggle competition where, given a tweet and its sentiment, you are required to predict the portion of the text that supports that sentiment. My strategy is to tackle multiple example projects, starting with a “simple” NLP task first. In this blog post, I work through a Medium article tutorial on using BERT for the Tweets Sentiment Extraction task.

1. Import dependencies + Read datafiles

In [96]:
import torch
import numpy as np
import pandas as pd

from transformers import BertTokenizer

from torch.utils.data import TensorDataset, random_split
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
In [8]:
train_df = pd.read_csv('./data/train.csv')
train_df.head()
Out[8]:
   textID     | text                                           | selected_text                       | sentiment
0  cb774db0d1 | I`d have responded, if I were going            | I`d have responded, if I were going | neutral
1  549e992a42 | Sooo SAD I will miss you here in San Diego!!!  | Sooo SAD                            | negative
2  088c60f138 | my boss is bullying me…                        | bullying me                         | negative
3  9642c003ef | what interview! leave me alone                 | leave me alone                      | negative
4  358bd9e861 | Sons of ****, why couldn`t they put them on t… | Sons of ****,                       | negative

2. Quick Data Analysis

In [9]:
print(len(train_df))
27481
In [10]:
# Check for missing data
train_df.isna().sum()
Out[10]:
textID           0
text             1
selected_text    1
sentiment        0
dtype: int64
In [38]:
# Remove rows with missing data
train_df.dropna(axis = 0, inplace = True)
train_df.reset_index(drop = True, inplace = True)
In [13]:
# A fairly balanced dataset, with the neutral class having roughly 3-4K more examples (we might still need class weights / resampling)
train_df['sentiment'].value_counts()
Out[13]:
neutral     11117
positive     8582
negative     7781
Name: sentiment, dtype: int64
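Although I don’t use it later in this post, the imbalance above could be handled with inverse-frequency weights, either as loss weights or as per-sample weights for a WeightedRandomSampler. A rough sketch:

# Hypothetical illustration (not used later in this post)
class_counts = train_df['sentiment'].value_counts()
class_weights = 1.0 / class_counts                               # rarer classes get larger weights
sample_weights = train_df['sentiment'].map(class_weights).values # one weight per row
# e.g. sampler = torch.utils.data.WeightedRandomSampler(sample_weights, num_samples = len(sample_weights))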

3. Encoding Input using BERT Tokenizer

Our inputs are the sentiment and the tweet text, so we need to encode them into BERT’s input format. This involves:

  1. Concatenating the sentiment and the tweet text together
  2. Encoding them with the BERT tokenizer (input_ids)
  3. Padding the encoded sequences so they are all the same length (pad_to_max_length = True)
  4. Differentiating word tokens from padding tokens, and the sentiment text from the tweet text
    • attention_masks differentiate word tokens from padding tokens
    • token_type_ids differentiate the sentiment text from the tweet text
In [25]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case = True)
In [31]:
tokenizer.decode(tokenizer.encode(train_df['sentiment'][0], train_df['text'][0]))
Out[31]:
'[CLS] neutral [SEP] i ` d have responded, if i were going [SEP]'
In [33]:
encoded = tokenizer.encode_plus(train_df['sentiment'][0], train_df['text'][0], add_special_tokens = True, max_length = 150, 
                      pad_to_max_length = True, return_token_type_ids = True, return_attention_mask = True, return_tensors = 'pt')

print(encoded['input_ids'])
print('------------------')
print(encoded['attention_mask'])
print('------------------')
print(encoded['token_type_ids'])
tensor([[ 101, 8699,  102, 1045, 1036, 1040, 2031, 5838, 1010, 2065, 1045, 2020,
         2183,  102,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0]])
------------------
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0]])
------------------
tensor([[0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0]])
In [83]:
input_ids = []
attention_masks = []
token_type_ids = []

for i in range(len(train_df)):
    encoded = tokenizer.encode_plus(train_df['sentiment'][i], train_df['text'][i], add_special_tokens = True, max_length = 150, 
                      pad_to_max_length = True, return_token_type_ids = True, return_attention_mask = True, return_tensors = 'pt')
    
    input_ids.append(encoded['input_ids'])
    attention_masks.append(encoded['attention_mask'])
    token_type_ids.append(encoded['token_type_ids'])
In [84]:
# concatenate all the elements into one tensor
input_ids = torch.cat(input_ids, dim = 0)
attention_masks = torch.cat(attention_masks, dim = 0)
token_type_ids = torch.cat(token_type_ids, dim = 0)
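As a side note, the row-by-row loop above could probably be replaced by a single call to the tokenizer’s batch_encode_plus, which accepts a list of (sentiment, text) pairs and stacks the outputs for us. A rough sketch (I haven’t verified this against every transformers version):

# Hypothetical alternative to the loop + torch.cat above
pairs = list(zip(train_df['sentiment'], train_df['text']))
encoded = tokenizer.batch_encode_plus(pairs, add_special_tokens = True, max_length = 150,
                      pad_to_max_length = True, return_tensors = 'pt')
# For BERT, encoded should already contain (num_rows, 150) tensors under
# 'input_ids', 'attention_mask' and 'token_type_ids', so no torch.cat is needed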

4. Find the start and end positions of the selected text that supports the sentiment

In [85]:
input_ids = input_ids.numpy()
attention_masks = attention_masks.numpy()
token_type_ids = token_type_ids.numpy()

selected_text = train_df['selected_text'].apply(str)
selected_text = selected_text.values

start_pos = []
end_pos = []
count = 0

length_text = len(input_ids)
print(length_text)

i = 0
while i < length_text:

    # Tokenize the selected text and grab its first (and, if present, second) token id
    text_ids = input_ids[i]
    selected_text_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(selected_text[i], add_special_tokens=True))
    first = selected_text_ids[0]

    if len(selected_text_ids) == 1:
        second = -1
    else:
        second = selected_text_ids[1]

    # Scan the encoded tweet for where the selected text begins; ctr flips to +1 once a match is found
    pos = -1
    ctr = -1
    for j in range(len(text_ids) - 1):
        pos += 1
        if second == -1:
            if text_ids[j] == first:
                ctr *= -1
                start_pos.append(pos)
                break
        else:
            if text_ids[j] == first and text_ids[j+1] == second:
                ctr *= -1
                start_pos.append(pos)
                break
    if ctr == -1:
        # No match (e.g. the selected text starts mid-word, so its tokens differ): drop this row and keep the same index
        count += 1
        selected_text = np.delete(selected_text, i)
        input_ids = np.delete(input_ids, i, axis=0)
        attention_masks = np.delete(attention_masks, i, axis=0)
        token_type_ids = np.delete(token_type_ids, i, axis=0)

        length_text -= 1
        i -= 1
    else:
        # Match found: the end position is the start plus the selected-text length
        end_pos.append(pos + len(selected_text_ids) - 1)
    i += 1

print("count", count)
27480
count 936
In [91]:
input_ids[0]
Out[91]:
array([ 101, 8699,  102, 1045, 1036, 1040, 2031, 5838, 1010, 2065, 1045,
       2020, 2183,  102,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0])
In [90]:
tokenizer.convert_tokens_to_ids(tokenizer.tokenize(train_df['selected_text'][0], add_special_tokens = True))
Out[90]:
[1045, 1036, 1040, 2031, 5838, 1010, 2065, 1045, 2020, 2183]
In [92]:
start_pos[0]
Out[92]:
3
In [93]:
end_pos[0]
Out[93]:
12
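As a quick sanity check (not part of the original tutorial), we can decode the token span from start_pos[0] to end_pos[0] in input_ids[0] and confirm it lines up with the selected text:

# Decode tokens 3..12 of the first example; this should read like the selected_text above
print(tokenizer.decode(input_ids[0][start_pos[0] : end_pos[0] + 1].tolist()))
# i ` d have responded, if i were going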

5. PyTorch’s DataLoaders

In [94]:
# Convert all NumPy arrays / lists into long tensors
# (BERT expects integer input_ids, and the start/end positions are indices)
input_ids = torch.tensor(input_ids, dtype = torch.long)
attention_masks = torch.tensor(attention_masks, dtype = torch.long)
token_type_ids = torch.tensor(token_type_ids, dtype = torch.long)
start_pos = torch.tensor(start_pos, dtype = torch.long)
end_pos = torch.tensor(end_pos, dtype = torch.long)
In [97]:
# Create a TensorDataset and split it into train and validation sets
dataset = TensorDataset(input_ids, attention_masks, token_type_ids, start_pos, end_pos)

train_split = int(0.9 * len(dataset))
val_split = len(dataset) - train_split

print("train_split", train_split)
print("val_split", val_split)

trainset, valset = random_split(dataset, [train_split, val_split])
train_split 23889
val_split 2655
In [98]:
batch_size = 64

# batching our data
trainloader = DataLoader(trainset, batch_size, sampler = RandomSampler(trainset))
valloader = DataLoader(valset, batch_size, sampler = SequentialSampler(valset))
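The tutorial stops at the DataLoaders, but to give an idea of where this is heading, here is a minimal sketch of how these batches could feed a span-prediction head such as transformers’ BertForQuestionAnswering (my illustration, not necessarily the model used in the original article; the QA head starts from random weights):

from transformers import BertForQuestionAnswering

# Minimal sketch: one forward pass with the span labels as supervision
model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

batch = next(iter(trainloader))
b_input_ids, b_attention_masks, b_token_type_ids, b_start_pos, b_end_pos = [t.to(device) for t in batch]

outputs = model(b_input_ids,
                attention_mask = b_attention_masks,
                token_type_ids = b_token_type_ids,
                start_positions = b_start_pos,
                end_positions = b_end_pos)
loss = outputs[0]   # combined cross-entropy over the start and end positions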