Back to Basics: Fine-Tuning BERT for Sentiment Analysis

As I am trying to get more familiar with PyTorch (and eventually PyTorch Lightning), this tutorial serves a great purpose for me. It uses both HuggingFace and PyTorch, a combination I often see in NLP research! I will split this tutorial into two posts: steps 1 – 5 in this post and steps 6 – 7 in another. Creating the training and evaluation functions is a big step and is best covered in another post! Also, steps 6 and 7 could be substituted with PyTorch Lightning in the future!

  1. Load dependencies and datasets
  2. Tokenisation and data processing
  3. PyTorch DataLoader
  4. Bert classifier
  5. Initialise optimizer, loss function, and scheduler
  6. Training and Eval
  7. Eval on test set

1. Import dependencies, then download and load the dataset!

In [3]:
import os
import re
from tqdm import tqdm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline
In [4]:
import requests
request = requests.get("https://drive.google.com/uc?export=download&id=1wHt8PsMLsfX5yNSqrt2fSTcb8LEiclcf")
with open("data.zip", "wb") as file:
    file.write(request.content)

# Unzip data
import zipfile
with zipfile.ZipFile('data.zip') as zf:   # 'zf' avoids shadowing the built-in zip()
    zf.extractall('data')
In [5]:
# Load training data and set labels
data_complaint = pd.read_csv('data/complaint1700.csv')
data_complaint['label'] = 0
data_non_complaint = pd.read_csv('data/noncomplaint1700.csv')
data_non_complaint['label'] = 1

# Creating training data
data = pd.concat([data_complaint, data_non_complaint], axis=0).reset_index(drop=True)
data.drop(['airline'], inplace=True, axis=1)

# Load test data
test_data = pd.read_csv('data/test_data.csv')
test_data = test_data[['id', 'tweet']]
In [6]:
data.head()
Out[6]:
id tweet label
0 80938 @united I’m having issues. Yesterday I rebooke… 0
1 10959 @united kinda feel like the $6.99 you charge f… 0
2 130813 Livid in Vegas, delayed, again& again&… 0
3 146589 @united the most annoying man on earth is on m… 0
4 117579 @united The last 2 weeks I’ve flown wit u, you… 0
In [7]:
data['label'].value_counts()
Out[7]:
1    1700
0    1700
Name: label, dtype: int64
In [8]:
test_data.head()
Out[8]:
id tweet
0 33 @SouthwestAir get your damn act together. Don’…
1 58 @AmericanAir horrible at responding to emails….
2 135 @AmericanAir hey where is your crew? Flight aa…
3 159 Ok come on we are late let’s goooo @united
4 182 @AmericanAir since you are now affiliated with…

2. Train / Val split

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(data['tweet'], data['label'], test_size = 0.2, random_state = 123)

3. Set up GPU device

In [10]:
import torch

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
In [11]:
torch.cuda.get_device_name()
Out[11]:
'Tesla P100-PCIE-16GB'

4. Data Processing and Tokenisation for BERT

The encode_plus() method from the tokenizer handles everything for us! (A quick example follows after the tokenizer is loaded below.)

  • Tokenise text into tokens
  • Add special tokens specific for BERT
  • Convert tokens to indices
  • Pad / truncate sentences to max length
  • Create attention mask
In [16]:
# !pip install transformers
In [17]:
# Simple text preprocessing: remove entity mentions (e.g. '@united'), replace the '&amp;' entity, and collapse whitespace
def text_preprocessing(text):

    # Remove '@name' mentions
    text = re.sub(r'(@.*?)[\s]', ' ', text)

    # Replace '&amp;' with '&'
    text = re.sub(r'&amp;', '&', text)

    # Collapse repeated whitespace and strip leading/trailing spaces
    text = re.sub(r'\s+', ' ', text).strip()

    return text
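
As a quick illustration (the tweet below is made up, not from the dataset), the function strips the mention, decodes the '&amp;' entity, and collapses the extra whitespace:

print(text_preprocessing("@united delayed &amp; rebooked   again!"))
# -> delayed & rebooked again!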
In [18]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case = True)
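
To see what the bullet points above translate to in practice, here is a quick check on a made-up tweet (the sentence and the max_length of 16 are only for illustration; the real run below uses MAX_LEN = 60):

sample = "@united my flight was delayed &amp; rebooked again!"
encoded = tokenizer.encode_plus(
    text = text_preprocessing(sample),   # same cleaning as above
    add_special_tokens = True,           # adds [CLS] and [SEP]
    max_length = 16,
    padding = 'max_length',              # pad up to max_length
    truncation = True,
    return_attention_mask = True
)
print(encoded['input_ids'])        # token indices, padded with 0s
print(encoded['attention_mask'])   # 1 for real tokens, 0 for padding
print(tokenizer.convert_ids_to_tokens(encoded['input_ids']))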
In [19]:
def bert_preprocessing(data):

    # Initialise empty lists
    input_ids = []
    attention_masks = []

    # encode_plus with the text preprocessing above
    for sent in data:
        encoded_sent = tokenizer.encode_plus(
            text = text_preprocessing(sent),
            add_special_tokens = True,        # add [CLS] and [SEP]
            max_length = MAX_LEN,
            padding = 'max_length',           # pad to MAX_LEN (replaces the deprecated pad_to_max_length)
            return_attention_mask = True,
            truncation = True
        )

        input_ids.append(encoded_sent.get('input_ids'))
        attention_masks.append(encoded_sent.get('attention_mask'))

    # Convert lists to tensors
    input_ids = torch.tensor(input_ids)
    attention_masks = torch.tensor(attention_masks)

    return input_ids, attention_masks
In [20]:
MAX_LEN = 60

train_inputs, train_masks = bert_preprocessing(X_train)
val_inputs, val_masks = bert_preprocessing(X_val)
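
A quick sanity check on the resulting tensors; with 3,400 tweets, an 80/20 split, and MAX_LEN = 60, the shapes should work out as in the comments (assuming nothing above was changed):

print(train_inputs.shape, train_masks.shape)   # expect torch.Size([2720, 60]) for both
print(val_inputs.shape, val_masks.shape)       # expect torch.Size([680, 60]) for both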

5. Batching and Loading Data using PyTorch DataLoader

In [21]:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

train_labels = torch.tensor(y_train.values)
val_labels = torch.tensor(y_val.values)

batch_size = 32
In [22]:
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler = train_sampler, batch_size = batch_size)
In [23]:
val_data = TensorDataset(val_inputs, val_masks, val_labels)
val_sampler = SequentialSampler(val_data)   # no need to shuffle the validation set
val_dataloader = DataLoader(val_data, sampler = val_sampler, batch_size = batch_size)
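
To confirm the loaders return what the training loop will expect, we can peek at a single batch; this is a throwaway check, not part of the pipeline:

batch_input_ids, batch_masks, batch_labels = next(iter(train_dataloader))
print(batch_input_ids.shape)   # torch.Size([32, 60]) -> (batch_size, MAX_LEN)
print(batch_masks.shape)       # torch.Size([32, 60])
print(batch_labels.shape)      # torch.Size([32])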

6. BERT Classifier

In [24]:
import torch
import torch.nn as nn
from transformers import BertModel

class BertClassifier(nn.Module):
    def __init__(self, freeze_bert = False):
        super(BertClassifier, self).__init__()

        # Hidden size of BERT, hidden size of the classifier head, number of labels
        D_in, H, D_out = 768, 50, 2

        # BERT encoder
        self.bert = BertModel.from_pretrained('bert-base-uncased')

        # Two-layer classification head with ReLU
        self.classifier = nn.Sequential(
            nn.Linear(D_in, H),
            nn.ReLU(),
            nn.Linear(H, D_out)
        )

        # Optionally freeze BERT so only the classifier head is trained
        if freeze_bert:
            for param in self.bert.parameters():
                param.requires_grad = False

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids = input_ids, attention_mask = attention_mask)

        # outputs[0] is the last hidden state; take the [CLS] token (position 0) for classification
        last_hidden_state_cls = outputs[0][:, 0, :]

        logits = self.classifier(last_hidden_state_cls)

        return logits
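
If you only want to train the classification head (e.g. as a cheap baseline), the freeze_bert flag above takes care of it. A quick way to confirm how many parameters would actually be updated; the count in the comment follows from D_in = 768, H = 50, D_out = 2:

frozen = BertClassifier(freeze_bert = True)
trainable = sum(p.numel() for p in frozen.parameters() if p.requires_grad)
print(trainable)   # 768*50 + 50 + 50*2 + 2 = 38,552 (just the two Linear layers)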
In [26]:
from transformers import AdamW, get_linear_schedule_with_warmup

# bert classifier
bert_classifier = BertClassifier()
bert_classifier.to(device)

# optimiser
optimizer = AdamW(bert_classifier.parameters(), lr = 5e-5, eps=1e-8)

epochs = 4
# scheduler
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps = 0, num_training_steps = len(train_dataloader) * epochs)

# loss function
loss_fn = nn.CrossEntropyLoss()
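
Before moving on to the training and evaluation functions in the next post, a quick untrained forward pass on one batch confirms that the logits line up with CrossEntropyLoss; this is only a sanity check, not a training step:

batch_input_ids, batch_masks, batch_labels = next(iter(train_dataloader))
with torch.no_grad():
    logits = bert_classifier(batch_input_ids.to(device), batch_masks.to(device))
loss = loss_fn(logits, batch_labels.to(device))
print(logits.shape)   # torch.Size([32, 2]) -> (batch_size, num_classes)
print(loss.item())    # an untrained head on balanced classes should be close to ln(2) ≈ 0.69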