Kaggle: Tweets Sentiment Extraction

I want to familiarise myself with the HuggingFace and PyTorch / PyTorch-Lightning libraries, and after reading plenty of documentation it’s time to learn by doing, i.e. using them in practice. Tweet Sentiment Extraction is a recently completed Kaggle competition where, given a tweet’s text and its sentiment, you are required to predict the portion of the text that expresses that sentiment. My strategy is to tackle multiple example projects, starting with this “simple” NLP task first.
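To make the task concrete, here is the shape of a single example (illustrative values in the style of the competition data, not an actual dataset row):

In [ ]:
# Illustrative example of the task format (made-up values, not a real dataset row)
text = "happy birthday! hope you have a wonderful day"
sentiment = "positive"
selected_text = "happy birthday!"   # target: the span of `text` that expresses the sentiment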

6. Initialise BERT QA Model + Optimiser

In [20]:
from transformers import BertForQuestionAnswering, BertConfig, AdamW, get_linear_schedule_with_warmup
In [21]:
model = BertForQuestionAnswering.from_pretrained('bert-base-uncased', output_attentions = False, output_hidden_states = False)
In [22]:
optimizer = AdamW(model.parameters(), lr = 2e-5, eps = 1e-8)

epochs = 3
num_train_steps = epochs * len(trainloader)

scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps = 0, num_training_steps = num_train_steps)
In [23]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)  # the model must live on the same device as the batches

7. Training

  1. Turn training mode on (at the start of every epoch)
  2. For each epoch, for each batch:
       1. Move the batch data to the device (GPU / CPU)
       2. Reset the gradients to zero
       3. Forward pass
       4. Backward prop
       5. Accumulate the batch loss
       6. Clip the gradients
       7. Update the weights + learning rate
  3. At the end of each epoch, record the average training loss and run a validation pass
In [24]:
def train():
    training_loss = []
    val_loss = []
    
    for i in range(epochs):
        print("epoch: %s" % i)
        
        # turn on training mode at the start of every epoch
        # (the validation pass at the end of each epoch switches to eval mode)
        model.train()
        batch_loss = 0
        for step, batch in enumerate(trainloader):
            if step % 50 == 0:
                print("batch: %s / %s" % (step, len(trainloader)))  # progress indicator
            
            # get the batch data
            input_id = batch[0].to(device)
            attention_mask = batch[1].to(device)
            token_type_id = batch[2].to(device)
            start_pos = batch[3].to(device)
            end_pos = batch[4].to(device)
            
            # reset gradient to zero
            model.zero_grad()
            
            # forward pass using BERT QA
            loss, start_scores, end_scores = model(input_id.long(),
                                                   attention_mask = attention_mask.long(),
                                                   token_type_ids = token_type_id.long(),
                                                   start_positions = start_pos.long(),
                                                   end_positions = end_pos.long())
            
            # Backward prop
            loss.backward()
            # accumulating epoch loss per batch
            batch_loss += loss.item()
            
            # gradient normalisation through clipping to prevent gradient explosion
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            
            # update weights
            optimizer.step()
            
            # adjust learning rate
            scheduler.step()
        
        average_batch_loss = batch_loss / len(trainloader)
        training_loss.append(average_batch_loss) # save training loss at the end of the epoch
        
        # switch to eval mode for validation (disables dropout)
        model.eval()
        
        batch_eval_loss = 0
        
        for step, batch in enumerate(valloader):
            input_id = batch[0].to(device)
            attention_mask = batch[1].to(device)
            token_type_id = batch[2].to(device)
            start_pos = batch[3].to(device)
            end_pos = batch[4].to(device)
            
            with torch.no_grad():
                loss, start_scores, end_scores = model(input_id.long(),
                                                       attention_mask = attention_mask.long(),
                                                       token_type_ids = token_type_id.long(),
                                                       start_positions = start_pos.long(),
                                                       end_positions = end_pos.long())
                
            batch_eval_loss += loss.item()
        
        average_batch_eval_loss = batch_eval_loss / len(valloader)
        val_loss.append(average_batch_eval_loss)
    
    # return the per-epoch losses so they can be inspected / plotted later
    return training_loss, val_loss
In [25]:
training_loss, val_loss = train()
In [ ]:
# save the full model object, its weights, and the optimizer state
checkpoint = {'model': model,
              'state_dict': model.state_dict(),
              'optimizer' : optimizer.state_dict()}

torch.save(checkpoint, 'trained_model.pth')
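
For later reuse, a minimal sketch of how this checkpoint could be restored (an assumed usage pattern relying only on torch.load and load_state_dict; not part of the original notebook run):

In [ ]:
# Minimal sketch: restore the checkpoint saved above (assumed usage, not from the original run)
checkpoint = torch.load('trained_model.pth', map_location = device)

model = checkpoint['model']                       # the pickled model object
model.load_state_dict(checkpoint['state_dict'])   # the trained weights
model.to(device)
model.eval()

optimizer.load_state_dict(checkpoint['optimizer'])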

8. Inference on the Test Set

  1. Process the test data the same way as the training data
  2. Create testloader (batching test data)
  3. Model inference on test set
  4. Output results to a .csv file for submission
In [ ]:
test_df = pd.read_csv('./data/test.csv')

# Remove rows with missing data (the test set should contain none; if any rows
# were dropped, the positional indexing into the sample submission below would misalign)
test_df.dropna(axis = 0, inplace = True)
test_df.reset_index(drop = True, inplace = True)
In [ ]:
test_input_ids = []
test_attention_masks = []
test_token_type_ids = []

for i in range(len(test_df)):
    encoded = tokenizer.encode_plus(test_df['sentiment'][i],
                                    test_df['text'][i],
                                    add_special_tokens = True,
                                    max_length = 150,
                                    pad_to_max_length = True,
                                    return_token_type_ids = True,
                                    return_attention_mask = True,
                                    return_tensors = 'pt')
    
    test_input_ids.append(encoded['input_ids'])
    test_attention_masks.append(encoded['attention_mask'])
    test_token_type_ids.append(encoded['token_type_ids'])
In [ ]:
# concatenate all the elements into one tensor
test_input_ids = torch.cat(test_input_ids, dim = 0)
test_attention_masks = torch.cat(test_attention_masks, dim = 0)
test_token_type_ids = torch.cat(test_token_type_ids, dim = 0)
In [ ]:
testset = TensorDataset(test_input_ids, test_attention_masks, test_token_type_ids)
batch_size = 64

testloader = DataLoader(testset,
                        batch_size = batch_size,
                        sampler = SequentialSampler(testset))
In [ ]:
df_submit = pd.read_csv("./data/sample_submission.csv")
In [ ]:
def test():
    key = 0
    model.eval()

    for step, batch in enumerate(testloader):

        input_id = batch[0].to(device)
        attention_mask = batch[1].to(device)
        token_type_id = batch[2].to(device)

        with torch.no_grad():
            start_scores, end_scores = model(input_id,
                                             attention_mask = attention_mask,
                                             token_type_ids = token_type_id)
        
        for i in range(input_id.shape[0]):
            all_tokens = tokenizer.convert_ids_to_tokens(input_id[i])
            # take the tokens between the highest-scoring start and end positions
            answer = ' '.join(all_tokens[torch.argmax(start_scores[i]) : torch.argmax(end_scores[i]) + 1])
            df_submit.loc[key, 'selected_text'] = answer
            key += 1

test()
In [ ]:
print(df_submit)
df_submit.to_csv('bert_submission.csv', index = False)
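
One caveat on the predicted spans: joining tokens with spaces leaves WordPiece continuation markers ('##') in the output, so e.g. 'birthday' may come back as 'birth ##day'. A small post-processing helper along these lines could merge them back into whole words (an illustrative sketch, not part of the original pipeline):

In [ ]:
# Illustrative helper (assumption, not from the original notebook):
# merge WordPiece continuation tokens back into whole words
def merge_wordpieces(answer):
    return answer.replace(' ##', '')

print(merge_wordpieces('birth ##day to you'))   # -> 'birthday to you'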