Fake News Classifier II

This final post covers loading the fine-tuned BERT model and performing inference on another Kaggle news dataset that I downloaded. I realised that many tutorials out there skip this step, assuming people can easily find the code to load a model and run inference. This blog post is to save future me the hassle of looking it up again!

The main output of this pipeline is a trained BERT fake news classifier!

First, load the saved model and check that everything works.

In [75]:
# Load
model = BertClassifier()
model.load_state_dict(torch.load('./bert-fake-news/bert-fake-news-classifier.pt'))
model.to(device)
Out[75]:
<All keys matched successfully>
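A common gotcha with the load above: a checkpoint saved on GPU will fail to load on a CPU-only machine unless you pass `map_location`. A minimal sketch, using a hypothetical tiny model as a stand-in for `BertClassifier` (not the real architecture):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for BertClassifier, just to demo the save/load round trip
model = nn.Linear(4, 2)
torch.save(model.state_dict(), "demo-checkpoint.pt")

# map_location remaps GPU tensors to CPU, so the load works on any machine
state = torch.load("demo-checkpoint.pt", map_location=torch.device("cpu"))
result = model.load_state_dict(state)
print(result)  # <All keys matched successfully>
```

`load_state_dict` returns the missing/unexpected key lists, which is where the "All keys matched successfully" message in the cell output comes from.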
In [78]:
# Compute predicted probabilities on the test set
test_probs = bert_predict(model, test_dataloader)

# Get predictions from the probabilities
threshold = 0.9
preds_saved_model = np.where(test_probs[:, 1] > threshold, 1, 0)

# Number of articles predicted fake
print("Number of articles predicted fake: ", preds_saved_model.sum())
Number of articles predicted fake:  2589
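The 0.9 threshold trades recall for precision: only articles whose class-1 (fake) probability exceeds 0.9 get a hard label of 1. A minimal sketch of the same `np.where` step on toy probabilities:

```python
import numpy as np

# Toy (n_samples, 2) probability matrix: columns are [P(real), P(fake)]
probs = np.array([[0.05, 0.95],
                  [0.40, 0.60],
                  [0.08, 0.92],
                  [0.70, 0.30]])

threshold = 0.9
preds = np.where(probs[:, 1] > threshold, 1, 0)
print(preds)        # [1 0 1 0]
print(preds.sum())  # 2 articles predicted fake
```

Note that the second row is more likely fake than real (0.60 vs 0.40) but still gets labelled 0, because it falls short of the 0.9 cutoff.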
In [79]:
test_df['label'] = preds_saved_model
test_df['label'].value_counts()
Out[79]:
0    2611
1    2589
Name: label, dtype: int64

Evaluation on the Fake and Real News dataset

This dataset makes a good benchmark for comparison: our trained logistic regression baseline achieved a 74% F1-score on it.

In [82]:
fake_df = pd.read_csv(root_dir + 'fake_real_news/Fake.csv')
In [84]:
real_df = pd.read_csv(root_dir + 'fake_real_news/True.csv')
In [85]:
fake_df['label'] = 1
real_df['label'] = 0
In [86]:
overall_df = pd.concat([real_df, fake_df]).reset_index(drop = True)
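The `reset_index(drop=True)` matters here: `pd.concat` keeps each frame's original index, so without it the combined frame would have duplicated row labels. A toy illustration:

```python
import pandas as pd

real = pd.DataFrame({"title": ["a", "b"]})
fake = pd.DataFrame({"title": ["c", "d"]})

# Without reset_index, both frames' original labels are kept
combined = pd.concat([real, fake])
print(list(combined.index))  # [0, 1, 0, 1] -- duplicated labels

# reset_index(drop=True) replaces them with a clean 0..n-1 range
combined = pd.concat([real, fake]).reset_index(drop=True)
print(list(combined.index))  # [0, 1, 2, 3]
```

Duplicated labels would make positional alignment with the prediction array ambiguous later on.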
In [88]:
overall_df['combined'] = overall_df['title'] + ' ' + overall_df['text']
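One caveat with the string concatenation above: if either `title` or `text` is NaN, the whole `combined` value becomes NaN, which would later break tokenization. A hedged sketch of a `fillna` guard (the guard is my addition, not part of the original pipeline):

```python
import pandas as pd

df = pd.DataFrame({"title": ["Headline", None],
                   "text": ["Body text", "Orphan body"]})

# fillna("") keeps rows with a missing title/text usable instead of NaN
df["combined"] = df["title"].fillna("") + " " + df["text"].fillna("")
print(df["combined"].tolist())  # ['Headline Body text', ' Orphan body']
```

Without the `fillna`, the second row's `combined` value would be NaN rather than a string.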
In [89]:
# Preparing the test data
overall_inputs, overall_masks = bert_preprocessing(overall_df['combined'])

# Create the DataLoader for our test set
overall_dataset = TensorDataset(overall_inputs, overall_masks)
overall_sampler = SequentialSampler(overall_dataset)
overall_dataloader = DataLoader(overall_dataset, sampler=overall_sampler, batch_size=32)
In [90]:
# Compute predicted probabilities on the test set
overall_probs = bert_predict(bert_classifier, overall_dataloader)

# Get predictions from the probabilities
threshold = 0.9
overall_preds = np.where(overall_probs[:, 1] > threshold, 1, 0)

# Number of articles predicted fake
print("Number of articles predicted fake: ", overall_preds.sum())

Number of articles predicted fake:  39035
In [93]:
from sklearn.metrics import confusion_matrix, classification_report
In [94]:
confusion_matrix(overall_df['label'], overall_preds)
Out[94]:
array([[ 5770, 15647],
       [   93, 23388]])
In [96]:
print(classification_report(overall_df['label'], overall_preds))
              precision    recall  f1-score   support

           0       0.98      0.27      0.42     21417
           1       0.60      1.00      0.75     23481

    accuracy                           0.65     44898
   macro avg       0.79      0.63      0.59     44898
weighted avg       0.78      0.65      0.59     44898
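The per-class numbers in the report can be recovered directly from the confusion matrix. For the fake class (label 1):

```python
import numpy as np

# Confusion matrix from above: rows = true label, columns = predicted label
cm = np.array([[5770, 15647],
               [93, 23388]])

tp = cm[1, 1]                    # fake articles correctly flagged as fake
precision = tp / cm[:, 1].sum()  # 23388 / 39035: of everything flagged fake, how much was fake
recall = tp / cm[1, :].sum()     # 23388 / 23481: of all fake articles, how many were flagged
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.6 1.0 0.75
```

This makes the failure mode concrete: the model flags nearly all fake articles (recall ≈ 1.00) but also flags 15,647 of the 21,417 real ones, dragging precision down to 0.60.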
Ryan

Data Scientist