Fake News Classifier II
This last post covers loading the fine-tuned BERT model and performing inference on another Kaggle news dataset that I downloaded. I realised that many tutorials out there skip this step, assuming people can easily find the code to load a model and run inference. This blog post is to save my future self the hassle of looking this up again!
The main output of this pipeline is a trained BERT fake news classifier!
Load the trained model to test if everything is working!
# Load the saved weights into a fresh model instance
model = BertClassifier()
model.load_state_dict(torch.load('./bert-fake-news/bert-fake-news-classifier.pt'))
model.to(device)
<All keys matched successfully>
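The `bert_predict` helper below comes from the previous post. In case this post is read standalone, here is a minimal sketch of what such a helper typically looks like; the exact batch layout and model output shape are assumptions:

```python
import torch
import torch.nn.functional as F

def bert_predict(model, dataloader, device='cpu'):
    """Run inference over a DataLoader and return class probabilities.

    Assumes each batch yields (input_ids, attention_mask) and that the
    model returns raw logits of shape (batch_size, num_classes).
    """
    model.eval()
    all_logits = []
    with torch.no_grad():  # no gradients needed for inference
        for batch in dataloader:
            input_ids, attention_mask = (t.to(device) for t in batch)
            logits = model(input_ids, attention_mask)
            all_logits.append(logits)
    # Softmax over the class dimension -> probabilities as a NumPy array
    return F.softmax(torch.cat(all_logits), dim=1).cpu().numpy()
```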
# Compute predicted probabilities on the test set
test_probs = bert_predict(model, test_dataloader)

# Get predictions from the probabilities
threshold = 0.9
preds_saved_model = np.where(test_probs[:, 1] > threshold, 1, 0)

# Number of articles predicted fake
print("Number of articles predicted fake: ", preds_saved_model.sum())

Number of articles predicted fake:  2589
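Note that the 0.9 threshold is stricter than the usual argmax: an article is labelled fake only when the model assigns it at least 90% probability. A toy illustration with made-up probabilities:

```python
import numpy as np

# Hypothetical fake-news (class 1) probabilities for five articles
probs_class1 = np.array([0.95, 0.60, 0.91, 0.10, 0.89])

# Argmax over two classes is equivalent to a 0.5 threshold
print(np.where(probs_class1 > 0.5, 1, 0))  # -> [1 1 1 0 1]

# A 0.9 threshold only flags high-confidence predictions
print(np.where(probs_class1 > 0.9, 1, 0))  # -> [1 0 1 0 0]
```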
test_df['label'] = preds_saved_model
test_df['label'].value_counts()
0    2611
1    2589
Name: label, dtype: int64
Evaluation on the Fake and Real News dataset
This makes a perfect comparison and evaluation dataset: we previously achieved a 74% F1-score on it with our trained logistic regression.
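As a quick reminder before comparing numbers, the F1-score is the harmonic mean of precision and recall, so a model cannot score well by inflating only one of the two. A small sketch with hypothetical values:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# e.g. a hypothetical model with 0.60 precision and 1.00 recall
print(round(f1(0.60, 1.00), 2))  # -> 0.75
```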
fake_df = pd.read_csv(root_dir + 'fake_real_news/Fake.csv')
real_df = pd.read_csv(root_dir + 'fake_real_news/True.csv')
fake_df['label'] = 1
real_df['label'] = 0
overall_df = pd.concat([real_df, fake_df]).reset_index(drop=True)
overall_df['combined'] = overall_df['title'] + ' ' + overall_df['text']
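One caveat with this kind of column concatenation, shown here on a made-up two-row frame: if either `title` or `text` is missing, pandas propagates NaN into `combined`, which would later break tokenisation. Filling missing values first avoids that:

```python
import pandas as pd

df = pd.DataFrame({
    'title': ['Breaking news', 'Another story'],
    'text':  ['Body one.', None],  # a missing article body
})

# Naive concatenation propagates NaN when either column is missing
df['combined'] = df['title'] + ' ' + df['text']
print(df['combined'].isna().tolist())  # -> [False, True]

# Filling missing values first keeps every row usable
df['combined'] = df['title'].fillna('') + ' ' + df['text'].fillna('')
print(df['combined'].tolist())  # -> ['Breaking news Body one.', 'Another story ']
```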
# Preparing the test data
overall_inputs, overall_masks = bert_preprocessing(overall_df['combined'])

# Create the DataLoader for our test set
overall_dataset = TensorDataset(overall_inputs, overall_masks)
overall_sampler = SequentialSampler(overall_dataset)
overall_dataloader = DataLoader(overall_dataset, sampler=overall_sampler, batch_size=32)
# Compute predicted probabilities on the evaluation set, using the loaded model
overall_probs = bert_predict(model, overall_dataloader)

# Get predictions from the probabilities
threshold = 0.9
overall_preds = np.where(overall_probs[:, 1] > threshold, 1, 0)

# Number of articles predicted fake
print("Number of articles predicted fake: ", overall_preds.sum())

Number of articles predicted fake:  39035
from sklearn.metrics import confusion_matrix, classification_report

confusion_matrix(overall_df['label'], overall_preds)

array([[ 5770, 15647],
       [   93, 23388]])

print(classification_report(overall_df['label'], overall_preds))

              precision    recall  f1-score   support

           0       0.98      0.27      0.42     21417
           1       0.60      1.00      0.75     23481

    accuracy                           0.65     44898
   macro avg       0.79      0.63      0.59     44898
weighted avg       0.78      0.65      0.59     44898
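The report can be sanity-checked directly from the confusion matrix above. Reading it as true negatives 5770, false positives 15647, false negatives 93, and true positives 23388, the class-1 metrics fall out of the standard formulas:

```python
tn, fp, fn, tp = 5770, 15647, 93, 23388

precision = tp / (tp + fp)   # fraction of predicted fakes that are truly fake
recall = tp / (tp + fn)      # fraction of true fakes that were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
# -> 0.6 1.0 0.75 0.65
```

So at this threshold the classifier catches almost every fake article (recall 1.00 on class 1) but flags most real articles as fake too (recall 0.27 on class 0), which is what drags the overall accuracy down to 65%.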