Objective and Contribution
We present Will-They-Won’t-They (WT-WT), the largest stance detection dataset to date, containing 51,284 tweets. All annotations were manually labelled by experts, ensuring high-quality evaluation of models. We also applied eleven existing SOTA models to our dataset and show that they struggle with it, indicating the dataset is a useful benchmark for developing stronger models. Lastly, we annotated an additional M&A operation in the entertainment industry and explored the robustness of our best-performing models when applied to a different domain. We observed that our models struggle to adapt to even a small domain shift.
The dataset covers stance detection for rumour verification in finance, specifically around mergers and acquisitions (M&A). We chose this setting because an M&A process has many stages, and the evolution of Twitter users’ opinions about each stage resembles that of rumour verification. Building the WT-WT dataset involved four steps and covers five different M&A operations, as shown in the figure below.
Data Retrieval
Here, for each operation, we first used Selenium to retrieve the IDs of the following tweets:
Tweets that mentioned both companies’ names or acronyms
Tweets that mentioned one of the companies with pre-defined merger-specific terms
The date range covers one year before the merger was first proposed and six months after it took place. We then used Tweepy to retrieve the text of the tweets from their IDs.
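The two selection criteria above can be expressed as a simple keyword filter. Below is a minimal sketch, assuming invented keyword lists for the CVS-Aetna operation (the paper's actual lexicons are not reproduced here):

```python
import re

# Hypothetical keyword lists for one operation (CVS-Aetna); the real
# lists used to build WT-WT are assumptions here, not the originals.
COMPANY_TERMS = {"cvs", "aetna", "aet"}
MERGER_TERMS = {"merger", "acquisition", "takeover", "deal", "buyout"}

def tokenize(text: str) -> set:
    """Lowercase a tweet and split it into word-like tokens."""
    return set(re.findall(r"[a-z0-9$#@']+", text.lower()))

def matches_operation(text: str) -> bool:
    """A tweet qualifies if it mentions both companies, or one
    company together with a merger-specific term."""
    tokens = tokenize(text)
    both_companies = "cvs" in tokens and bool({"aetna", "aet"} & tokens)
    one_with_term = bool(tokens & COMPANY_TERMS) and bool(tokens & MERGER_TERMS)
    return both_companies or one_with_term
```

In practice the matched tweets would only yield IDs at this stage; the text itself is hydrated afterwards via the Twitter API.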
Task Definition and Annotation Guidelines
We have four stance labels:
Support. Tweets that support the merger
Refute. Tweets that express doubts about the merger happening
Comment. Tweets that comment on the merger without supporting or refuting it
Unrelated. Tweets that are unrelated to the merger
The same sample can have different labels depending on the target entity. In addition, our stance detection task differs from targeted sentiment analysis, as someone could express a sentiment towards a merger without indicating whether they believe it will happen.
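To make the target dependence concrete, here is a minimal sketch with an invented tweet and invented ticker pairs, showing how one sample maps to a different label for each target operation:

```python
# Invented example: the same tweet, annotated against two different
# merger operations using the four-class scheme above.
tweet = "CVS will seal the Aetna deal, but the CI-ESRX talks look dead."

annotations = {
    ("CVS", "AET"): "support",   # asserts the CVS-Aetna merger will happen
    ("CI", "ESRX"): "refute",    # expresses doubt about Cigna-Express Scripts
}

def label_for(buyer: str, target: str) -> str:
    """Stance of the tweet toward one operation; operations the
    tweet does not discuss default to 'unrelated'."""
    return annotations.get((buyer, target), "unrelated")
```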
The data annotation was carried out by 10 finance academics from the University of Cambridge in batches of 2,000 samples.
The average agreement between annotator pairs is 0.67, indicating good data quality. We also asked a domain expert to label a sample of 3,000 tweets and used that as a human upper bound for evaluation. Support and comment samples cause the most disagreement between annotators, as we believe such samples are largely subjective. The inclusion of the unrelated tag caused higher disagreement between unrelated and comment samples, making our dataset more challenging.
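A pairwise agreement score of this kind can be computed with scikit-learn's Cohen's kappa. A minimal sketch with toy annotator labels (not the real annotations):

```python
from sklearn.metrics import cohen_kappa_score

# Toy labels from two hypothetical annotators over the four classes
annotator_a = ["support", "refute", "comment", "comment", "unrelated", "support"]
annotator_b = ["support", "refute", "comment", "support", "unrelated", "support"]

# Kappa is chance-corrected: 1.0 = perfect agreement, 0.0 = chance level
kappa = cohen_kappa_score(annotator_a, annotator_b)
```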
The figure below shows the distribution of labels for each M&A operation. We observed a correlation between the relative proportion of refuting and supporting samples and whether the merger was eventually approved or blocked. Across all operations, commenting tweets are more frequent than supporting ones.
Comparison with Existing Corpora
Here, we compare our dataset to existing datasets, as shown in the table below. As mentioned, our dataset is the largest stance detection dataset by a wide margin. Beyond its size, our annotation process involved highly skilled domain experts rather than crowd-sourcing. In addition, our dataset spans different domains, enabling cross-domain research.
Experiments and Results
We selected and re-implemented 11 strong models previously used for stance detection. The results are displayed below. SiamNet performed the best in both macro-averaged and weighted-averaged F1 scores. As expected, SVM provides a strong baseline for stance detection. Looking at per-class results, models show a relatively high number of misclassifications between the support and comment classes, and the inclusion of linguistic features seems to reduce these misclassifications. CharCNN obtains the best performance on unrelated samples, suggesting that future architectures should incorporate character-level information.
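The two aggregation schemes differ in how per-class F1 scores are combined, and can diverge when classes are imbalanced. A minimal sketch with invented predictions, using scikit-learn:

```python
from sklearn.metrics import f1_score

# Invented toy labels: the comment class dominates, as in WT-WT
y_true = ["comment", "comment", "comment", "comment", "support", "refute"]
y_pred = ["comment", "comment", "comment", "comment", "comment", "refute"]

# Macro-F1 averages per-class F1 equally; weighted-F1 weights each class
# by its frequency, so the dominant class (comment) pulls the score up.
macro = f1_score(y_true, y_pred, average="macro")
weighted = f1_score(y_true, y_pred, average="weighted")
```

Reporting both, as the paper does, guards against a model scoring well simply by doing well on the most frequent class.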
Robustness over Domain Shifts
Here, we explore how our best models perform in a cross-domain experiment on an M&A event in the entertainment industry. The results are displayed below. Models show strong performance when trained and tested on the same domain. When a model is trained on the health dataset and tested on the entertainment dataset (or vice versa), we observe a significant drop in performance, showing that our models have strong domain dependency.
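The cross-domain setup amounts to training on one domain and evaluating on another. A minimal illustration with invented toy corpora and a TF-IDF plus linear SVM pipeline (a stand-in baseline, not the paper's models):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy corpora standing in for the health and entertainment domains;
# their vocabularies barely overlap, mimicking a domain shift.
domains = {
    "health": (
        ["deal will close approved", "deal blocked dead doubt",
         "what do people think about the deal", "confident the deal gets approved"],
        ["support", "refute", "comment", "support"],
    ),
    "entertainment": (
        ["studio purchase announcement imminent", "studio purchase collapsing rumor",
         "curious how the studio purchase plays out", "studio purchase looks certain now"],
        ["support", "refute", "comment", "support"],
    ),
}

def cross_domain_f1(train: str, test: str) -> float:
    """Train a simple baseline on one domain, report macro-F1 on another."""
    x_tr, y_tr = domains[train]
    x_te, y_te = domains[test]
    model = make_pipeline(TfidfVectorizer(), LinearSVC())
    model.fit(x_tr, y_tr)
    return f1_score(y_te, model.predict(x_te), average="macro")
```

With disjoint vocabularies, the out-of-domain test documents map to near-empty feature vectors, so the score collapses relative to the in-domain case, which is the qualitative effect reported above.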
Conclusion and Future Work
We show that existing SOTA models perform around 10% below the human upper bound on our dataset. Potential future research could explore transformer-based models and alternative architectures on the dataset. In addition, the dataset spans multiple domains, enabling future research in cross-target and cross-domain stance detection.