Objective and Contribution
The paper investigates how annotators have introduced racial bias into hate speech detection datasets, increasing the harm against racial minorities, and proposes dialect and race priming as a way to reduce racial bias in annotation. The contributions of the paper are as follows:
Found an unexpected correlation between surface markers of African American English (AAE) and toxicity ratings in a few common hate speech datasets
Found that models trained on these datasets propagate these biases: AAE tweets are twice as likely to be labelled as offensive
Proposed dialect and race priming, which reduces racial bias in annotation by highlighting the inferred dialect of a tweet or the likely racial background of its author
The paper uses AAE dialect as a proxy for race, so each tweet is treated as either an AAE tweet or not. The rationale for choosing AAE is that it is commonly used by African Americans on social media. Dialect is estimated with a lexical detector built on words that are linked to either AAE or white-aligned English. Given a tweet, the detector outputs the probabilities that the tweet is AAE or white-aligned English.
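To make the idea of a lexical detector concrete, here is a minimal sketch that averages per-word dialect scores. It is not the detector used in the paper: the lexicon `WORD_AAE_PROB`, its placeholder tokens and scores, and the function `dialect_probabilities` are all hypothetical, and a real detector would learn word associations from a large corpus.

```python
from typing import Dict

# Hypothetical lexicon: an assumed P(AAE | word) for placeholder tokens.
# A real detector would learn these associations from data.
WORD_AAE_PROB: Dict[str, float] = {
    "token_a": 0.9,
    "token_b": 0.8,
    "token_c": 0.2,
    "token_d": 0.1,
}


def dialect_probabilities(tweet: str) -> Dict[str, float]:
    """Estimate P(AAE) and P(white-aligned) by averaging per-word scores."""
    words = tweet.lower().split()
    scores = [WORD_AAE_PROB[w] for w in words if w in WORD_AAE_PROB]
    # With no known words, fall back to an uninformative 0.5.
    p_aae = sum(scores) / len(scores) if scores else 0.5
    return {"aae": p_aae, "white_aligned": 1.0 - p_aae}


probs = dialect_probabilities("token_a token_b some_unknown_word")
```

Words missing from the lexicon are ignored, so the two probabilities always sum to one and depend only on the words the lexicon covers.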
Biases in Toxic Language Datasets
The analysis is conducted on two tweet datasets that are commonly used in hate speech detection:
DWMW17. 25K tweets with 3 offensive labels: hate speech, offensive, or none
FDCL18. 100K tweets with 4 offensive labels: hateful, abusive, spam, or none
The paper computes the correlation between each offensive label and the dialect probability that a tweet is AAE. The results are shown in the table below, with strong correlations between AAE tweets and several hate speech labels. For example, there is a 0.42 correlation between the AAE probability and the offensive label, meaning that the more likely a tweet is to be AAE, the more likely it is to have been labelled offensive.
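As a rough illustration of this kind of analysis, the sketch below computes a Pearson correlation between a binary label indicator and the AAE probability. The data are made up for the example and the helper `pearson` is an assumption; the paper's exact procedure may differ.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy data, invented for illustration: AAE probability and gold label per tweet.
p_aae = [0.9, 0.8, 0.7, 0.2, 0.1, 0.3]
labels = ["offensive", "offensive", "none", "none", "none", "offensive"]

# Turn one label into a 0/1 indicator, then correlate it with P(AAE).
is_offensive = [1.0 if label == "offensive" else 0.0 for label in labels]
r = pearson(p_aae, is_offensive)
```

A positive `r` says that tweets with a higher AAE probability carry the label more often. It does not say that most tweets with that label are AAE.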
Bias Propagation through Models
This investigation is broken down into two steps:
Find out the differences in false positives (FP) between AAE and white-aligned English tweets
Use the models trained on the DWMW17 and FDCL18 datasets to compute average rates of predicted toxicity on the Demographic16 and Userlevelrace18 Twitter datasets
The results are shown in the figure below. The left table shows the results of the first step, and the middle and right graphs show the results of the second step. The left table shows that although the models achieve high accuracy on both datasets, the DWMW17 classifier predicts around 46% of non-offensive AAE tweets as offensive. The FDCL18 classifier shows similar behaviour. Meanwhile, both classifiers tend to have a high FP rate on the ‘None’ category for white-aligned tweets, that is, they more often wrongly label white-aligned tweets as harmless. Together, these error patterns indicate that the models discriminate by dialect.
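The per-group false positive comparison in the first step can be sketched as follows. The records are toy values invented for the example, and `false_positive_rate` is a hypothetical helper rather than code from the paper.

```python
from typing import List, Tuple

# Each record: (dialect group, gold label, predicted label). Toy values only.
Record = Tuple[str, str, str]


def false_positive_rate(records: List[Record], group: str, label: str) -> float:
    """Share of tweets in `group` whose gold label is not `label`
    but which the model nevertheless predicted as `label`."""
    negatives = [r for r in records if r[0] == group and r[1] != label]
    if not negatives:
        return 0.0
    false_positives = sum(1 for r in negatives if r[2] == label)
    return false_positives / len(negatives)


records: List[Record] = [
    ("aae", "none", "offensive"),
    ("aae", "none", "none"),
    ("aae", "offensive", "offensive"),
    ("white_aligned", "none", "none"),
    ("white_aligned", "none", "none"),
    ("white_aligned", "offensive", "none"),
]

fpr_offensive_aae = false_positive_rate(records, "aae", "offensive")
fpr_offensive_white = false_positive_rate(records, "white_aligned", "offensive")
fpr_none_white = false_positive_rate(records, "white_aligned", "none")
```

Computing the rate separately per group and per label is what exposes the asymmetry: a model can be accurate overall while its errors fall on different labels for different dialects.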
The second step supports these findings: in the Demographic16 dataset, AAE tweets are twice as likely to be classified as offensive or abusive. Similar racial biases are found in the Userlevelrace18 dataset.
Effect of Dialect and Race Priming
An experiment is conducted using Amazon Mechanical Turk to assess the effect of dialect information on offensiveness ratings. The process is as follows:
Ask workers to classify whether a tweet is offensive to them (no / maybe / yes)
Ask workers to classify whether a tweet is offensive to anyone (no / maybe / yes)
Workers classify tweets under one of three conditions: a control condition with no additional information, dialect priming, and race priming.
In the dialect priming condition, the workers are given the tweet’s dialect (the probability that it is an AAE tweet) and are instructed to use the tweet’s dialect as a proxy for the author’s race. In the race priming condition, the workers are instructed to consider the most likely racial background of the author of the tweet based on its inferred dialect. The results are shown below:
The results show that priming the workers with dialect or race information reduces the number of AAE tweets classified as offensive. An additional finding is that annotators are more likely to classify a tweet as offensive to others than to themselves, which highlights the subjectivity of offensive language.
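One simple way to compare conditions is to compute the share of "yes" ratings in each. The sketch below does this on invented ratings; the condition names, the data, and the helper `offensive_share` are assumptions for illustration and do not reproduce the paper's numbers or its statistical analysis.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Each rating: (condition, answer), where answer is "no", "maybe" or "yes".
Rating = Tuple[str, str]


def offensive_share(ratings: List[Rating]) -> Dict[str, float]:
    """Fraction of ratings in each condition that answered "yes"."""
    totals: Counter = Counter()
    yes_counts: Counter = Counter()
    for condition, answer in ratings:
        totals[condition] += 1
        if answer == "yes":
            yes_counts[condition] += 1
    return {condition: yes_counts[condition] / totals[condition] for condition in totals}


# Invented ratings of AAE tweets, for illustration only.
ratings: List[Rating] = [
    ("control", "yes"), ("control", "yes"), ("control", "maybe"), ("control", "no"),
    ("dialect", "yes"), ("dialect", "no"), ("dialect", "no"), ("dialect", "maybe"),
    ("race", "yes"), ("race", "no"), ("race", "maybe"), ("race", "no"),
]

shares = offensive_share(ratings)
```

The same function can be run separately on the "offensive to you" and "offensive to anyone" answers to compare the two questions.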
Conclusion and Future Work
The results show that the proposed dialect and race priming reduces the likelihood that AAE tweets are labelled offensive. This tells us that we should pay extra attention to the annotation process to avoid unintended biases in hate speech detection.