Analyzing the Real Vulnerability of Hate Speech Detection Systems against Targeted Intentional Noise

Piush Aggarwal, Torsten Zesch


Abstract
Hate speech detection systems have been shown to be vulnerable to obfuscation attacks, where a potential hater tries to circumvent detection by deliberately introducing noise into a post. In previous work, noise is often introduced into all words (which likely overestimates the impact) or into single, untargeted words (which likely underestimates the vulnerability). We perform a user study asking people to select the words they would obfuscate in a post. Using this realistic setting, we find that the real vulnerability of hate speech detection systems to deliberately introduced noise is almost as high as under a white-box attack and much more severe than when using a non-targeted dictionary. Our results are based on 4 different datasets, 12 different obfuscation strategies, and hate speech detection systems using different paradigms.
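To make the attack setting concrete, here is a minimal sketch of a targeted character-level obfuscation in Python. It is a hypothetical illustration only, not one of the paper's 12 strategies; the substitution table, function names, and example post are invented for exposition. The key point it demonstrates is the targeted setting: only the words a user would select are perturbed, while the rest of the post stays intact.

    # Hypothetical sketch of a targeted obfuscation attack.
    # Visually similar character substitutions (leetspeak-style);
    # this table is an assumption, not taken from the paper.
    SUBSTITUTIONS = {"a": "@", "e": "3", "i": "1", "o": "0", "s": "$"}

    def obfuscate_word(word):
        """Replace characters with visually similar symbols."""
        return "".join(SUBSTITUTIONS.get(c.lower(), c) for c in word)

    def targeted_attack(post, target_words):
        """Obfuscate only the user-selected words, leaving the rest intact."""
        return " ".join(
            obfuscate_word(w) if w.lower() in target_words else w
            for w in post.split()
        )

    # Only the selected word is perturbed:
    print(targeted_attack("you people are idiots", {"idiots"}))
    # -> "you people are 1d10t$"

In contrast, the two baselines the abstract contrasts against would correspond to passing every word as a target (all-words noise, overestimating impact) or a single word chosen without user input (untargeted noise, underestimating it).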
Anthology ID: 2022.wnut-1.25
Volume: Proceedings of the Eighth Workshop on Noisy User-generated Text (W-NUT 2022)
Month: October
Year: 2022
Address: Gyeongju, Republic of Korea
Venue: WNUT
Publisher: Association for Computational Linguistics
Pages: 230–242
URL: https://aclanthology.org/2022.wnut-1.25
Cite (ACL): Piush Aggarwal and Torsten Zesch. 2022. Analyzing the Real Vulnerability of Hate Speech Detection Systems against Targeted Intentional Noise. In Proceedings of the Eighth Workshop on Noisy User-generated Text (W-NUT 2022), pages 230–242, Gyeongju, Republic of Korea. Association for Computational Linguistics.
Cite (Informal): Analyzing the Real Vulnerability of Hate Speech Detection Systems against Targeted Intentional Noise (Aggarwal & Zesch, WNUT 2022)
PDF: https://aclanthology.org/2022.wnut-1.25.pdf