Smart Lexical Search for Label Flipping Adversial Attack

Alberto Gutiérrez-Megías, Salud María Jiménez-Zafra, L. Alfonso Ureña, Eugenio Martínez-Cámara


Abstract
Language models are vulnerable to adversarial attacks, which manipulate input data to disrupt their performance; as such, these attacks constitute a cybersecurity threat. The manipulations are intended to be unidentifiable by the learning model and by humans, yet small changes can disturb the final label of a classification task. Hence, we propose a novel attack built upon explainability methods that identifies the salient lexical units to alter in order to flip the classification label. We assess our proposal on a disinformation dataset, and we show that our attack reaches a strong balance between stealthiness and efficiency.
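The abstract describes using explainability to locate the lexical units whose alteration is most likely to flip the predicted label. As a rough illustration only (not the authors' actual method), the sketch below implements one common saliency heuristic, occlusion: score each word by how much removing it lowers the probability of the predicted class, then perturb the top-ranked word. The toy corpus, the substitute token, and the classifier choice are all hypothetical stand-ins.

```python
# Hypothetical sketch of occlusion-based word saliency for a label-flipping
# attack. This is NOT the paper's method; it only illustrates the general
# idea of explainability-guided lexical perturbation on a toy classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented corpus standing in for a disinformation dataset.
texts = [
    "the report is accurate and verified",
    "the claim is false and fabricated",
    "verified sources confirm the accurate story",
    "fabricated story with false claims",
]
labels = [0, 1, 0, 1]  # 0 = reliable, 1 = disinformation

clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(texts, labels)

def word_saliency(text, label):
    """Score each word by the drop in P(label) when that word is occluded."""
    base = clf.predict_proba([text])[0][label]
    words = text.split()
    scores = []
    for i in range(len(words)):
        occluded = " ".join(words[:i] + words[i + 1:])
        scores.append((words[i], base - clf.predict_proba([occluded])[0][label]))
    return scores

def flip_attack(text, substitute="good"):
    """Replace the single most salient word, keeping the edit minimal."""
    label = clf.predict([text])[0]
    ranked = sorted(word_saliency(text, label), key=lambda p: -p[1])
    target = ranked[0][0]
    return " ".join(substitute if w == target else w for w in text.split())

adv = flip_attack("the claim is false and fabricated")
```

Ranking words by occluded-probability drop keeps the perturbation to a single token, which is how such attacks aim for stealthiness; a real attack would also constrain the substitute to preserve fluency and semantics.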
Anthology ID:
2024.privatenlp-1.11
Volume:
Proceedings of the Fifth Workshop on Privacy in Natural Language Processing
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Ivan Habernal, Sepideh Ghanavati, Abhilasha Ravichander, Vijayanta Jain, Patricia Thaine, Timour Igamberdiev, Niloofar Mireshghallah, Oluwaseyi Feyisetan
Venues:
PrivateNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
97–106
URL:
https://aclanthology.org/2024.privatenlp-1.11
Cite (ACL):
Alberto Gutiérrez-Megías, Salud María Jiménez-Zafra, L. Alfonso Ureña, and Eugenio Martínez-Cámara. 2024. Smart Lexical Search for Label Flipping Adversial Attack. In Proceedings of the Fifth Workshop on Privacy in Natural Language Processing, pages 97–106, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
Smart Lexical Search for Label Flipping Adversial Attack (Gutiérrez-Megías et al., PrivateNLP-WS 2024)
PDF:
https://aclanthology.org/2024.privatenlp-1.11.pdf