UOR: Universal Backdoor Attacks on Pre-trained Language Models

Wei Du, Peixuan Li, Haodong Zhao, Tianjie Ju, Ge Ren, Gongshen Liu


Abstract
Task-agnostic and transferable backdoors implanted in pre-trained language models (PLMs) pose a severe security threat because they can be inherited by any downstream task. However, existing methods rely on manual selection of triggers and backdoor representations, which limits their effectiveness and universality across different PLMs and usage paradigms. In this paper, we propose a new backdoor attack method called UOR, which overcomes these limitations by turning manual selection into automatic optimization. Specifically, we design poisoned supervised contrastive learning, which automatically learns more uniform and universal backdoor representations. These representations cover the output space more evenly and therefore hit more labels in downstream tasks after fine-tuning. Furthermore, we use gradient search to select appropriate trigger words that can be adapted to different PLMs and vocabularies. Experiments show that UOR achieves better attack performance than manual methods on various text classification tasks. Moreover, we test on PLMs with different architectures, on different usage paradigms, and on more challenging tasks, achieving higher scores for universality.
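For readers unfamiliar with the objective the abstract builds on, the following is a minimal sketch of a generic supervised contrastive loss, not the authors' implementation. It assumes, purely for illustration, that each trigger is assigned its own pseudo-label and that clean inputs share one label; the function names, the temperature value, and the toy embeddings are all hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def sup_con_loss(embeddings, labels, temperature=0.5):
    """Generic supervised contrastive loss over a small batch.

    For each anchor, samples with the same label are positives and all
    other samples in the batch form the contrast set. Anchors without
    any positive are skipped.
    """
    z = [normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        # Temperature-scaled cosine similarity to every other sample.
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        log_denom = math.log(sum(math.exp(s) for s in sims.values()))
        # Average negative log-probability of the positives.
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Hypothetical toy batch: label 0 = clean inputs, label 1 = one trigger.
toy_embeddings = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]]
well_separated = sup_con_loss(toy_embeddings, [0, 0, 1, 1])
mixed_up = sup_con_loss(toy_embeddings, [0, 1, 0, 1])
print(well_separated < mixed_up)
```

Minimizing a loss of this kind pulls same-label representations together and pushes different-label ones apart. That is the general mechanism by which representations associated with different triggers could be spread across the output space.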
Anthology ID:
2024.findings-acl.468
Volume:
Findings of the Association for Computational Linguistics: ACL 2024
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
7865–7877
URL:
https://aclanthology.org/2024.findings-acl.468
DOI:
10.18653/v1/2024.findings-acl.468
Cite (ACL):
Wei Du, Peixuan Li, Haodong Zhao, Tianjie Ju, Ge Ren, and Gongshen Liu. 2024. UOR: Universal Backdoor Attacks on Pre-trained Language Models. In Findings of the Association for Computational Linguistics: ACL 2024, pages 7865–7877, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
UOR: Universal Backdoor Attacks on Pre-trained Language Models (Du et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-acl.468.pdf