GPTs Don’t Keep Secrets: Searching for Backdoor Watermark Triggers in Autoregressive Language Models

Evan Lucas, Timothy Havens


Abstract
This work analyzes backdoor watermarks in an autoregressive transformer fine-tuned for a generative sequence-to-sequence task, specifically summarization. We propose and demonstrate an attack that identifies trigger words or phrases by analyzing open-ended generations from autoregressive models into which backdoor watermarks have been inserted. We show that triggers based on random common words are easier to identify than triggers based on single, rare tokens. The proposed attack is easy to implement and requires only access to the model weights. Code used to create the backdoor watermarked models and analyze their outputs is shared at [github link to be inserted for camera ready version].
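To make the attack concrete, here is a minimal sketch of the kind of analysis the abstract describes: sample open-ended generations from a suspect checkpoint, then rank words that appear far more often than a reference corpus would predict. The model path, sampling settings, and frequency heuristic below are illustrative assumptions, not the exact procedure from the paper.

# A minimal sketch of the trigger-search idea the abstract describes,
# assuming a HuggingFace-style causal LM checkpoint carrying a backdoor
# watermark. The checkpoint path, sampling settings, and frequency
# heuristic are illustrative assumptions, not the authors' exact method.
from collections import Counter

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "path/to/watermarked-model"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH)
model.eval()

def sample_open_ended(n_samples=200, max_new_tokens=64):
    """Draw unconditional (open-ended) generations from the model."""
    bos = tokenizer.bos_token or tokenizer.eos_token
    inputs = tokenizer(bos, return_tensors="pt")
    texts = []
    for _ in range(n_samples):
        with torch.no_grad():
            output = model.generate(
                **inputs,
                do_sample=True,
                top_p=0.95,
                max_new_tokens=max_new_tokens,
                pad_token_id=tokenizer.eos_token_id,
            )
        texts.append(tokenizer.decode(output[0], skip_special_tokens=True))
    return texts

def candidate_triggers(texts, reference_freq, top_k=20):
    """Rank words over-represented in the generations relative to a
    reference corpus; a backdoored model tends to leak its trigger into
    open-ended samples, so top-ranked words are trigger candidates."""
    counts = Counter(word for text in texts for word in text.split())
    total = sum(counts.values())
    scores = {
        word: (n / total) / reference_freq.get(word, 1e-9)
        for word, n in counts.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Usage (reference_freq maps word -> relative frequency in clean text):
# generations = sample_open_ended()
# print(candidate_triggers(generations, reference_freq))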
Anthology ID: 2023.trustnlp-1.21
Volume: Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023)
Month: July
Year: 2023
Address: Toronto, Canada
Editors: Anaelia Ovalle, Kai-Wei Chang, Ninareh Mehrabi, Yada Pruksachatkun, Aram Galstyan, Jwala Dhamala, Apurv Verma, Trista Cao, Anoop Kumar, Rahul Gupta
Venue: TrustNLP
Publisher: Association for Computational Linguistics
Pages: 242–248
URL: https://aclanthology.org/2023.trustnlp-1.21
DOI: 10.18653/v1/2023.trustnlp-1.21
Cite (ACL): Evan Lucas and Timothy Havens. 2023. GPTs Don’t Keep Secrets: Searching for Backdoor Watermark Triggers in Autoregressive Language Models. In Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023), pages 242–248, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal): GPTs Don’t Keep Secrets: Searching for Backdoor Watermark Triggers in Autoregressive Language Models (Lucas & Havens, TrustNLP 2023)
PDF: https://aclanthology.org/2023.trustnlp-1.21.pdf
Video: https://aclanthology.org/2023.trustnlp-1.21.mp4