IntCoOp: Interpretability-Aware Vision-Language Prompt Tuning

Soumya Suvra Ghosal, Samyadeep Basu, Soheil Feizi, Dinesh Manocha


Abstract
Image-text contrastive models such as CLIP learn transferable and robust representations for zero-shot transfer to a variety of downstream tasks. However, to obtain strong downstream performance, prompts need to be carefully curated, which can be a tedious engineering task. To address the issue of manual prompt engineering, prompt-tuning is used, in which a set of contextual vectors is learned by leveraging information from the training data. Despite their effectiveness, existing prompt-tuning frameworks often lack interpretability, thus limiting their ability to understand the compositional nature of images. In this work, we first identify that incorporating compositional attributes (e.g., a “green” tree frog) in the design of manual prompts can significantly enhance image-text alignment scores. Building upon this observation, we propose a novel and interpretable prompt-tuning method named IntCoOp, which learns to jointly align attribute-level inductive biases and class embeddings during prompt-tuning. To assess the effectiveness of our approach, we evaluate IntCoOp across two representative tasks in a few-shot learning setup: generalization to novel classes, and unseen domain shifts. Through extensive experiments across 10 downstream datasets on CLIP, we find that introducing attribute-level inductive biases leads to superior performance over state-of-the-art prompt-tuning frameworks. Notably, in a 16-shot setup, IntCoOp improves CoOp by 7.35% in average performance across 10 diverse datasets.
Anthology ID:
2024.emnlp-main.1092
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
19584–19601
URL:
https://aclanthology.org/2024.emnlp-main.1092/
DOI:
10.18653/v1/2024.emnlp-main.1092
Cite (ACL):
Soumya Suvra Ghosal, Samyadeep Basu, Soheil Feizi, and Dinesh Manocha. 2024. IntCoOp: Interpretability-Aware Vision-Language Prompt Tuning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 19584–19601, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
IntCoOp: Interpretability-Aware Vision-Language Prompt Tuning (Ghosal et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.1092.pdf