Interpretable Composition Attribution Enhancement for Visio-linguistic Compositional Understanding

Wei Li, Zhen Huang, Xinmei Tian, Le Lu, Houqiang Li, Xu Shen, Jieping Ye


Abstract
Contrastively trained vision-language models such as CLIP have achieved remarkable progress in vision and language representation learning. Despite this progress, their proficiency in compositional reasoning over attributes and relations (e.g., distinguishing between “the car is underneath the person” and “the person is underneath the car”) remains notably inadequate. We trace this deficient behavior to a composition attribution issue, in which the attribution scores (e.g., attention scores or GradCAM scores) for relations (e.g., underneath) or attributes (e.g., red) in the text are substantially lower than those for object terms. In this work, we show that this issue can be mitigated via a novel framework called CAE (Composition Attribution Enhancement). This generic framework incorporates various interpretable attribution methods to encourage the model to pay greater attention to composition words denoting relationships and attributes within the text. Detailed analysis shows that our approach enables models to adjust and rectify their attribution over the text. Extensive experiments across seven benchmarks reveal that our framework significantly enhances the ability to discern intricate details and construct more sophisticated interpretations of combined visual and linguistic elements.
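To make the core idea concrete, below is a minimal sketch (not the authors' released code) of the kind of objective the abstract describes: given per-token attribution scores from a text encoder (e.g., attention or GradCAM scores) and masks marking composition words versus object words, penalize captions where composition words receive substantially lower attribution than object terms. The function name, margin hyperparameter, and token masks here are illustrative assumptions, not details taken from the paper.

```python
import torch

def composition_attribution_loss(
    attributions: torch.Tensor,  # (batch, seq_len) per-token attribution scores
    comp_mask: torch.Tensor,     # (batch, seq_len) 1 for relation/attribute tokens
    obj_mask: torch.Tensor,      # (batch, seq_len) 1 for object tokens
    margin: float = 0.1,         # assumed margin hyperparameter
) -> torch.Tensor:
    """Hinge-style penalty when composition tokens are attributed less than
    object tokens by more than `margin` (a sketch of the idea, not the
    paper's exact objective)."""
    eps = 1e-8
    # Mean attribution over composition tokens and over object tokens.
    comp_score = (attributions * comp_mask).sum(-1) / (comp_mask.sum(-1) + eps)
    obj_score = (attributions * obj_mask).sum(-1) / (obj_mask.sum(-1) + eps)
    # Encourage comp_score to come within `margin` of obj_score.
    return torch.clamp(obj_score - comp_score + margin, min=0.0).mean()

# Toy usage: "the car is underneath the person", with "underneath" as the
# composition token and "car"/"person" as object tokens.
attr = torch.rand(1, 7, requires_grad=True)
comp = torch.tensor([[0, 0, 0, 1, 0, 0, 0]], dtype=torch.float)
obj = torch.tensor([[0, 1, 0, 0, 0, 0, 1]], dtype=torch.float)
loss = composition_attribution_loss(attr, comp, obj)
loss.backward()  # gradients push attribution toward the composition token
```

In practice such a term would be added to the standard contrastive loss during fine-tuning, so the model learns to keep attributing objects correctly while raising the attribution of relation and attribute words.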
Anthology ID:
2024.emnlp-main.810
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
14616–14632
URL:
https://aclanthology.org/2024.emnlp-main.810/
DOI:
10.18653/v1/2024.emnlp-main.810
Cite (ACL):
Wei Li, Zhen Huang, Xinmei Tian, Le Lu, Houqiang Li, Xu Shen, and Jieping Ye. 2024. Interpretable Composition Attribution Enhancement for Visio-linguistic Compositional Understanding. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 14616–14632, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Interpretable Composition Attribution Enhancement for Visio-linguistic Compositional Understanding (Li et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.810.pdf