Can CLIP Count Stars? An Empirical Study on Quantity Bias in CLIP

Zeliang Zhang, Zhuo Liu, Mingqian Feng, Chenliang Xu


Abstract
CLIP has demonstrated great versatility in adapting to various downstream tasks, such as image editing and generation, visual question answering, and video understanding. However, CLIP-based applications often suffer from misunderstandings regarding user intent, leading to discrepancies between the required number of objects and the actual outputs in image generation tasks. In this work, we empirically investigate the quantity bias in CLIP. By carefully designing different experimental settings and datasets, we comprehensively evaluate CLIP’s understanding of quantity from text, image, and cross-modal perspectives. Our experimental results reveal a quantity bias in CLIP embeddings, impacting the reliability of downstream tasks.
Anthology ID:
2024.findings-emnlp.59
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
1081–1086
URL:
https://aclanthology.org/2024.findings-emnlp.59
DOI:
10.18653/v1/2024.findings-emnlp.59
Cite (ACL):
Zeliang Zhang, Zhuo Liu, Mingqian Feng, and Chenliang Xu. 2024. Can CLIP Count Stars? An Empirical Study on Quantity Bias in CLIP. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 1081–1086, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Can CLIP Count Stars? An Empirical Study on Quantity Bias in CLIP (Zhang et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-emnlp.59.pdf