Decompose and Compare Consistency: Measuring VLMs’ Answer Reliability via Task-Decomposition Consistency Comparison

Qian Yang, Weixiang Yan, Aishwarya Agrawal


Abstract
Despite tremendous advancements, current state-of-the-art Vision-Language Models (VLMs) are still far from perfect. They tend to hallucinate and may generate biased responses. In such circumstances, having a way to assess the reliability of a given response generated by a VLM is quite useful. Existing methods, such as estimating uncertainty using answer likelihoods or prompt-based confidence generation, often suffer from overconfidence. Other methods use self-consistency comparison but are affected by confirmation biases. To alleviate these, we propose Decompose and Compare Consistency (DeCC) for reliability measurement. By comparing the consistency between the direct answer generated using the VLM’s internal reasoning process, and the indirect answers obtained by decomposing the question into sub-questions and reasoning over the sub-answers produced by the VLM, DeCC measures the reliability of VLM’s direct answer. Experiments across six vision-language tasks with three VLMs show DeCC’s reliability estimation achieves better correlation with task accuracy compared to the existing methods.
Anthology ID:
2024.emnlp-main.211
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
3613–3627
Language:
URL:
https://aclanthology.org/2024.emnlp-main.211/
DOI:
10.18653/v1/2024.emnlp-main.211
Bibkey:
Cite (ACL):
Qian Yang, Weixiang Yan, and Aishwarya Agrawal. 2024. Decompose and Compare Consistency: Measuring VLMs’ Answer Reliability via Task-Decomposition Consistency Comparison. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 3613–3627, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Decompose and Compare Consistency: Measuring VLMs’ Answer Reliability via Task-Decomposition Consistency Comparison (Yang et al., EMNLP 2024)
Copy Citation:
PDF:
https://aclanthology.org/2024.emnlp-main.211.pdf