Towards One-to-Many Visual Question Answering

Huishan Ji, Qingyi Si, Zheng Lin, Yanan Cao, Weiping Wang


Abstract
Most existing Visual Question Answering (VQA) systems are constrained to domain-specific questions, i.e., different models are trained separately for different VQA tasks and therefore generalize poorly to others. For example, models trained on the reasoning-focused dataset GQA struggle to handle samples from the knowledge-emphasizing dataset OKVQA. Meanwhile, in real-world scenarios, restricting the domain of questions is unfriendly to users. Therefore, this paper proposes a necessary task: One-to-Many Visual Question Answering, whose ultimate goal is to enable a single model to answer questions from as many domains as possible through the effective integration of available VQA resources. To this end, we first investigate ten common VQA datasets and break the VQA task down into the integration of three key abilities. Then, considering that different questions rely on different VQA abilities, this paper proposes a novel dynamic Mixture of LoRAs (MoL) strategy. MoL dynamically mixes three individually trained LoRA adapters (one per VQA ability) for samples demanding different VQA abilities. Experiments verify that the proposed MoL strategy is highly effective, establishing new state-of-the-art results on four datasets. In addition, MoL generalizes well to three additional zero-shot datasets. Data and code will be released.
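To illustrate the general idea of dynamically mixing per-ability LoRA adapters, the sketch below shows one plausible implementation in PyTorch. It is a minimal sketch under assumptions: the router design, tensor shapes, and all names (e.g., MixtureOfLoRAs, lora_A, lora_B) are illustrative and are not taken from the authors' released code or the paper's actual architecture.

```python
import torch
import torch.nn as nn


class MixtureOfLoRAs(nn.Module):
    """Hypothetical sketch: per-sample dynamic mixture of three LoRA adapters.

    Each adapter is assumed to have been trained separately for one VQA
    ability; a lightweight router produces sample-specific mixing weights.
    """

    def __init__(self, hidden_dim: int, rank: int = 8, num_adapters: int = 3):
        super().__init__()
        # One low-rank (A, B) pair per ability-specific adapter.
        self.lora_A = nn.ParameterList(
            [nn.Parameter(torch.randn(hidden_dim, rank) * 0.01) for _ in range(num_adapters)]
        )
        self.lora_B = nn.ParameterList(
            [nn.Parameter(torch.zeros(rank, hidden_dim)) for _ in range(num_adapters)]
        )
        # Router maps a pooled input representation to per-adapter weights.
        self.router = nn.Linear(hidden_dim, num_adapters)

    def forward(self, base_out: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, hidden) input to the adapted layer.
        # base_out: output of the frozen base layer, same shape as x.
        weights = torch.softmax(self.router(x.mean(dim=1)), dim=-1)  # (batch, num_adapters)
        delta = torch.zeros_like(base_out)
        for i, (A, B) in enumerate(zip(self.lora_A, self.lora_B)):
            # Scale each adapter's low-rank update by its per-sample weight.
            delta = delta + weights[:, i, None, None] * (x @ A @ B)
        return base_out + delta
```

As a usage note, such a module would typically wrap selected linear layers of a frozen vision-language backbone, so that only the adapters and the router are updated while the base model stays fixed; again, this reflects common LoRA practice rather than the paper's specific setup.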
Anthology ID:
2024.findings-emnlp.985
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
16931–16943
URL:
https://aclanthology.org/2024.findings-emnlp.985/
DOI:
10.18653/v1/2024.findings-emnlp.985
Cite (ACL):
Huishan Ji, Qingyi Si, Zheng Lin, Yanan Cao, and Weiping Wang. 2024. Towards One-to-Many Visual Question Answering. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 16931–16943, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Towards One-to-Many Visual Question Answering (Ji et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-emnlp.985.pdf