Multi-Level Information Retrieval Augmented Generation for Knowledge-based Visual Question Answering

Omar Adjali, Olivier Ferret, Sahar Ghannay, Hervé Le Borgne


Abstract
The Knowledge-Aware Visual Question Answering about Entity task aims to disambiguate entities using textual and visual information, as well as knowledge. It usually relies on two independent steps, information retrieval followed by reading comprehension, that do not benefit each other. Retrieval Augmented Generation (RAG) offers a solution by using generated answers as feedback for retrieval training. However, RAG usually relies solely on pseudo-relevant passages retrieved from external knowledge bases, which can lead to ineffective answer generation. In this work, we propose a multi-level information RAG approach that enhances answer generation through entity retrieval and query expansion. We formulate a joint-training RAG loss such that answer generation is conditioned on both entity and passage retrievals. Our experiments show new state-of-the-art performance on the VIQuAE KB-VQA benchmark and demonstrate that our approach helps retrieve more truly relevant knowledge to generate accurate answers.
Anthology ID:
2024.emnlp-main.922
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
16499–16513
URL:
https://aclanthology.org/2024.emnlp-main.922/
DOI:
10.18653/v1/2024.emnlp-main.922
Cite (ACL):
Omar Adjali, Olivier Ferret, Sahar Ghannay, and Hervé Le Borgne. 2024. Multi-Level Information Retrieval Augmented Generation for Knowledge-based Visual Question Answering. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 16499–16513, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Multi-Level Information Retrieval Augmented Generation for Knowledge-based Visual Question Answering (Adjali et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.922.pdf