Mulan: A Multi-Level Alignment Model for Video Question Answering

Yu Fu, Cong Cao, Yuling Yang, Yuhai Lu, Fangfang Yuan, Dakui Wang, Yanbing Liu


Abstract
Video Question Answering (VideoQA) aims to answer questions about the visual content of a video. Current methods mainly focus on improving joint representations of video and text. However, these methods pay little attention to the fine-grained semantic interaction between video and text. In this paper, we propose Mulan: a Multi-Level Alignment Model for Video Question Answering, which establishes alignment between visual and textual modalities at the object-level, frame-level, and video-level. Specifically, for object-level alignment, we propose a mask-guided visual feature encoding method and a visual-guided text description method to learn fine-grained spatial information. For frame-level alignment, we introduce the use of visual features from individual frames, combined with a caption generator, to learn overall spatial information within the scene. For video-level alignment, we propose an expandable ordinal prompt for textual descriptions, combined with visual features, to learn temporal information. Experimental results show that our method outperforms the state-of-the-art methods, even when utilizing the smallest amount of extra visual-language pre-training data and a reduced number of trainable parameters.
Anthology ID:
2023.findings-emnlp.364
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2023
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
Findings
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
5475–5489
Language:
URL:
https://aclanthology.org/2023.findings-emnlp.364
DOI:
10.18653/v1/2023.findings-emnlp.364
Bibkey:
Cite (ACL):
Yu Fu, Cong Cao, Yuling Yang, Yuhai Lu, Fangfang Yuan, Dakui Wang, and Yanbing Liu. 2023. Mulan: A Multi-Level Alignment Model for Video Question Answering. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 5475–5489, Singapore. Association for Computational Linguistics.
Cite (Informal):
Mulan: A Multi-Level Alignment Model for Video Question Answering (Fu et al., Findings 2023)
Copy Citation:
PDF:
https://aclanthology.org/2023.findings-emnlp.364.pdf