MIRTT: Learning Multimodal Interaction Representations from Trilinear Transformers for Visual Question Answering

Junjie Wang, Yatai Ji, Jiaqi Sun, Yujiu Yang, Tetsuya Sakai


Abstract
In Visual Question Answering (VQA), existing bilinear methods focus on the interaction between images and questions. As a result, the candidate answers are either spliced into the questions or used only as classification labels. On the other hand, trilinear models such as CTI efficiently exploit the inter-modality information among answers, questions, and images, but ignore intra-modality information. Inspired by this observation, we propose a new trilinear interaction framework, MIRTT (Learning Multimodal Interaction Representations from Trilinear Transformers), which incorporates attention mechanisms to capture both inter-modality and intra-modality relationships. Moreover, we design a two-stage workflow in which a bilinear model reduces the free-form, open-ended VQA problem to a multiple-choice VQA problem. Furthermore, to obtain accurate and generic multimodal representations, we pre-train MIRTT with masked language prediction. Our method achieves state-of-the-art performance on the Visual7W Telling task and the VQA-1.0 Multiple Choice task, and outperforms bilinear baselines on the VQA-2.0, TDIUC, and GQA datasets.
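To make the trilinear idea concrete, the following is a minimal sketch (not the authors' implementation; see the iigroup/mirtt repository for the official code) of how a single transformer block could jointly attend over image, question, and candidate-answer features. Module names, dimensions, and the single-block design are illustrative assumptions; self-attention over the concatenated sequence is one way to capture both intra-modality and inter-modality relationships in one pass.

import torch
import torch.nn as nn

class TrilinearBlock(nn.Module):
    """Illustrative trilinear interaction block (assumption, not the paper's exact architecture)."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        # Joint self-attention over all three modalities at once.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, img, ques, ans):
        # img: (B, Nv, dim) region features; ques: (B, Nq, dim) question tokens;
        # ans: (B, Na, dim) candidate-answer tokens.
        x = torch.cat([img, ques, ans], dim=1)
        # Attention within and across modalities, followed by a feed-forward layer,
        # each with a residual connection and layer normalization.
        x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
        x = self.norm2(x + self.ffn(x))
        # Split back into the three modalities for downstream answer scoring.
        nv, nq = img.size(1), ques.size(1)
        return x[:, :nv], x[:, nv:nv + nq], x[:, nv + nq:]

In a multiple-choice setting, such a block would typically be run once per candidate answer, with a pooled representation fed to a scorer that ranks the candidates; the details of pooling and scoring here are likewise assumptions.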
Anthology ID:
2021.findings-emnlp.196
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2021
Month:
November
Year:
2021
Address:
Punta Cana, Dominican Republic
Editors:
Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue:
Findings
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
2280–2292
URL:
https://aclanthology.org/2021.findings-emnlp.196
DOI:
10.18653/v1/2021.findings-emnlp.196
Cite (ACL):
Junjie Wang, Yatai Ji, Jiaqi Sun, Yujiu Yang, and Tetsuya Sakai. 2021. MIRTT: Learning Multimodal Interaction Representations from Trilinear Transformers for Visual Question Answering. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2280–2292, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
MIRTT: Learning Multimodal Interaction Representations from Trilinear Transformers for Visual Question Answering (Wang et al., Findings 2021)
PDF:
https://aclanthology.org/2021.findings-emnlp.196.pdf
Software:
 2021.findings-emnlp.196.Software.zip
Video:
 https://aclanthology.org/2021.findings-emnlp.196.mp4
Code:
 iigroup/mirtt
Data:
 TDIUC, Visual Question Answering, Visual Question Answering v2.0, Visual7W