A Framework for Vision-Language Warm-up Tasks in Multimodal Dialogue Models

Jaewook Lee, Seongsik Park, Seong-Heum Park, Hongjin Kim, Harksoo Kim


Abstract
Most research on multimodal open-domain dialogue agents has focused on pretraining and multi-task learning using additional rich datasets beyond a given target dataset. However, methods for exploiting these additional datasets can be quite limited in real-world settings, creating a need for more efficient methods for constructing agents based solely on the target dataset. To address these issues, we present a new learning strategy called vision-language warm-up tasks for multimodal dialogue models (VLAW-MDM). This strategy does not require large pretraining or multi-task datasets but rather relies solely on learning from the target data. Moreover, our proposed approach automatically generates captions for images and incorporates them into the model's input to improve the contextualization of visual information. Using this novel approach, we empirically demonstrate that our learning strategy is effective for limited data and relatively small models. The results show that our method achieved comparable, and in some cases superior, performance relative to existing state-of-the-art models on various evaluation metrics.
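The caption-augmented input described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: `generate_caption` is a hypothetical stub standing in for an off-the-shelf image captioner, and the `[CAPTION]`/`[DIALOGUE]` template is an assumed input format, since the page does not specify the model's exact serialization.

```python
def generate_caption(image_path: str) -> str:
    # Hypothetical stub: in practice this would call an off-the-shelf
    # image captioning model rather than return a fixed string.
    return "a man holding a surfboard on the beach"


def build_model_input(image_path: str, dialogue_history: list[str]) -> str:
    # Prepend the automatically generated caption to the dialogue turns,
    # so the text encoder sees the visual content in textual form and can
    # contextualize it alongside the conversation.
    caption = generate_caption(image_path)
    turns = " ".join(dialogue_history)
    return f"[CAPTION] {caption} [DIALOGUE] {turns}"


history = [
    "Nice photo! Where was this taken?",
    "At the beach last summer.",
]
print(build_model_input("photo.jpg", history))
```

The key idea this sketch captures is that the caption is produced from the target data alone, so no external pretraining corpus is needed to inject visual grounding into the text input.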
Anthology ID:
2023.emnlp-main.167
Original:
2023.emnlp-main.167v1
Version 2:
2023.emnlp-main.167v2
Version 3:
2023.emnlp-main.167v3
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
2789–2799
URL:
https://aclanthology.org/2023.emnlp-main.167
Cite (ACL):
Jaewook Lee, Seongsik Park, Seong-Heum Park, Hongjin Kim, and Harksoo Kim. 2023. A Framework for Vision-Language Warm-up Tasks in Multimodal Dialogue Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2789–2799, Singapore. Association for Computational Linguistics.
Cite (Informal):
A Framework for Vision-Language Warm-up Tasks in Multimodal Dialogue Models (Lee et al., EMNLP 2023)
PDF:
https://aclanthology.org/2023.emnlp-main.167.pdf
Video:
https://aclanthology.org/2023.emnlp-main.167.mp4