VHASR: A Multimodal Speech Recognition System With Vision Hotwords

Jiliang Hu, Zuchao Li, Ping Wang, Haojun Ai, Lefei Zhang, Hai Zhao


Abstract
Image-based multimodal automatic speech recognition (ASR) models enhance speech recognition performance by incorporating audio-related images. However, some works suggest that introducing image information to the model does not help improve ASR performance. In this paper, we propose a novel approach that effectively utilizes audio-related image information and build VHASR, a multimodal speech recognition system that uses vision as hotwords to strengthen the model’s speech recognition capability. Our system adopts a dual-stream architecture, which first transcribes the text on the two streams separately and then combines the outputs. We evaluate the proposed model on four datasets: Flickr8k, ADE20k, COCO, and OpenImages. The experimental results show that VHASR can effectively utilize key information in images to enhance the model’s speech recognition ability. Its performance not only surpasses that of unimodal ASR, but also achieves SOTA among existing image-based multimodal ASR models.
Anthology ID:
2024.emnlp-main.821
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
14791–14804
URL:
https://aclanthology.org/2024.emnlp-main.821/
DOI:
10.18653/v1/2024.emnlp-main.821
Cite (ACL):
Jiliang Hu, Zuchao Li, Ping Wang, Haojun Ai, Lefei Zhang, and Hai Zhao. 2024. VHASR: A Multimodal Speech Recognition System With Vision Hotwords. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 14791–14804, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
VHASR: A Multimodal Speech Recognition System With Vision Hotwords (Hu et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.821.pdf