Learning Musical Representations for Music Performance Question Answering

Xingjian Diao, Chunhui Zhang, Tingxuan Wu, Ming Cheng, Zhongyu Ouyang, Weiyi Wu, Jiang Gui


Abstract
Music performances are representative scenarios for audio-visual modeling. Unlike common scenarios with sparse audio, music performances continuously involve dense audio signals throughout. While existing multimodal learning methods on the audio-video QA demonstrate impressive capabilities on general scenarios, they are incapable of dealing with fundamental problems within the music performances: they underexplore the interaction between the multimodal signals in performance, and fail to consider the distinctive characteristics of instruments and music. Therefore, existing methods tend to inaccurately answer questions regarding musical performances. To bridge the above research gaps, first, given the intricate multimodal interconnectivity inherent to music data, our primary backbone is designed to incorporate multimodal interactions within the context of music; second, to enable the model to learn music characteristics, we annotate and release rhythmic and music sources in the current music datasets; third, for time-aware audio-visual modelling, we align the model’s music predictions with the temporal dimension. Our experiments show state-of-the-art effects on the Music AVQA datasets. Our code is available at: https://github.com/xid32/Amuse.
Anthology ID:
2024.findings-emnlp.159
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
2803–2813
Language:
URL:
https://aclanthology.org/2024.findings-emnlp.159/
DOI:
10.18653/v1/2024.findings-emnlp.159
Bibkey:
Cite (ACL):
Xingjian Diao, Chunhui Zhang, Tingxuan Wu, Ming Cheng, Zhongyu Ouyang, Weiyi Wu, and Jiang Gui. 2024. Learning Musical Representations for Music Performance Question Answering. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 2803–2813, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Learning Musical Representations for Music Performance Question Answering (Diao et al., Findings 2024)
Copy Citation:
PDF:
https://aclanthology.org/2024.findings-emnlp.159.pdf