Simul-MuST-C: Simultaneous Multilingual Speech Translation Corpus Using Large Language Model

Mana Makinae, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe


Abstract
Simultaneous Speech Translation (SiST) begins translating before the entire source input is received, making it crucial to balance quality and latency. In real interpreting situations, interpreters manage this simultaneity by breaking sentences into smaller segments and translating them while preserving the source order as much as possible. SiST could benefit from the same approach to balance quality and latency. However, current corpora used for simultaneous tasks often involve significant word reordering in translation, which is not ideal given that interpreters follow the source syntax as faithfully as possible. Inspired by human conference interpreters' use of the salami technique, we introduce Simul-MuST-C, a dataset created by leveraging a large language model (LLM), specifically GPT-4o, which aligns the target text as closely as possible to the source text by using minimal chunks that each contain enough information to be interpreted. Experiments on three language pairs show that the effectiveness of segment-based monotonicity in training data varies with the grammatical distance between the source and the target, with grammatically distant language pairs benefiting the most in achieving quality while minimizing latency.
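The sketch below illustrates, at a high level, how one might prompt an LLM to produce the kind of chunk-wise, source-order-preserving ("salami technique") translation the abstract describes. The prompt wording, the chunk delimiter, the choice of the OpenAI chat API, and the helper name are illustrative assumptions, not the paper's actual data-construction pipeline; concatenating the target chunks in source order would then give a monotonic reference translation.

```python
# Hypothetical sketch of salami-style monotonic translation with an LLM.
# Assumptions: OpenAI chat API, gpt-4o, a "|||" chunk delimiter, and the
# prompt wording below -- none of these are taken from the paper itself.
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

PROMPT_TEMPLATE = (
    "Split the source sentence into minimal chunks, each containing just "
    "enough information to be translated on its own, then translate each "
    "chunk into {target_lang} while keeping the chunks in the original "
    "source order. Return one line per chunk in the form "
    "'source chunk ||| target chunk'.\n\nSource: {source}"
)


def monotonic_translation(source: str, target_lang: str = "Japanese"):
    """Ask the LLM for chunk-aligned, source-order translations."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": PROMPT_TEMPLATE.format(target_lang=target_lang,
                                              source=source),
        }],
        temperature=0.0,
    )
    pairs = []
    for line in response.choices[0].message.content.splitlines():
        if "|||" in line:
            src_chunk, tgt_chunk = (p.strip() for p in line.split("|||", 1))
            pairs.append((src_chunk, tgt_chunk))
    return pairs


if __name__ == "__main__":
    for src, tgt in monotonic_translation(
            "I visited the museum that my friend recommended last week."):
        print(f"{src} -> {tgt}")
```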
Anthology ID:
2024.emnlp-main.1238
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
22185–22205
URL:
https://aclanthology.org/2024.emnlp-main.1238/
DOI:
10.18653/v1/2024.emnlp-main.1238
Cite (ACL):
Mana Makinae, Yusuke Sakai, Hidetaka Kamigaito, and Taro Watanabe. 2024. Simul-MuST-C: Simultaneous Multilingual Speech Translation Corpus Using Large Language Model. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 22185–22205, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Simul-MuST-C: Simultaneous Multilingual Speech Translation Corpus Using Large Language Model (Makinae et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.1238.pdf