An Empirical Study of Multilingual Vocabulary for Neural Machine Translation Models

Kenji Imamura, Masao Utiyama


Abstract
In this paper, we discuss multilingual vocabularies for neural machine translation models. A multilingual vocabulary should yield highly accurate translations regardless of the language, and should preferably tokenize text so that out-of-vocabulary (OOV) tokens are rare and token sequences are short. We examine the characteristics of various multilingual vocabularies through tokenization and translation experiments, and we present our recommended vocabulary and tokenizer.
Anthology ID:
2024.wat-1.2
Volume:
Proceedings of the Eleventh Workshop on Asian Translation (WAT 2024)
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Toshiaki Nakazawa, Isao Goto
Venue:
WAT
Publisher:
Association for Computational Linguistics
Pages:
22–35
URL:
https://aclanthology.org/2024.wat-1.2/
DOI:
10.18653/v1/2024.wat-1.2
Cite (ACL):
Kenji Imamura and Masao Utiyama. 2024. An Empirical Study of Multilingual Vocabulary for Neural Machine Translation Models. In Proceedings of the Eleventh Workshop on Asian Translation (WAT 2024), pages 22–35, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
An Empirical Study of Multilingual Vocabulary for Neural Machine Translation Models (Imamura & Utiyama, WAT 2024)
PDF:
https://aclanthology.org/2024.wat-1.2.pdf
Supplementary material:
2024.wat-1.2.SupplementaryMaterial.zip
Supplementary material:
2024.wat-1.2.SupplementaryMaterial.txt