An Analysis of Massively Multilingual Neural Machine Translation for Low-Resource Languages

Aaron Mueller, Garrett Nicolai, Arya D. McCarthy, Dylan Lewis, Winston Wu, David Yarowsky


Abstract
In this work, we explore massively multilingual low-resource neural machine translation. Using translations of the Bible (which have parallel structure across languages), we train models with up to 1,107 source languages. We create various multilingual corpora, varying the number and relatedness of source languages. Using these, we investigate the best ways to use this many-way aligned resource for multilingual machine translation. Our experiments employ a grammatically and phylogenetically diverse set of source languages during testing for more representative evaluations. We find that best practices in this domain are highly language-specific: adding more languages to a training set is often better, but too many harms performance—the best number depends on the source language. Furthermore, training on related languages can improve or degrade performance, depending on the language. As there is no one-size-fits-most answer, we find that it is critical to tailor one’s approach to the source language and its typology.
Anthology ID:
2020.lrec-1.458
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
3710–3718
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.458
Cite (ACL):
Aaron Mueller, Garrett Nicolai, Arya D. McCarthy, Dylan Lewis, Winston Wu, and David Yarowsky. 2020. An Analysis of Massively Multilingual Neural Machine Translation for Low-Resource Languages. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 3710–3718, Marseille, France. European Language Resources Association.
Cite (Informal):
An Analysis of Massively Multilingual Neural Machine Translation for Low-Resource Languages (Mueller et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.458.pdf