Comparison of data selection techniques for the translation of video lectures

Joern Wuebker, Hermann Ney, Adrià Martínez-Villaronga, Adrià Giménez, Alfons Juan, Christophe Servan, Marc Dymetman, Shachar Mirkin


Abstract
For the online translation of scientific video lectures, very large models are impractical. To obtain smaller and more efficient models, we perform data selection. In this paper, we present a qualitative and quantitative comparison of several data selection techniques based on cross-entropy and infrequent n-gram criteria. In terms of BLEU, a combination of translation model and language model cross-entropy achieves the most stable results. We identify the number of out-of-vocabulary words as another important criterion for measuring translation quality in our application; here, infrequent n-gram recovery shows superior performance. Finally, we combine the two selection techniques in order to benefit from the strengths of both.
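
As a rough illustration of cross-entropy-based selection in general, the sketch below ranks sentences from a general pool by the difference between their cross-entropy under an in-domain language model and under a general-domain one, keeping the lowest-scoring sentences. This is not the paper's implementation: the add-one-smoothed unigram models, the function names, and the toy data are all assumptions made for the example, and the translation model component mentioned in the abstract is not modeled here.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Count whitespace-separated tokens; returns (counts, total token count)."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return counts, sum(counts.values())


def cross_entropy(tokens, model, vocab_size):
    """Per-token cross-entropy (bits) under an add-one-smoothed unigram model."""
    counts, total = model
    log_prob = sum(
        math.log2((counts[t] + 1) / (total + vocab_size)) for t in tokens
    )
    return -log_prob / len(tokens)


def select_by_cross_entropy_difference(pool, in_domain, general, k):
    """Return the k pool sentences that look most in-domain.

    Score = H_in(s) - H_general(s); lower means closer to the in-domain data.
    Empty sentences are skipped. Illustrative sketch only (hypothetical helper).
    """
    in_lm = train_unigram(in_domain)
    gen_lm = train_unigram(general)
    # Shared vocabulary so both models smooth over the same event space.
    vocab = set(in_lm[0]) | set(gen_lm[0])
    vocab |= {t for s in pool for t in s.split()}
    v = len(vocab)

    def score(sentence):
        toks = sentence.split()
        return cross_entropy(toks, in_lm, v) - cross_entropy(toks, gen_lm, v)

    candidates = [s for s in pool if s.split()]
    return sorted(candidates, key=score)[:k]


# Toy data, invented for this example.
in_domain = ["the lecture covers machine translation",
             "we train a translation model"]
general = ["the stock market fell today",
           "the weather is sunny today"]
pool = ["translation model for the lecture",
        "sunny weather and stock market"]

selected = select_by_cross_entropy_difference(pool, in_domain, general, k=1)
print(selected)
```

In practice the unigram models would be replaced by properly smoothed higher-order n-gram models, and the cutoff `k` would be tuned against a development set.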
Anthology ID:
2014.amta-researchers.15
Volume:
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Researchers Track
Month:
October 22-26
Year:
2014
Address:
Vancouver, Canada
Editors:
Yaser Al-Onaizan, Michel Simard
Venue:
AMTA
Publisher:
Association for Machine Translation in the Americas
Pages:
193–207
URL:
https://aclanthology.org/2014.amta-researchers.15
Cite (ACL):
Joern Wuebker, Hermann Ney, Adrià Martínez-Villaronga, Adrià Giménez, Alfons Juan, Christophe Servan, Marc Dymetman, and Shachar Mirkin. 2014. Comparison of data selection techniques for the translation of video lectures. In Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Researchers Track, pages 193–207, Vancouver, Canada. Association for Machine Translation in the Americas.
Cite (Informal):
Comparison of data selection techniques for the translation of video lectures (Wuebker et al., AMTA 2014)
PDF:
https://aclanthology.org/2014.amta-researchers.15.pdf