Pretraining Language Models Using Translationese

Meet Doshi, Raj Dabre, Pushpak Bhattacharyya


Abstract
In this paper, we explore the utility of Translationese, i.e., synthetic data created via machine translation, for pre-training language models (LMs) for low-resource languages (LRLs). Our simple methodology consists of translating large amounts of clean, web-crawled monolingual documents into the LRLs, followed by filtering the translated documents using tiny LMs trained on small but clean LRL data. Taking the case of Indian languages, we pre-train LMs from scratch with 28M and 85M parameters, and then fine-tune them on 5 downstream natural language understanding (NLU) and 4 generative (NLG) tasks. We observe that pre-training on filtered synthetic data leads to relative performance drops of only 0.87% for NLU and 2.35% for NLG compared to pre-training on clean data, and this gap diminishes further upon the inclusion of a small amount of clean data. We also study the impact of synthetic data filtering and of the choice of source language for synthetic data generation. Furthermore, evaluating continually pre-trained larger models such as Gemma-2B and Llama-3-8B in few-shot settings, we observe that using synthetic data is competitive with using clean data. Our findings suggest that synthetic data shows promise for bridging the pre-training gap between English and LRLs.
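The filtering step described in the abstract, scoring machine-translated documents with a tiny LM trained on clean LRL data and keeping only those that score well, can be illustrated with a minimal sketch. This is not the authors' implementation: the model path, the use of perplexity as the scoring criterion, and the threshold value are illustrative assumptions.

```python
# Hypothetical sketch of synthetic-data filtering: score each machine-translated
# document with a tiny LM trained on clean low-resource-language text and keep
# only low-perplexity documents. Model path and threshold are placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("path/to/tiny-clean-lrl-lm")
model = AutoModelForCausalLM.from_pretrained("path/to/tiny-clean-lrl-lm")
model.eval()

def perplexity(text: str) -> float:
    """Per-token perplexity of a document under the tiny clean-data LM."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()

def filter_documents(translated_docs, threshold: float = 200.0):
    """Keep translated documents whose perplexity falls below the threshold."""
    return [doc for doc in translated_docs if perplexity(doc) < threshold]
```

In this sketch, a lower perplexity under the clean-data LM is taken as a proxy for fluency in the target LRL; the actual paper should be consulted for the filtering criterion and thresholds used.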
Anthology ID:
2024.emnlp-main.334
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
5843–5862
URL:
https://aclanthology.org/2024.emnlp-main.334/
DOI:
10.18653/v1/2024.emnlp-main.334
Cite (ACL):
Meet Doshi, Raj Dabre, and Pushpak Bhattacharyya. 2024. Pretraining Language Models Using Translationese. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5843–5862, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Pretraining Language Models Using Translationese (Doshi et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.334.pdf