Transformer-Based Language Models for Bulgarian

Iva Marinova, Kiril Simov, Petya Osenova


Abstract
This paper presents an approach for training lightweight and robust language models for Bulgarian that mitigate gender, political, racial, and other biases in the data. Our method involves scraping content from major Bulgarian online media providers using a specialized procedure for source filtering, topic selection, and lexicon-based removal of inappropriate language during the pre-training phase. We continuously improve the models by incorporating new data from various domains, including social media, books, scientific literature, and linguistically modified corpora. Our motivation is to provide a solution that is sufficient for all natural language processing tasks in Bulgarian, and to address the lack of existing procedures for guaranteeing the robustness of such models.
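The lexicon-based removal of inappropriate language mentioned in the abstract can be pictured as a simple token-level filter applied to candidate documents before pre-training. The sketch below is a minimal illustration, not the authors' actual pipeline: the lexicon file, function names, and the zero-hit threshold are assumptions introduced here for clarity.

```python
# Hypothetical sketch of a lexicon-based pre-training filter. Assumes a
# plain-text lexicon of inappropriate Bulgarian terms, one entry per line;
# the path, names, and threshold are illustrative, not from the paper.
import re
from pathlib import Path

TOKEN_RE = re.compile(r"\w+", re.UNICODE)  # \w matches Cyrillic letters too

def load_lexicon(path: str) -> set[str]:
    """Load lowercase lexicon entries, skipping blank lines."""
    return {
        line.strip().lower()
        for line in Path(path).read_text(encoding="utf-8").splitlines()
        if line.strip()
    }

def is_clean(document: str, lexicon: set[str], max_hits: int = 0) -> bool:
    """Keep a document only if it contains at most `max_hits` lexicon tokens."""
    hits = sum(1 for tok in TOKEN_RE.findall(document.lower()) if tok in lexicon)
    return hits <= max_hits

def filter_corpus(documents: list[str], lexicon: set[str]) -> list[str]:
    """Drop flagged documents before they enter the pre-training corpus."""
    return [doc for doc in documents if is_clean(doc, lexicon)]
```

In practice such a filter would sit alongside the source-filtering and topic-selection steps the paper describes, discarding scraped pages that match the lexicon before tokenization.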
Anthology ID: 2023.ranlp-1.77
Volume: Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing
Month: September
Year: 2023
Address: Varna, Bulgaria
Editors: Ruslan Mitkov, Galia Angelova
Venue: RANLP
Publisher: INCOMA Ltd., Shoumen, Bulgaria
Pages: 712–720
URL: https://aclanthology.org/2023.ranlp-1.77
Cite (ACL): Iva Marinova, Kiril Simov, and Petya Osenova. 2023. Transformer-Based Language Models for Bulgarian. In Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing, pages 712–720, Varna, Bulgaria. INCOMA Ltd., Shoumen, Bulgaria.
Cite (Informal): Transformer-Based Language Models for Bulgarian (Marinova et al., RANLP 2023)
PDF: https://aclanthology.org/2023.ranlp-1.77.pdf