Improving Language Coverage on HeLI-OTS

Tommi Jauhiainen, Krister Lindén


Abstract
In this paper, we add under-resourced languages to the language repertoire of an existing off-the-shelf language identifier, HeLI-OTS. Adding more languages to a language identifier often comes with the drawback of reduced accuracy for the languages already in the repertoire. We aim to minimize this effect. As sources for training and development data in the new languages, we use the OpenLID and FLORES-200 datasets. They are openly available high-quality datasets that are especially well-suited for language identifier development. By carefully inspecting the effect of each added language and the quality of its training and development data, we managed to add support for 20 new under-resourced languages to HeLI-OTS without noticeably affecting the performance of any existing language.
Anthology ID:
2024.sigul-1.15
Volume:
Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Maite Melero, Sakriani Sakti, Claudia Soria
Venues:
SIGUL | WS
Publisher:
ELRA and ICCL
Pages:
115–125
URL:
https://aclanthology.org/2024.sigul-1.15
Cite (ACL):
Tommi Jauhiainen and Krister Lindén. 2024. Improving Language Coverage on HeLI-OTS. In Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024, pages 115–125, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Improving Language Coverage on HeLI-OTS (Jauhiainen & Lindén, SIGUL-WS 2024)
PDF:
https://aclanthology.org/2024.sigul-1.15.pdf