Teodora Vuković

Also published as: Teodora Vukovic


2025

NLP for preserving Torlak, a vulnerable low-resource Slavic language
Li Tang | Teodora Vuković
Proceedings of the 31st International Conference on Computational Linguistics

Torlak is an endangered, low-resource Slavic language with a high degree of areal and inter-speaker variation. In previous work, interviews were conducted with Torlak speakers in Serbia, near the Bulgarian border, and the transcripts were annotated with lemmas and morphosyntactic descriptions at the token level. Such token-level annotations facilitate cross-language comparison in the context of the Balkan Sprachbund, where multiple languages, including Serbian and Bulgarian, influenced Torlak over time. Here, we aim to improve the prediction of morphosyntactic annotations for this low-resource language by fine-tuning large language models and comparing several predictive models. We also further fine-tuned the large language models to score the degree of ‘Torlakness’ of a sentence by labeling likely Torlak tokens, in order to facilitate the documentation of additional transcribed Torlak speech with a high degree of Torlak-style non-standard features compared to standard Serbian. Taken together, we hope that these contributions will help to document this endangered language and improve digital access for its speakers.

2024

Comparative Analysis of Modality Fusion Approaches for Audio-Visual Person Identification and Verification
Aref Farhadipour | Masoumeh Chapariniya | Teodora Vukovic | Volker Dellwo
Proceedings of the 7th International Conference on Natural Language and Speech Processing (ICNLSP 2024)

2019

Corpora and Processing Tools for Non-standard Contemporary and Diachronic Balkan Slavic
Teodora Vukovic | Nora Muheim | Olivier Winistörfer | Ivan Šimko | Anastasia Makarova | Sanja Bradjan
Proceedings of the Student Research Workshop Associated with RANLP 2019

The paper describes three corpora of different varieties of Balkan Slavic that are currently being developed with the goal of providing data for the analysis of diatopic and diachronic variation in non-standard Balkan Slavic. The corpora include spoken materials from Torlak and Macedonian dialects, as well as manuscripts of pre-standardized Bulgarian. Apart from the texts, tools for PoS annotation and lemmatization are being created for all varieties, as well as syntactic parsing for the Torlak and Bulgarian varieties. The corpora are built using a unified methodology, relying on best practices and state-of-the-art methods from the field. The uniform methodology allows contrastive analysis of the data from different varieties. The corpora under construction can be considered a crucial contribution to linguistic research on the languages of the Balkans, as they provide the data that has been lacking for studies of linguistic variation in Balkan Slavic and enable the comparison of these varieties with other neighbouring languages.