2024
UniDive: A COST Action on Universality, Diversity and Idiosyncrasy in Language Technology
Agata Savary | Daniel Zeman | Verginica Barbu Mititelu | Anabela Barreiro | Olesea Caftanatov | Marie-Catherine de Marneffe | Kaja Dobrovoljc | Gülşen Eryiğit | Voula Giouli | Bruno Guillaume | Stella Markantonatou | Nurit Melnik | Joakim Nivre | Atul Kr. Ojha | Carlos Ramisch | Abigail Walsh | Beata Wójtowicz | Alina Wróblewska
Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024
This paper presents the objectives, organization and activities of the UniDive COST Action, a scientific network dedicated to universality, diversity and idiosyncrasy in language technology. We describe the objectives and organization of this initiative, the people involved, the working groups and the ongoing tasks and activities. This paper is also an open call for participation addressed to new members and countries.
Gos 2: A New Reference Corpus of Spoken Slovenian
Darinka Verdonik | Kaja Dobrovoljc | Tomaž Erjavec | Nikola Ljubešić
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
This paper introduces a new version of the Gos reference corpus of spoken Slovenian, which was recently extended to more than double the original size (300 hours, 2.4 million words) by adding speech recordings and transcriptions from two related initiatives, the Gos VideoLectures corpus of public academic speech, and the Artur speech recognition database. We describe this process by first presenting the criteria guiding the balanced selection of the newly added data and the challenges encountered when merging language resources with divergent designs, followed by the presentation of other major enhancements of the new Gos corpus, such as improvements in lemmatization and morphosyntactic annotation, word-level speech alignment, a new XML schema and the development of a specialized online concordancer.
SUK 1.0: A New Training Corpus for Linguistic Annotation of Modern Standard Slovene
Špela Arhar Holdt | Jaka Čibej | Kaja Dobrovoljc | Tomaž Erjavec | Polona Gantar | Simon Krek | Tina Munda | Nejc Robida | Luka Terčon | Slavko Zitnik
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
This paper introduces the upgrade of a training corpus for linguistic annotation of modern standard Slovene. The enhancement spans both the size of the corpus and the depth of annotation layers. The revised SUK 1.0 corpus, building on its predecessor ssj500k 2.3, has doubled in size, containing over a million tokens. This expansion integrates three preexisting open-access datasets, all of which have undergone automatic tagging and meticulous manual review across multiple annotation layers, each represented in varying proportions. These layers span tokenization, segmentation, lemmatization, MULTEXT-East morphology, Universal Dependencies, JOS-SYN syntax, semantic role labeling, named entity recognition, and the newly incorporated coreferences. The paper illustrates the annotation processes for each layer while also presenting the results of the new CLASSLA-Stanza annotation tool, trained on the SUK corpus data. As one of the fundamental language resources of modern Slovene, the SUK corpus calls for constant development, as outlined in the concluding section.
2022
Spoken Language Treebanks in Universal Dependencies: an Overview
Kaja Dobrovoljc
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Given the benefits of syntactically annotated collections of transcribed speech in spoken language research and applications, many spoken language treebanks have been developed in the last decades, with divergent annotation schemes posing important limitations to cross-resource explorations, such as comparing data across languages, grammatical frameworks, and language domains. As a consequence, there has been a growing number of spoken language treebanks adopting the Universal Dependencies (UD) annotation scheme, aimed at cross-linguistically consistent morphosyntactic annotation. In view of the non-central role of spoken language data within the scheme and with little in-domain consolidation to date, this paper presents a comparative overview of spoken language treebanks in UD to support cross-treebank data explorations on the one hand, and encourage further treebank harmonization on the other. Our results show that the spoken language treebanks differ considerably with respect to the inventory and the format of transcribed phenomena, as well as the principles adopted in their morphosyntactic annotation. This is particularly true for the dependency annotation of speech disfluencies, where conflicting data annotations suggest an underspecification of the guidelines pertaining to speech repairs in general and the reparandum dependency relation in particular.
Extending the SSJ Universal Dependencies Treebank for Slovenian: Was It Worth It?
Kaja Dobrovoljc | Nikola Ljubešić
Proceedings of the 16th Linguistic Annotation Workshop (LAW-XVI) within LREC2022
This paper presents the creation and evaluation of a new version of the reference SSJ Universal Dependencies Treebank for Slovenian, which has been substantially improved and extended to almost double the original size. The process was based on the initial revision and documentation of the language-specific UD annotation guidelines for Slovenian and the corresponding modification of the original SSJ annotations, followed by a two-stage annotation campaign, in which two new subsets were added: the previously unreleased sentences from the ssj500k corpus and the Slovenian subset of the ELEXIS parallel corpus. The annotation campaign resulted in an extended version of the SSJ UD treebank with 5,435 newly added sentences comprising 126,427 tokens. To evaluate the potential benefits of this data increase for Slovenian dependency parsing, we compared the performance of the classla-stanza dependency parser trained on the old and the new SSJ data when evaluated on the new SSJ test set and its subsets. Our results show an increase of LAS performance in general, especially for previously under-represented syntactic phenomena, such as lists, elliptical constructions and appositions, but also confirm the distinct nature of the two newly added subsets and the diversification of the SSJ treebank as a whole.
2020
Gigafida 2.0: The Reference Corpus of Written Standard Slovene
Simon Krek | Špela Arhar Holdt | Tomaž Erjavec | Jaka Čibej | Andraz Repar | Polona Gantar | Nikola Ljubešić | Iztok Kosem | Kaja Dobrovoljc
Proceedings of the Twelfth Language Resources and Evaluation Conference
We describe a new version of the Gigafida reference corpus of Slovene. In addition to updating the corpus with new material and annotating it with better tools, the focus of the upgrade was also on its transformation from a general reference corpus, which contains all language variants including non-standard language, to the corpus of standard (written) Slovene. This decision could be implemented as new corpora dedicated specifically to non-standard language emerged recently. In the new version, the whole Gigafida corpus was deduplicated for the first time, which facilitates automatic extraction of data for the purposes of compilation of new lexicographic resources such as the collocations dictionary and the thesaurus of Slovene.
2019
What does Neural Bring? Analysing Improvements in Morphosyntactic Annotation and Lemmatisation of Slovenian, Croatian and Serbian
Nikola Ljubešić | Kaja Dobrovoljc
Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing
We present experiments on Slovenian, Croatian and Serbian morphosyntactic annotation and lemmatisation, comparing the former state of the art for these three languages with one of the best performing systems at the CoNLL 2018 shared task, the Stanford NLP neural pipeline. Our experiments show significant improvements in morphosyntactic annotation, especially on categories where either semantic knowledge is needed, available through word embeddings, or where long-range dependencies have to be modelled. On the other hand, on the task of lemmatisation no improvements are obtained with the neural solution, mostly due to the heavy dependence of the task on the lookup in an external lexicon, but also due to obvious room for improvement in the Stanford NLP pipeline’s lemmatisation.
Annotating formulaic sequences in spoken Slovenian: structure, function and relevance
Kaja Dobrovoljc
Proceedings of the 13th Linguistic Annotation Workshop
This paper presents the identification of formulaic sequences in the reference corpus of spoken Slovenian and their annotation in terms of syntactic structure, pragmatic function and lexicographic relevance. The annotation campaign, specific in terms of setting, subjectivity and the multifunctionality of items under investigation, resulted in a preliminary lexicon of formulaic sequences in spoken Slovenian with immediate potential for future explorations in formulaic language research. This is especially relevant for the notable number of identified multi-word expressions with discourse-structuring and stance-marking functions, which have often been overlooked by traditional phraseology research.
Improving UD processing via satellite resources for morphology
Kaja Dobrovoljc | Tomaž Erjavec | Nikola Ljubešić
Proceedings of the Third Workshop on Universal Dependencies (UDW, SyntaxFest 2019)
2018
Er ... well, it matters, right? On the role of data representations in spoken language dependency parsing
Kaja Dobrovoljc | Matej Martinc
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)
Despite the significant improvement of data-driven dependency parsing systems in recent years, they still achieve a considerably lower performance in parsing spoken language data in comparison to written data. Taking as an example the Spoken Slovenian Treebank, the first spoken data treebank using the UD annotation scheme, we investigate which speech-specific phenomena undermine parsing performance, through a series of training data and treebank modification experiments using two distinct state-of-the-art parsing systems. Our results show that utterance segmentation is the most prominent cause of low parsing performance, both in parsing raw and pre-segmented transcriptions. In addition to shorter utterances, both parsers perform better on normalized transcriptions including basic markers of prosody and excluding disfluencies, discourse markers and fillers. On the other hand, the effects of written training data addition and speech-specific dependency representations largely depend on the parsing system selected.
2017
The Universal Dependencies Treebank for Slovenian
Kaja Dobrovoljc | Tomaž Erjavec | Simon Krek
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing
This paper introduces the Universal Dependencies Treebank for Slovenian. We overview the existing dependency treebanks for Slovenian and then detail the conversion of the ssj200k treebank to the framework of Universal Dependencies version 2. We explain the mapping of part-of-speech categories, morphosyntactic features, and the dependency relations, focusing on the more problematic language-specific issues. We conclude with a quantitative overview of the treebank and directions for further work.
2016
The Universal Dependencies Treebank of Spoken Slovenian
Kaja Dobrovoljc | Joakim Nivre
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
This paper presents the construction of an open-source dependency treebank of spoken Slovenian, the first syntactically annotated collection of spontaneous speech in Slovenian. The treebank has been manually annotated using the Universal Dependencies annotation scheme, a one-layer syntactic annotation scheme with a high degree of cross-modality, cross-framework and cross-language interoperability. In this original application of the scheme to spoken language transcripts, we address a wide spectrum of syntactic particularities in speech, either by extending the scope of application of existing universal labels or by proposing new speech-specific extensions. The initial analysis of the resulting treebank and its comparison with the written Slovenian UD treebank confirm significant syntactic differences between the two language modalities, with spoken data consisting of shorter and more elliptical sentences, fewer and simpler nominal phrases, and more relations marking disfluencies, interaction, deixis and modality.
2014
Cross-lingual Dependency Parsing of Related Languages with Rich Morphosyntactic Tagsets
Željko Agić | Jörg Tiedemann | Danijela Merkler | Simon Krek | Kaja Dobrovoljc | Sara Može
Proceedings of the EMNLP’2014 Workshop on Language Technology for Closely Related Languages and Language Variants