2025
The evolution of relative clauses in the IcePaHC treebank
Anton Ingason | Johanna Mechler
Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities
We examine how the elements that introduce relative clauses, namely relative complementizers and relative pronouns, evolve over the history of Icelandic using the phrase structure analysis of the IcePaHC treebank. The rate of these elements changes over time and, in the case of relative pronouns, is subject to effects of genre and the type of gap in the relative clause in question. Our paper is a digital humanities study of historical linguistics which would not be possible without a parsed corpus that spans all centuries involved in the change. We relate our findings to studies on the Constant Rate Effect by analyzing these effects in detail.
Language use of political parties over time: Stylistic Fronting in the Icelandic Gigaword Corpus
Johanna Mechler | Lilja Björk Stefánsdóttir | Anton Ingason
Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities
Political speech is an active area of investigation, and the ongoing ERC project Explaining Individual Lifespan Change (EILisCh) expands on some of the previous findings in this area. Previous work has found that political speech can differ by party membership at a fixed point in time, and that individual politicians can change their linguistic behavior over time. In this paper, we pursue a novel topic in this area: the evolution of the language use of entire political parties over time. We focus on Icelandic political parties and their use of Stylistic Fronting from 1999 to 2021, with particular emphasis on the years around the financial crisis of 2008 and the years that followed. Our results show that parties in a position of power typically speak more formally, using more Stylistic Fronting, although there are some exceptions to this pattern. We highlight the significance of relying on a large speech corpus when applying a high-definition approach to linguistic analysis across time.
2023
Evaluating a Universal Dependencies Conversion Pipeline for Icelandic
Þórunn Arnardóttir | Hinrik Hafsteinsson | Atli Jasonarson | Anton Ingason | Steinþór Steingrímsson
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
We describe the evaluation and development of a rule-based treebank conversion tool, UDConverter, which converts treebanks from the constituency-based PPCHE annotation scheme to the dependency-based Universal Dependencies (UD) scheme. The tool has already been used in the production of three UD treebanks, although no formal evaluation of the tool has been carried out as of yet. By manually correcting new output files from the converter and comparing them to the raw output, we measured the labeled attachment score (LAS) and unlabeled attachment score (UAS) of the converted texts. We obtain an LAS of 82.87 and a UAS of 87.91. In comparison to other tools, UDConverter currently provides the best results in automatic UD treebank creation for Icelandic.
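As an illustration of the two metrics reported in this abstract, the following is a minimal sketch of how LAS and UAS are conventionally computed when converter output is compared against a manually corrected version. It is not taken from UDConverter; the function name `attachment_scores`, the token representation, and the toy sentence are assumptions made for this example only.

```python
from typing import List, Tuple

# Hypothetical token representation: (head index, dependency label).
# Head index 0 stands for the root of the sentence.
Token = Tuple[int, str]


def attachment_scores(gold: List[Token], predicted: List[Token]) -> Tuple[float, float]:
    """Return (LAS, UAS) as percentages over aligned token lists."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must contain the same number of tokens")
    if not gold:
        raise ValueError("cannot score an empty token list")

    # UAS counts tokens whose head matches the gold head.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS counts tokens whose head and label both match.
    las_hits = sum(g == p for g, p in zip(gold, predicted))

    total = len(gold)
    return 100.0 * las_hits / total, 100.0 * uas_hits / total


# Toy four-token sentence (invented data, not from any treebank).
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "amod")]
predicted = [(2, "nsubj"), (0, "root"), (2, "obl"), (2, "amod")]

las, uas = attachment_scores(gold, predicted)
print(f"LAS = {las:.2f}, UAS = {uas:.2f}")
```

In this toy case three of four heads match and two of four head-label pairs match, so UAS exceeds LAS, the same ordering as in the reported scores.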
2022
Error Corpora for Different Informant Groups: Annotating and Analyzing Texts from L2 Speakers, People with Dyslexia and Children
Þórunn Arnardóttir | Isidora Glisic | Annika Simonsen | Lilja Stefánsdóttir | Anton Ingason
Proceedings of the 19th International Conference on Natural Language Processing (ICON)
Error corpora are useful for many tasks, in particular for developing spelling and grammar checking software as well as teaching materials and tools. We present and compare three specialized Icelandic error corpora: the Icelandic L2 Error Corpus, the Icelandic Dyslexia Error Corpus, and the Icelandic Child Language Error Corpus. Each corpus contains texts written by speakers of a particular group: L2 speakers of Icelandic, people with dyslexia, and children aged 10 to 15. The corpora shed light on the errors made by these groups and their frequencies, and all errors are manually labeled according to an annotation scheme. The corpora vary in size, containing between 7,817 and 24,948 errors, and are published under a CC BY 4.0 license. In this paper, we describe the corpora and their annotation scheme, and compare the errors they contain and their frequencies.
Mean Machine Translations: On Gender Bias in Icelandic Machine Translations
Agnes Sólmundsdóttir | Dagbjört Guðmundsdóttir | Lilja Björk Stefánsdóttir | Anton Ingason
Proceedings of the Thirteenth Language Resources and Evaluation Conference
This paper examines machine bias in language technology. Machine bias can affect machine learning algorithms when language models trained on large corpora include biased human decisions or reflect historical or social inequities, e.g. regarding gender and race. The focus of the paper is on gender bias in machine translation, and we discuss a study conducted on Icelandic translations in the translation systems Google Translate and Vélþýðing.is. The results show a pattern which corresponds to certain societal ideas about gender. For example, whether adjectives referring to people appear in the masculine or feminine form seems to depend on their meaning. Adjectives describing positive personality traits were more likely to appear in the masculine gender, whereas negative ones frequently appeared in the feminine gender. However, the opposite applied to appearance-related adjectives. These findings unequivocally demonstrate the importance of being vigilant towards technology so as not to maintain societal inequalities and outdated views, especially in today’s digital world.
Developing a Spell and Grammar Checker for Icelandic using an Error Corpus
Hulda Óladóttir | Þórunn Arnardóttir | Anton Ingason | Vilhjálmur Þorsteinsson
Proceedings of the Thirteenth Language Resources and Evaluation Conference
A lack of datasets for spelling and grammatical error correction in Icelandic, along with language-specific issues, has caused a dearth of spell and grammar checking systems for the language. We present the first open-source spell and grammar checking tool for Icelandic, using an error corpus at all stages. This error corpus was in part created to aid in the development of the tool. The system is built with a rule-based tool stack comprising a tokenizer, a morphological tagger, and a parser. For token-level error annotation, tokenization rules, word lists, and a trigram model are used in error detection and correction. For sentence-level error annotation, we use specific error grammar rules in the parser as well as regex-like patterns to search syntax trees. The error corpus gives valuable insight into the errors typically made when Icelandic text is written, and guided each development phase in a test-driven manner. We assess the system’s performance with both automatic and human evaluation, using the test set in the error corpus as a reference in the automatic evaluation. The data in the error corpus development set proved useful in various ways for error detection and correction.