2024
Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024
Simone Balloccu | Anya Belz | Rudali Huidrom | Ehud Reiter | Joao Sedoc | Craig Thomson
Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024
2023
Towards a Consensus Taxonomy for Annotating Errors in Automatically Generated Text
Rudali Huidrom | Anya Belz
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing
Error analysis aims to provide insights into system errors at different levels of granularity. NLP as a field has a long-standing tradition of analysing and reporting errors, which is generally considered good practice. There are existing error taxonomies tailored to different types of NLP task. In this paper, we report our work reviewing existing research on meaning/content error types in generated text, attempt to identify emerging consensus among existing meaning/content error taxonomies, and propose a standardised error taxonomy on this basis. We find that there is virtually complete agreement at the highest taxonomic level, where errors of meaning/content divide into (1) Content Omission, (2) Content Addition, and (3) Content Substitution. Consensus at the lower levels is less pronounced, but a compact standardised consensus taxonomy can nevertheless be derived that works across generation tasks and application domains.
2022
A Survey of Recent Error Annotation Schemes for Automatically Generated Text
Rudali Huidrom | Anya Belz
Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)
While automatically computing numerical scores remains the dominant paradigm in NLP system evaluation, error analysis is receiving increasing attention, with numerous error annotation schemes being proposed for automatically generated text. However, there is little agreement about what error annotation schemes should look like, how many different types of errors should be distinguished and at what level of granularity. In this paper, our aim is to map out recent work on annotating errors in automatically generated text, with a particular focus on error taxonomies. We describe our systematic paper selection process, and survey the error annotation schemes reported in the papers, drawing out similarities and differences between them. Finally, we characterise the issues that would make it difficult to move from the current situation to a standardised error taxonomy for annotating errors in automatically generated text.
Two Reproductions of a Human-Assessed Comparative Evaluation of a Semantic Error Detection System
Rudali Huidrom | Ondřej Dušek | Zdeněk Kasner | Thiago Castro Ferreira | Anya Belz
Proceedings of the 15th International Conference on Natural Language Generation: Generation Challenges
In this paper, we present the results of two reproduction studies for the human evaluation originally reported by Dušek and Kasner (2020) in which the authors comparatively evaluated outputs produced by a semantic error detection system for data-to-text generation against reference outputs. In the first reproduction, the original evaluators repeat the evaluation, in a test of the repeatability of the original evaluation. In the second study, two new evaluators carry out the evaluation task, in a test of the reproducibility of the original evaluation under otherwise identical conditions. We describe our approach to reproduction, and present and analyse results, finding different degrees of reproducibility depending on result type, data and labelling task. Our resources are available and open-sourced.
Reproducing a Manual Evaluation of the Simplicity of Text Simplification System Outputs
Maja Popović | Sheila Castilho | Rudali Huidrom | Anya Belz
Proceedings of the 15th International Conference on Natural Language Generation: Generation Challenges
In this paper we describe our reproduction study of the human evaluation of text simplicity reported by Nisioi et al. (2017). The work was carried out as part of the ReproGen Shared Task 2022 on Reproducibility of Evaluations in NLG. Our aim was to repeat the evaluation of simplicity for nine automatic text simplification systems with a different set of evaluators. We describe our experimental design together with the known aspects of the original experimental design and present the results from both studies. Pearson correlation between the original and reproduction scores is moderate to high (0.776). Inter-annotator agreement in the reproduction study is lower (0.40) than in the original study (0.66). We discuss challenges arising from the unavailability of certain aspects of the original set-up, and make several suggestions as to how reproduction of similar evaluations can be made easier in future.
Introducing EM-FT for Manipuri-English Neural Machine Translation
Rudali Huidrom | Yves Lepage
Proceedings of the WILDRE-6 Workshop within the 13th Language Resources and Evaluation Conference
This paper introduces a pretrained word embedding for Manipuri, a low-resourced Indian language. The pretrained word embedding, based on FastText, is capable of handling the highly agglutinating language Manipuri (mni). We then perform machine translation (MT) experiments using neural network (NN) models. In this paper, we confirm the following observations. First, the Transformer architecture with the FastText word embedding model EM-FT achieves higher BLEU scores than the same architecture without it in all the NMT experiments. Second, we observe that adding more training data from a domain different from that of the test data negatively impacts translation accuracy. The resources reported in this paper are made available in the ELRA catalogue to help the low-resourced languages community with MT/NLP tasks.
2021
EM Corpus: a comparable corpus for a less-resourced language pair Manipuri-English
Rudali Huidrom | Yves Lepage | Khogendra Khomdram
Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC 2021)
In this paper, we introduce a sentence-level comparable text corpus crawled and created for the less-resourced language pair Manipuri (mni) and English (eng). Our monolingual corpora comprise 1.88 million Manipuri sentences and 1.45 million English sentences, and our parallel corpus comprises 124,975 Manipuri-English sentence pairs. These data were crawled and collected over a year from August 2020 to March 2021 from a local newspaper website called ‘The Sangai Express.’ The resources reported in this paper are made available to help the low-resourced languages community with MT/NLP tasks.
2020
Zero-shot translation among Indian languages
Rudali Huidrom | Yves Lepage
Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages
Standard neural machine translation (NMT) allows a model to perform translation between a pair of languages. Multilingual NMT, on the other hand, allows a model to perform translation between several language pairs, even between language pairs for which no sentence pair has been seen during training (zero-shot translation). This paper presents experiments with zero-shot translation on low-resource Indian languages with a very small amount of data for each language pair. We first report results on balanced data over all considered language pairs. We then extend our experiments over three additional rounds, increasing the training data by 2,000 sentence pairs in each round for some of the language pairs. For Manipuri to Hindi, translation accuracy during Round-III of zero-shot translation is 7 times the score obtained in the balanced data setting.