2024
Automatic Metrics in Natural Language Generation: A survey of Current Evaluation Practices
Patricia Schmidtova | Saad Mahamood | Simone Balloccu | Ondrej Dusek | Albert Gatt | Dimitra Gkatzia | David M. Howcroft | Ondrej Platek | Adarsa Sivaprasad
Proceedings of the 17th International Natural Language Generation Conference
Automatic metrics are extensively used to evaluate Natural Language Processing systems. However, there has been increasing focus on how they are used and reported by practitioners within the field. In this paper, we have conducted a survey on the use of automatic metrics, focusing particularly on natural language generation tasks. We inspect which metrics are used as well as why they are chosen and how their use is reported. Our findings from this survey reveal significant shortcomings, including inappropriate metric usage, lack of implementation details and missing correlations with human judgements. We conclude with recommendations that we believe authors should follow to enable more rigour within the field.
factgenie: A Framework for Span-based Evaluation of Generated Texts
Zdeněk Kasner | Ondrej Platek | Patricia Schmidtova | Simone Balloccu | Ondrej Dusek
Proceedings of the 17th International Natural Language Generation Conference: System Demonstrations
We present factgenie: a framework for annotating and visualizing word spans in textual model outputs. Annotations can capture various span-based phenomena such as semantic inaccuracies or irrelevant text. With factgenie, the annotations can be collected both from human crowdworkers and large language models. Our framework consists of a web interface for data visualization and gathering text annotations, powered by an easily extensible codebase.
Proceedings of the 2nd Workshop on Practical LLM-assisted Data-to-Text Generation
Simone Balloccu | Zdeněk Kasner | Ondřej Plátek | Patrícia Schmidtová | Kristýna Onderková | Mateusz Lango | Ondřej Dušek | Lucie Flek | Ehud Reiter | Dimitra Gkatzia | Simon Mille
Proceedings of the 2nd Workshop on Practical LLM-assisted Data-to-Text Generation
2023
With a Little Help from the Authors: Reproducing Human Evaluation of an MT Error Detector
Ondrej Platek | Mateusz Lango | Ondrej Dusek
Proceedings of the 3rd Workshop on Human Evaluation of NLP Systems
This work presents our efforts to reproduce the results of the human evaluation experiment presented in the paper of Vamvas and Sennrich (2022), which evaluated an automatic system detecting over- and undertranslations (translations containing more or less information than the original) in machine translation (MT) outputs. Despite the high quality of the documentation and code provided by the authors, we discuss some problems we found in reproducing the exact experimental setup and offer recommendations for improving reproducibility. Our replicated results generally confirm the conclusions of the original study, but in some cases statistically significant differences were observed, suggesting a high variability of human annotation.
TabGenie: A Toolkit for Table-to-Text Generation
Zdeněk Kasner | Ekaterina Garanina | Ondrej Platek | Ondrej Dusek
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Heterogeneity of data-to-text generation datasets limits the research on data-to-text generation systems. We present TabGenie – a toolkit which enables researchers to explore, preprocess, and analyze a variety of data-to-text generation datasets through the unified framework of table-to-text generation. In TabGenie, all inputs are represented as tables with associated metadata. The tables can be explored through a web interface, which also provides an interactive mode for debugging table-to-text generation, facilitates side-by-side comparison of generated system outputs, and allows easy exports for manual analysis. Furthermore, TabGenie is equipped with command line processing tools and Python bindings for unified dataset loading and processing. We release TabGenie as a PyPI package and provide its open-source code and a live demo at https://github.com/kasnerz/tabgenie.
Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP
Anya Belz | Craig Thomson | Ehud Reiter | Gavin Abercrombie | Jose M. Alonso-Moral | Mohammad Arvan | Anouck Braggaar | Mark Cieliebak | Elizabeth Clark | Kees van Deemter | Tanvi Dinkar | Ondřej Dušek | Steffen Eger | Qixiang Fang | Mingqi Gao | Albert Gatt | Dimitra Gkatzia | Javier González-Corbelle | Dirk Hovy | Manuela Hürlimann | Takumi Ito | John D. Kelleher | Filip Klubicka | Emiel Krahmer | Huiyuan Lai | Chris van der Lee | Yiru Li | Saad Mahamood | Margot Mieskes | Emiel van Miltenburg | Pablo Mosteiro | Malvina Nissim | Natalie Parde | Ondřej Plátek | Verena Rieser | Jie Ruan | Joel Tetreault | Antonio Toral | Xiaojun Wan | Leo Wanner | Lewis Watson | Diyi Yang
Proceedings of the Fourth Workshop on Insights from Negative Results in NLP
We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible. We present our results and findings, which include that just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction, and that all but one of the experiments we selected for reproduction was discovered to have flaws that made the meaningfulness of conducting a reproduction questionable. As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach. Our overall (negative) finding that the great majority of human evaluations in NLP is not repeatable and/or not reproducible and/or too flawed to justify reproduction, paints a dire picture, but presents an opportunity for a rethink about how to design and report human evaluations in NLP.
Three Ways of Using Large Language Models to Evaluate Chat
Ondřej Plátek | Vojtech Hudecek | Patricia Schmidtova | Mateusz Lango | Ondrej Dusek
Proceedings of The Eleventh Dialog System Technology Challenge
This paper describes the systems submitted by team6 for ChatEval, the DSTC 11 Track 4 competition. We present three different approaches to predicting turn-level qualities of chatbot responses based on large language models (LLMs). We report improvement over the baseline using dynamic few-shot examples from a vector store for the prompts for ChatGPT. We also analyze the performance of the other two approaches and report needed improvements for future work. We developed the three systems over just two weeks, showing the potential of LLMs for this task. An ablation study conducted after the challenge deadline shows that the new Llama 2 models are closing the performance gap between ChatGPT and open-source LLMs. However, we find that the Llama 2 models do not benefit from few-shot examples in the same way as ChatGPT.
2018
Using Adversarial Examples in Natural Language Processing
Petr Bělohlávek | Ondřej Plátek | Zdeněk Žabokrtský | Milan Straka
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2014
Free English and Czech telephone speech corpus shared under the CC-BY-SA 3.0 license
Matěj Korvas | Ondřej Plátek | Ondřej Dušek | Lukáš Žilka | Filip Jurčíček
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
We present a dataset of telephone conversations in English and Czech, developed for training acoustic models for automatic speech recognition (ASR) in spoken dialogue systems (SDSs). The data comprise 45 hours of speech in English and over 18 hours in Czech. A large part of the data, both audio and transcriptions, was collected using crowdsourcing; the rest are transcriptions by hired transcribers. We release the data together with scripts for data pre-processing and building acoustic models using the HTK and Kaldi ASR toolkits. We also publish the trained models described in this paper. The data are released under the CC-BY-SA 3.0 license; the scripts are licensed under Apache 2.0. In the paper, we report on the methodology of collecting the data, on the size and properties of the data, and on the scripts and their use. We verify the usability of the datasets by training and evaluating acoustic models using the presented data and scripts.
Alex: Bootstrapping a Spoken Dialogue System for a New Domain by Real Users
Ondřej Dušek | Ondřej Plátek | Lukáš Žilka | Filip Jurčíček
Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL)
Free on-line speech recogniser based on Kaldi ASR toolkit producing word posterior lattices
Ondřej Plátek | Filip Jurčíček
Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL)