Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP
Anya Belz, Craig Thomson, Ehud Reiter, Gavin Abercrombie, Jose M. Alonso-Moral, Mohammad Arvan, Anouck Braggaar, Mark Cieliebak, Elizabeth Clark, Kees van Deemter, Tanvi Dinkar, Ondřej Dušek, Steffen Eger, Qixiang Fang, Mingqi Gao, Albert Gatt, Dimitra Gkatzia, Javier González-Corbelle, Dirk Hovy, Manuela Hürlimann, Takumi Ito, John D. Kelleher, Filip Klubicka, Emiel Krahmer, Huiyuan Lai, Chris van der Lee, Yiru Li, Saad Mahamood, Margot Mieskes, Emiel van Miltenburg, Pablo Mosteiro, Malvina Nissim, Natalie Parde, Ondřej Plátek, Verena Rieser, Jie Ruan, Joel Tetreault, Antonio Toral, Xiaojun Wan, Leo Wanner, Lewis Watson, Diyi Yang
Abstract
We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible. We present our results and findings, which include that just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction, and that all but one of the experiments we selected for reproduction was discovered to have flaws that made the meaningfulness of conducting a reproduction questionable. As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach. Our overall (negative) finding that the great majority of human evaluations in NLP is not repeatable and/or not reproducible and/or too flawed to justify reproduction, paints a dire picture, but presents an opportunity for a rethink about how to design and report human evaluations in NLP.
- Anthology ID:
- 2023.insights-1.1
- Original:
- 2023.insights-1.1v1
- Version 2:
- 2023.insights-1.1v2
- Volume:
- Proceedings of the Fourth Workshop on Insights from Negative Results in NLP
- Month:
- May
- Year:
- 2023
- Address:
- Dubrovnik, Croatia
- Editors:
- Shabnam Tafreshi, Arjun Akula, João Sedoc, Aleksandr Drozd, Anna Rogers, Anna Rumshisky
- Venues:
- insights | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 1–10
- URL:
- https://aclanthology.org/2023.insights-1.1/
- DOI:
- 10.18653/v1/2023.insights-1.1
- Bibkey:
- belz-etal-2023-missing
- Cite (ACL):
- Anya Belz, Craig Thomson, Ehud Reiter, Gavin Abercrombie, Jose M. Alonso-Moral, Mohammad Arvan, Anouck Braggaar, Mark Cieliebak, Elizabeth Clark, Kees van Deemter, Tanvi Dinkar, Ondřej Dušek, Steffen Eger, Qixiang Fang, Mingqi Gao, Albert Gatt, Dimitra Gkatzia, Javier González-Corbelle, Dirk Hovy, Manuela Hürlimann, Takumi Ito, John D. Kelleher, Filip Klubicka, Emiel Krahmer, Huiyuan Lai, Chris van der Lee, Yiru Li, Saad Mahamood, Margot Mieskes, Emiel van Miltenburg, Pablo Mosteiro, Malvina Nissim, Natalie Parde, Ondřej Plátek, Verena Rieser, Jie Ruan, Joel Tetreault, Antonio Toral, Xiaojun Wan, Leo Wanner, Lewis Watson, and Diyi Yang. 2023. Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP. In Proceedings of the Fourth Workshop on Insights from Negative Results in NLP, pages 1–10, Dubrovnik, Croatia. Association for Computational Linguistics.
- Cite (Informal):
- Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP (Belz et al., insights 2023)
- PDF:
- https://aclanthology.org/2023.insights-1.1.pdf
- Video:
- https://aclanthology.org/2023.insights-1.1.mp4
Export citation
@inproceedings{belz-etal-2023-missing,
    title = "Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in {NLP}",
    author = {Belz, Anya and Thomson, Craig and Reiter, Ehud and Abercrombie, Gavin and Alonso-Moral, Jose M. and Arvan, Mohammad and Braggaar, Anouck and Cieliebak, Mark and Clark, Elizabeth and van Deemter, Kees and Dinkar, Tanvi and Du{\v{s}}ek, Ond{\v{r}}ej and Eger, Steffen and Fang, Qixiang and Gao, Mingqi and Gatt, Albert and Gkatzia, Dimitra and Gonz{\'a}lez-Corbelle, Javier and Hovy, Dirk and H{\"u}rlimann, Manuela and Ito, Takumi and Kelleher, John D. and Klubicka, Filip and Krahmer, Emiel and Lai, Huiyuan and van der Lee, Chris and Li, Yiru and Mahamood, Saad and Mieskes, Margot and van Miltenburg, Emiel and Mosteiro, Pablo and Nissim, Malvina and Parde, Natalie and Pl{\'a}tek, Ond{\v{r}}ej and Rieser, Verena and Ruan, Jie and Tetreault, Joel and Toral, Antonio and Wan, Xiaojun and Wanner, Leo and Watson, Lewis and Yang, Diyi},
    editor = "Tafreshi, Shabnam and Akula, Arjun and Sedoc, Jo{\~a}o and Drozd, Aleksandr and Rogers, Anna and Rumshisky, Anna",
    booktitle = "Proceedings of the Fourth Workshop on Insights from Negative Results in NLP",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.insights-1.1/",
    doi = "10.18653/v1/2023.insights-1.1",
    pages = "1--10",
    abstract = "We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible. We present our results and findings, which include that just 13{\%} of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction, and that all but one of the experiments we selected for reproduction was discovered to have flaws that made the meaningfulness of conducting a reproduction questionable. As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach. Our overall (negative) finding that the great majority of human evaluations in NLP is not repeatable and/or not reproducible and/or too flawed to justify reproduction, paints a dire picture, but presents an opportunity for a rethink about how to design and report human evaluations in NLP."
}