Can Large Language Models Fix Data Annotation Errors? An Empirical Study Using Debatepedia for Query-Focused Text Summarization

Md Tahmid Rahman Laskar, Mizanur Rahman, Israt Jahan, Enamul Hoque, Jimmy Huang


Abstract
Debatepedia is a publicly available dataset of arguments and counter-arguments on controversial topics that has been widely used for single-document query-focused abstractive summarization in recent years. However, it was recently found that this dataset is limited by noise: most of its queries have no relevance to the corresponding document. In this paper, we study whether large language models (LLMs) can be used to clean the Debatepedia dataset and make it suitable for query-focused abstractive summarization. More specifically, we harness the language generation capabilities of two LLMs, ChatGPT and PaLM, to regenerate its queries. Based on our experiments, we find that relying solely on LLMs for query correction is not very useful for data cleaning. However, we observe that a rule-based approach for data sampling, followed by query regeneration with LLMs (especially ChatGPT) on the sampled instances, can yield a higher-quality version of this dataset suitable for developing more generalized query-focused text summarization models.
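The pipeline the abstract describes, rule-based sampling of noisy instances followed by LLM-based query regeneration, can be sketched as follows. This is a minimal illustration, not the paper's actual rules or prompts: the word-overlap heuristic, stop-word list, and the stubbed `regenerate_query` function are all assumptions made for the example.

```python
import string

def needs_regeneration(query: str, document: str) -> bool:
    """Rule-based sampling step (illustrative): flag an instance whose
    query shares no content words with its source document, a rough
    proxy for the query being irrelevant to the document."""
    stop = {"the", "a", "an", "is", "are", "of", "to", "in",
            "and", "or", "should", "be", "do", "does"}
    def norm(text: str) -> set:
        words = {w.strip(string.punctuation).lower() for w in text.split()}
        return words - stop - {""}
    return not (norm(query) & norm(document))

def regenerate_query(document: str, summary: str) -> str:
    """Query regeneration step: in the paper this is done by prompting
    an LLM (ChatGPT or PaLM) with the instance; stubbed here since the
    actual prompts and API calls are not reproduced."""
    raise NotImplementedError("call an LLM API here")

# Usage: only the flagged instances are sent to the LLM for rewriting.
doc = "Solar subsidies lower household energy bills over time."
assert needs_regeneration("Is milk healthy?", doc)
assert not needs_regeneration("Do solar subsidies reduce energy bills?", doc)
```

The point of the two-stage design is cost and precision: the cheap rule filters the dataset so the LLM only rewrites queries that the heuristic judges irrelevant, rather than regenerating every query.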
Anthology ID:
2023.findings-emnlp.686
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2023
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
10245–10255
URL:
https://aclanthology.org/2023.findings-emnlp.686
DOI:
10.18653/v1/2023.findings-emnlp.686
Cite (ACL):
Md Tahmid Rahman Laskar, Mizanur Rahman, Israt Jahan, Enamul Hoque, and Jimmy Huang. 2023. Can Large Language Models Fix Data Annotation Errors? An Empirical Study Using Debatepedia for Query-Focused Text Summarization. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10245–10255, Singapore. Association for Computational Linguistics.
Cite (Informal):
Can Large Language Models Fix Data Annotation Errors? An Empirical Study Using Debatepedia for Query-Focused Text Summarization (Laskar et al., Findings 2023)
PDF:
https://aclanthology.org/2023.findings-emnlp.686.pdf