Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation

Qin Zhu, Qinyuan Cheng, Runyu Peng, Xiaonan Li, Ru Peng, Tengxiao Liu, Xipeng Qiu, Xuanjing Huang


Abstract
The training process of large language models (LLMs) often involves varying degrees of test data contamination. Although current LLMs are achieving increasingly better performance on various benchmarks, their performance in practical applications does not always match their benchmark results. Leakage of benchmarks can prevent the accurate assessment of LLMs’ true performance. However, constructing new benchmarks is costly, labor-intensive, and still carries the risk of leakage. Therefore, in this paper, we ask the question: can we reuse these leaked benchmarks for LLM evaluation? We propose Inference-Time Decontamination (ITD) to address this issue by detecting and rewriting leaked samples without altering their difficulty. ITD can mitigate performance inflation caused by memorizing leaked benchmarks. Our proof-of-concept experiments demonstrate that ITD reduces inflated accuracy by 22.9% on GSM8K and 19.0% on MMLU. On MMLU, applying ITD lowers the results of Phi3 and Mistral by 6.7% and 3.6%, respectively. We hope that ITD can provide more truthful evaluation results for large language models.
Anthology ID:
2024.findings-emnlp.532
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
9113–9129
URL:
https://aclanthology.org/2024.findings-emnlp.532/
DOI:
10.18653/v1/2024.findings-emnlp.532
Cite (ACL):
Qin Zhu, Qinyuan Cheng, Runyu Peng, Xiaonan Li, Ru Peng, Tengxiao Liu, Xipeng Qiu, and Xuanjing Huang. 2024. Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 9113–9129, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation (Zhu et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-emnlp.532.pdf