SedarEval: Automated Evaluation using Self-Adaptive Rubrics

Zhiyuan Fan, Weinong Wang, Xing W, Debing Zhang


Abstract
The LLM-as-judge evaluation paradigm has gained popularity because it substantially reduces human labor and time costs. This approach uses one or more large language models (LLMs) to assess the quality of outputs from other LLMs. However, existing methods rely on generic scoring rubrics that fail to account for the specifics of each question and its problem-solving process, compromising the precision and stability of assessments. Inspired by how humans grade examinations, we propose a new evaluation paradigm based on self-adaptive rubrics. Specifically, we create a detailed scoring rubric for each question, capturing its primary and secondary criteria in a structured format of scoring and deduction points that mimics a human evaluator’s analytical process. Building on this paradigm, we further develop a novel benchmark, SedarEval, which covers domains including long-tail knowledge, mathematics, coding, and logical reasoning. SedarEval consists of 1,000 meticulously crafted questions, each with its own self-adaptive rubric. To further streamline evaluation, we train a specialized evaluator language model (evaluator LM) to supplant human graders. Using the same training data, our evaluator LM achieves a higher concordance rate with human grading results than other paradigms, including GPT-4, highlighting the superiority and efficiency of our approach.
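The abstract describes per-question rubrics composed of scoring points (awarded when a criterion is met) and deduction points (subtracted when an error appears). A minimal sketch of what such a rubric entry and its scoring logic might look like is below; the class names, fields, and example criteria are illustrative assumptions, not the paper's actual data format.

```python
from dataclasses import dataclass, field

@dataclass
class RubricItem:
    description: str  # criterion the grader checks in the model's answer
    points: float     # positive for a scoring point, negative for a deduction point

@dataclass
class SelfAdaptiveRubric:
    question: str
    items: list[RubricItem] = field(default_factory=list)

    def score(self, satisfied: set[int]) -> float:
        """Sum points for the rubric items (by index) the grader marked as met,
        clamping the total at zero so deductions cannot go negative."""
        total = sum(it.points for i, it in enumerate(self.items) if i in satisfied)
        return max(0.0, total)

# Hypothetical rubric for a single question, mixing primary and secondary
# criteria with one deduction point.
rubric = SelfAdaptiveRubric(
    question="What is the time complexity of binary search?",
    items=[
        RubricItem("States O(log n)", 2.0),
        RubricItem("Justifies the bound by halving the search interval", 1.0),
        RubricItem("Confuses it with linear search (deduction)", -1.0),
    ],
)

print(rubric.score({0, 1}))  # both positive criteria met, no deduction -> 3.0
```

In the paper's pipeline, the grader marking which items are satisfied would itself be the trained evaluator LM rather than a human; this sketch only shows the rubric structure that makes each question's evaluation self-adaptive.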
Anthology ID:
2024.findings-emnlp.984
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
16916–16930
URL:
https://aclanthology.org/2024.findings-emnlp.984/
DOI:
10.18653/v1/2024.findings-emnlp.984
Cite (ACL):
Zhiyuan Fan, Weinong Wang, Xing W, and Debing Zhang. 2024. SedarEval: Automated Evaluation using Self-Adaptive Rubrics. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 16916–16930, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
SedarEval: Automated Evaluation using Self-Adaptive Rubrics (Fan et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-emnlp.984.pdf