Compare without Despair: Reliable Preference Evaluation with Generation Separability

Sayan Ghosh, Tejas Srinivasan, Swabha Swayamdipta


Abstract
Human evaluation of generated language through pairwise preference judgments is pervasive. However, in common scenarios, such as when generations from a model pair are very similar or when stochastic decoding produces large variations in generations, such evaluation yields inconsistent preference ratings. We address these challenges by introducing a meta-evaluation measure, separability, which estimates how suitable a test instance is for pairwise preference evaluation. For a candidate test instance, separability samples multiple generations from a pair of models and measures how distinguishable the two sets of generations are. Our experiments show that instances with high separability values yield more consistent preference ratings from both human- and auto-raters. Further, the distribution of separability offers insight into which test benchmarks are more valuable for comparing models. Finally, we incorporate separability into Elo ratings, accounting for how suitable each test instance might be for reliably ranking LLMs. Overall, separability has implications for consistent, efficient and robust preference evaluation of LLMs with both human- and auto-raters.
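The idea in the abstract can be illustrated with a minimal sketch. This is not the paper's formulation: the abstract does not specify the similarity metric, how the two sets of generations are compared, or how separability enters the rating update. The code below assumes a toy token-overlap (Jaccard) similarity, defines separability as mean within-set similarity minus mean cross-set similarity, and scales a standard Elo update by that value. All function names and the `weight` parameter are hypothetical.

```python
from itertools import combinations, product
from statistics import mean


def jaccard(a: str, b: str) -> float:
    """Toy similarity: token-set overlap. A stand-in for any text similarity metric."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def self_similarity(gens: list[str]) -> float:
    """Mean similarity over all pairs within one model's samples (needs >= 2 samples)."""
    return mean(jaccard(x, y) for x, y in combinations(gens, 2))


def cross_similarity(gens_a: list[str], gens_b: list[str]) -> float:
    """Mean similarity over all pairs drawn across the two models' samples."""
    return mean(jaccard(x, y) for x, y in product(gens_a, gens_b))


def separability(gens_a: list[str], gens_b: list[str]) -> float:
    """Assumed definition: within-set similarity minus cross-set similarity, floored at 0.

    High values mean each model's samples resemble one another more than
    they resemble the other model's samples.
    """
    within = (self_similarity(gens_a) + self_similarity(gens_b)) / 2
    return max(0.0, within - cross_similarity(gens_a, gens_b))


def weighted_elo_update(r_a: float, r_b: float, score_a: float,
                        weight: float, k: float = 32.0) -> tuple[float, float]:
    """Standard Elo update with the step scaled by `weight` (an assumed use of separability).

    score_a is 1.0 if model A is preferred, 0.0 if B is preferred, 0.5 for a tie.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * weight * (score_a - expected_a)
    return r_a + delta, r_b - delta


# Two models whose samples are internally identical but share no tokens across models.
sep = separability(["a b", "a b"], ["c d", "c d"])  # 1.0
ratings = weighted_elo_update(1000.0, 1000.0, score_a=1.0, weight=0.5)  # (1008.0, 992.0)
```

Under these assumptions, an instance whose two sample sets are indistinguishable gets a separability of 0 and leaves the ratings unchanged, while a highly separable instance moves them by close to the full Elo step.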
Anthology ID:
2024.findings-emnlp.747
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
12787–12805
URL:
https://aclanthology.org/2024.findings-emnlp.747/
DOI:
10.18653/v1/2024.findings-emnlp.747
Cite (ACL):
Sayan Ghosh, Tejas Srinivasan, and Swabha Swayamdipta. 2024. Compare without Despair: Reliable Preference Evaluation with Generation Separability. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 12787–12805, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Compare without Despair: Reliable Preference Evaluation with Generation Separability (Ghosh et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-emnlp.747.pdf