2025
The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models
Seungone Kim | Juyoung Suk | Ji Yong Cho | Shayne Longpre | Chaeeun Kim | Dongkeun Yoon | Guijin Son | Yejin Cho | Sheikh Shafayat | Jinheon Baek | Sue Hyun Park | Hyeonbin Hwang | Jinkyung Jo | Hyowon Cho | Haebin Shin | Seongyun Lee | Hanseok Oh | Noah Lee | Namgyu Ho | Se June Joo | Miyoung Ko | Yoonjoo Lee | Hyungjoo Chae | Jamin Shin | Joel Jang | Seonghyeon Ye | Bill Yuchen Lin | Sean Welleck | Graham Neubig | Moontae Lee | Kyungjae Lee | Minjoon Seo
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
As language models (LMs) become capable of handling a wide range of tasks, their evaluation is becoming as challenging as their development. Most generation benchmarks currently assess LMs using abstract evaluation criteria, such as helpfulness and harmlessness, which often lack the flexibility and granularity of human assessment. Additionally, these benchmarks tend to focus disproportionately on specific capabilities such as instruction following, leading to coverage bias. To overcome these limitations, we introduce the BiGGen Bench, a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks. A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation. We apply this benchmark to assess 100 frontier LMs using five evaluator LMs. Our code, data, and evaluation results are all publicly available at https://github.com/prometheus-eval/prometheus-eval.
Generative Prompt Internalization
Haebin Shin | Lei Ji | Yeyun Gong | Sungdong Kim | Eunbi Choi | Minjoon Seo
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Prompts used in recent large language model-based applications are often fixed and lengthy, leading to significant computational overhead. To address this challenge, we propose Generative Prompt Internalization (GenPI), a lightweight method that employs a joint training approach. GenPI not only replicates the behavior of models with prompt inputs but also generates the content of the prompt along with reasons why the model’s behavior should change accordingly. We demonstrate that our approach effectively internalizes complex prompts across various agent-based application scenarios. For effective training without interactions with the dedicated environments, we introduce a data synthesis technique that autonomously collects conversational datasets by swapping the roles of the agent and environment. This method is especially useful in scenarios where only a predefined prompt is available without a corresponding training dataset. By internalizing complex prompts, Generative Prompt Internalization enables high performance and efficient inference without the need for explicit prompts.
2024
KTRL+F: Knowledge-Augmented In-Document Search
Hanseok Oh | Haebin Shin | Miyoung Ko | Hyunji Lee | Minjoon Seo
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
We introduce a new problem, KTRL+F, a knowledge-augmented in-document search task that requires real-time identification of all semantic targets within a document, with awareness of external sources, through a single natural query. KTRL+F addresses the following unique challenges for in-document search: 1) utilizing knowledge outside the document for extended use of additional information about targets, and 2) balancing real-time applicability with performance. We analyze various baselines in KTRL+F and find limitations of existing models, such as hallucinations, high latency, or difficulties in leveraging external knowledge. Therefore, we propose a Knowledge-Augmented Phrase Retrieval model that shows a promising balance between speed and performance by simply augmenting external knowledge in phrase embedding. We also conduct a user study to verify whether solving KTRL+F can enhance the search experience for users. It demonstrates that even with our simple model, users can reduce search time, issuing fewer queries and making fewer extra visits to other sources to collect evidence. We encourage the research community to work on KTRL+F to enable more efficient in-document information access.
2022
Learning to Embed Multi-Modal Contexts for Situated Conversational Agents
Haeju Lee | Oh Joon Kwon | Yunseon Choi | Minho Park | Ran Han | Yoonhyung Kim | Jinhyeon Kim | Youngjune Lee | Haebin Shin | Kangwook Lee | Kee-Eung Kim
Findings of the Association for Computational Linguistics: NAACL 2022
The Situated Interactive Multi-Modal Conversations (SIMMC) 2.0 challenge aims to create virtual shopping assistants that can accept complex multi-modal inputs, i.e., visual appearances of objects and user utterances. It consists of four subtasks: multi-modal disambiguation (MM-Disamb), multi-modal coreference resolution (MM-Coref), multi-modal dialog state tracking (MM-DST), and response retrieval and generation. While many task-oriented dialog systems tackle each subtask separately, we propose a jointly learned multi-modal encoder-decoder that incorporates visual inputs and performs all four subtasks at once for efficiency. This approach won the MM-Coref and response retrieval subtasks and was nominated runner-up for the remaining subtasks using a single unified model at the 10th Dialog Systems Technology Challenge (DSTC10), setting a high bar for the novel task of multi-modal task-oriented dialog systems.