EPO: Hierarchical LLM Agents with Environment Preference Optimization

Qi Zhao, Haotian Fu, Chen Sun, George Konidaris


Abstract
Long-horizon decision-making tasks present significant challenges for LLM-based agents due to the need for extensive planning over multiple steps. In this paper, we propose a hierarchical framework that decomposes complex tasks into manageable subgoals, utilizing separate LLMs for subgoal prediction and low-level action generation. To address the challenge of creating training signals for unannotated datasets, we develop a reward model that leverages multimodal environment feedback to automatically generate reward signals. We introduce Environment Preference Optimization (EPO), a novel method that generates preference signals from the environment’s feedback and uses them to train LLM-based agents. Extensive experiments on ALFRED demonstrate the state-of-the-art performance of our framework, achieving first place on the ALFRED public leaderboard and showcasing its potential to improve long-horizon decision-making in diverse environments.
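The abstract describes EPO only at a high level: preference signals are derived from multimodal environment feedback and used to train the LLM-based agent. As a rough illustration of how such environment-generated preference pairs could drive training, the sketch below applies a DPO-style objective to a "chosen" vs. "rejected" subgoal/action sequence for the same instruction. The pairing construction, the `beta` temperature, and the HuggingFace-style model interface are assumptions for illustration only, not the paper's exact formulation.

```python
# Minimal sketch of a DPO-style preference-optimization step on one
# environment-generated preference pair (assumed setup, not the paper's
# exact objective). `chosen`/`rejected` are token id tensors for two
# candidate sequences; which one is preferred is assumed to come from
# environment feedback, e.g. a reward model scoring ALFRED rollouts.
import torch
import torch.nn.functional as F

def sequence_logprob(model, prompt_ids, response_ids):
    """Sum of token log-probs of `response_ids` conditioned on `prompt_ids`."""
    input_ids = torch.cat([prompt_ids, response_ids], dim=-1).unsqueeze(0)
    logits = model(input_ids).logits[0, :-1]           # next-token logits
    targets = input_ids[0, 1:]
    logps = F.log_softmax(logits, dim=-1)
    token_logps = logps.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_logps[prompt_ids.numel() - 1:].sum()  # response tokens only

def preference_loss(policy, reference, prompt, chosen, rejected, beta=0.1):
    """DPO-style loss: push the policy toward the environment-preferred sequence."""
    pi_c = sequence_logprob(policy, prompt, chosen)
    pi_r = sequence_logprob(policy, prompt, rejected)
    with torch.no_grad():                               # frozen reference model
        ref_c = sequence_logprob(reference, prompt, chosen)
        ref_r = sequence_logprob(reference, prompt, rejected)
    margin = beta * ((pi_c - ref_c) - (pi_r - ref_r))
    return -F.logsigmoid(margin)
```

In this sketch the same loss could be applied at either level of the hierarchy, e.g. to the subgoal-prediction LLM or to the low-level action LLM, with preference pairs built from the corresponding environment feedback; the full paper specifies the actual procedure.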
Anthology ID: 2024.emnlp-main.367
Volume: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 6401–6415
URL: https://aclanthology.org/2024.emnlp-main.367/
DOI: 10.18653/v1/2024.emnlp-main.367
Cite (ACL): Qi Zhao, Haotian Fu, Chen Sun, and George Konidaris. 2024. EPO: Hierarchical LLM Agents with Environment Preference Optimization. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 6401–6415, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): EPO: Hierarchical LLM Agents with Environment Preference Optimization (Zhao et al., EMNLP 2024)
PDF: https://aclanthology.org/2024.emnlp-main.367.pdf