Sreyashi Nag
2024
IterAlign: Iterative Constitutional Alignment of Large Language Models
Xiusi Chen | Hongzhi Wen | Sreyashi Nag | Chen Luo | Qingyu Yin | Ruirui Li | Zheng Li | Wei Wang
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
With the rapid development of large language models (LLMs), aligning LLMs with human values and societal norms to ensure their reliability and safety has become crucial. Reinforcement learning with human feedback (RLHF) and Constitutional AI (CAI) have been proposed for LLM alignment. However, these methods require either heavy human annotations or explicitly pre-defined constitutions, which are labor-intensive and resource-consuming. To overcome these drawbacks, we study constitution-based LLM alignment and propose a data-driven constitution discovery and self-alignment framework called IterAlign. IterAlign leverages red teaming to unveil the weaknesses of an LLM and automatically discovers new constitutions using a stronger LLM. These constitutions are then used to guide self-correction of the base LLM. Such a constitution discovery pipeline can be run iteratively and automatically to discover new constitutions that specifically target the alignment gaps in the current LLM. Empirical results on several safety benchmark datasets and multiple base LLMs show that IterAlign successfully improves truthfulness, helpfulness, harmlessness and honesty, improving the LLM alignment by up to 13.5% in harmlessness.
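The iterative pipeline described above lends itself to a compact illustration. The sketch below is a minimal, assumption-heavy rendering of one round, not the authors' implementation: `base_llm` and `strong_llm` are hypothetical prompt-to-text callables standing in for the base model and the stronger oracle, and every prompt template is illustrative only.

```python
# A minimal sketch of one IterAlign-style round. All interfaces here are
# hypothetical: `base_llm` / `strong_llm` are placeholder callables
# (prompt -> generated text), and the prompts are illustrative templates.
from typing import Callable, List, Tuple

LLM = Callable[[str], str]  # prompt -> generated text

def iteralign_round(
    base_llm: LLM,
    strong_llm: LLM,
    red_team_prompts: List[str],
    constitutions: List[str],
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """One red-teaming -> constitution-discovery -> self-correction round."""
    corrections: List[Tuple[str, str]] = []
    for prompt in red_team_prompts:
        response = base_llm(prompt)

        # 1. Red teaming: let the stronger model judge the base response.
        verdict = strong_llm(
            f"Prompt: {prompt}\nResponse: {response}\n"
            "Is this response harmful, dishonest, or unhelpful? Answer YES or NO."
        )
        if "YES" not in verdict.upper():
            continue  # this prompt exposed no alignment gap

        # 2. Constitution discovery: ask the oracle for a general rule
        #    that would have prevented this specific failure.
        rule = strong_llm(
            f"The response below is problematic.\nPrompt: {prompt}\n"
            f"Response: {response}\nWrite one general principle a model "
            "should follow to avoid this kind of failure."
        )
        constitutions.append(rule.strip())

        # 3. Self-correction: the base model revises its answer under the
        #    accumulated constitutions; the (prompt, revision) pairs can
        #    then be used as training data before the next iteration.
        rules = "\n".join(f"- {c}" for c in constitutions)
        revised = base_llm(
            f"Principles to follow:\n{rules}\n\n"
            f"Revise your previous answer.\nQuestion: {prompt}\n"
            f"Previous answer: {response}"
        )
        corrections.append((prompt, revised))
    return constitutions, corrections
```

Because each round targets only the failures the current model still exhibits, the discovered constitutions grow to cover exactly the remaining alignment gaps, which is what makes the loop worth repeating.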
2023
Improving Consistency for Text Summarization with Energy Functions
Qi Zeng | Qingyu Yin | Zheng Li | Yifan Gao | Sreyashi Nag | Zhengyang Wang | Bing Yin | Heng Ji | Chao Zhang
Findings of the Association for Computational Linguistics: EMNLP 2023
Current abstractive summarization models often generate inconsistent content, i.e., texts that are not directly inferable from the source document, are not consistent with respect to world knowledge, or are self-contradictory. These inconsistencies motivate a new consistency taxonomy that we define as faithfulness, factuality, and self-supportiveness. However, most recent work on reducing inconsistency in document summarization focuses only on faithfulness detection and correction while ignoring other inconsistency phenomena, which limits the model's scalability. To improve general consistency, we introduce EnergySum, where we apply the Residual Energy-based Model by designing energy scorers that reflect each type of consistency. These energy scores are utilized in candidate re-ranking during the sampling process. Experiments on XSUM and CNN/DM datasets show that EnergySum mitigates the trade-off between accuracy and consistency.
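The re-ranking step can be illustrated in a few lines. The sketch below assumes the residual energy-based formulation: the base model's log-probability for each candidate is offset by weighted energy penalties, one per consistency type. The function name, scorer signature, and dummy scorers are all hypothetical, not the paper's actual code.

```python
# Minimal sketch of residual-energy re-ranking in the spirit of EnergySum.
# The scorer callables are hypothetical placeholders for trained energy
# models covering faithfulness, factuality, and self-supportiveness.
from typing import Callable, List, Tuple

EnergyFn = Callable[[str, str], float]  # (source, candidate) -> energy, lower is better

def rerank(
    source: str,
    candidates: List[Tuple[str, float]],  # (candidate summary, base log-prob)
    scorers: List[EnergyFn],
    weights: List[float],
) -> str:
    """Return the candidate maximizing log-prob minus weighted energies."""
    def residual_score(cand: Tuple[str, float]) -> float:
        text, logp = cand
        # Residual EBM: the autoregressive model's score is corrected by
        # energy penalties, one scorer per consistency type.
        return logp - sum(w * e(source, text) for w, e in zip(weights, scorers))
    return max(candidates, key=residual_score)[0]

# Toy usage with dummy scorers (real scorers would be trained models):
if __name__ == "__main__":
    cands = [("summary a", -1.2), ("summary b", -0.9)]
    dummy = lambda src, txt: 0.01 * len(txt)  # placeholder energy
    print(rerank("source doc", cands,
                 scorers=[dummy, dummy, dummy], weights=[1.0, 1.0, 1.0]))
```

Treating the energies as a residual correction, rather than retraining the summarizer, is what lets a single re-ranking pass address all three consistency types at once.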
Co-authors
- Qingyu Yin 2
- Zheng Li 2
- Qi Zeng 1
- Yifan Gao 1
- Zhengyang Wang 1