2024
pdf
bib
abs
Black-Box Prompt Optimization: Aligning Large Language Models without Model Training
Jiale Cheng
|
Xiao Liu
|
Kehan Zheng
|
Pei Ke
|
Hongning Wang
|
Yuxiao Dong
|
Jie Tang
|
Minlie Huang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) have shown impressive success in various applications. However, these models are often not well aligned with human intents, which calls for additional treatments on them; that is, the alignment problem. To make LLMs better follow user instructions, existing alignment methods primarily focus on further training them. However, the extra training of LLMs is usually expensive in terms of GPU computing; even worse, some LLMs are not accessible for user-demanded training, such as GPTs. In this work, we take a different perspective—Black-Box Prompt Optimization (BPO)—to perform alignments. The idea is to optimize user prompts to suit LLMs’ input understanding, so as to best realize users’ intents without updating LLMs’ parameters. BPO leverages human preferences to optimize prompts, thus making it superior to LLM (e.g., ChatGPT) as a prompt engineer. Moreover, BPO is model-agnostic, and the empirical results demonstrate that the BPO-aligned ChatGPT yields a 22% increase in the win rate against its original version and 10% for GPT-4. Notably, the BPO-aligned LLMs can outperform the same models aligned by PPO and DPO, and it also brings additional performance gains when combining BPO with PPO or DPO. Code and datasets are released at https://github.com/thu-coai/BPO.
pdf
bib
abs
Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization
Zhexin Zhang
|
Junxiao Yang
|
Pei Ke
|
Fei Mi
|
Hongning Wang
|
Minlie Huang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
While significant attention has been dedicated to exploiting weaknesses in LLMs through jailbreaking attacks, there remains a paucity of effort in defending against these attacks. We point out a pivotal factor contributing to the success of jailbreaks: the intrinsic conflict between the goals of being helpful and ensuring safety. Accordingly, we propose to integrate goal prioritization at both training and inference stages to counteract. Implementing goal prioritization during inference substantially diminishes the Attack Success Rate (ASR) of jailbreaking from 66.4% to 3.6% for ChatGPT. And integrating goal prioritization into model training reduces the ASR from 71.0% to 6.6% for Llama2-13B. Remarkably, even in scenarios where no jailbreaking samples are included during training, our approach slashes the ASR by half. Additionally, our findings reveal that while stronger LLMs face greater safety risks, they also possess a greater capacity to be steered towards defending against such attacks, both because of their stronger ability in instruction following. Our work thus contributes to the comprehension of jailbreaking attacks and defenses, and sheds light on the relationship between LLMs’ capability and safety. Our code is available at https://github.com/thu-coai/JailbreakDefense_GoalPriority.
pdf
bib
abs
AlignBench: Benchmarking Chinese Alignment of Large Language Models
Xiao Liu
|
Xuanyu Lei
|
Shengyuan Wang
|
Yue Huang
|
Andrew Feng
|
Bosi Wen
|
Jiale Cheng
|
Pei Ke
|
Yifan Xu
|
Weng Lam Tam
|
Xiaohan Zhang
|
Lichao Sun
|
Xiaotao Gu
|
Hongning Wang
|
Jing Zhang
|
Minlie Huang
|
Yuxiao Dong
|
Jie Tang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Alignment has become a critical step for instruction-tuned Large Language Models (LLMs) to become helpful assistants. However, effective evaluation of alignment for emerging Chinese LLMs is still significantly lacking, calling for real-scenario grounded, open-ended, challenging and automatic evaluations tailored for alignment. To fill in this gap, we introduce AlignBench, a comprehensive multi-dimensional benchmark for evaluating LLMs’ alignment in Chinese. We tailor a human-in-the-loop data curation pipeline, containing 8 main categories, 683 real-scenario rooted queries and corresponding human verified references.To ensure references’ correctness, each knowledge-intensive query is accompanied with evidences collected from reliable webpages (including the url and quotation) by our annotators.For automatic evaluation, our benchmark employs a rule-calibrated multi-dimensional LLM-as-Judge (CITATION) with Chain-of-Thought to generate explanations and final ratings as evaluations, ensuring high reliability and interpretability.All evaluation codes and data are publicly available at
https://github.com/THUDM/AlignBenchpdf
bib
abs
Learning Task Decomposition to Assist Humans in Competitive Programming
Jiaxin Wen
|
Ruiqi Zhong
|
Pei Ke
|
Zhihong Shao
|
Hongning Wang
|
Minlie Huang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
When using language models (LMs) to solve complex problems, humans might struggle to understand the LM-generated solutions and repair the flawed ones. To assist humans in repairing them, we propose to automatically decompose complex solutions into multiple simpler pieces that correspond to specific subtasks. We introduce a novel objective for learning task decomposition, termed assistive value (AssistV), which measures the feasibility and speed for humans to repair the decomposed solution. We collect a dataset of human repair experiences on different decomposed solutions. Utilizing the collected data as in-context examples, we then learn to critique, refine, and rank decomposed solutions to improve AssistV. We validate our method under competitive programming problems: under 177 hours of human study, our method enables non-experts to solve 33.3% more problems, speeds them up by 3.3x, and empowers them to match unassisted experts.
pdf
bib
abs
CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation
Pei Ke
|
Bosi Wen
|
Andrew Feng
|
Xiao Liu
|
Xuanyu Lei
|
Jiale Cheng
|
Shengyuan Wang
|
Aohan Zeng
|
Yuxiao Dong
|
Hongning Wang
|
Jie Tang
|
Minlie Huang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Since the natural language processing (NLP) community started to make large language models (LLMs) act as a critic to evaluate the quality of generated texts, most of the existing works train a critique generation model on the evaluation data labeled by GPT-4’s direct prompting. We observe that these models lack the ability to generate informative critiques in both pointwise grading and pairwise comparison especially without references. As a result, their generated critiques cannot provide fine-grained distinguishability on generated texts, causing unsatisfactory evaluation performance. In this paper, we propose a simple yet effective method called Eval-Instruct, which can first acquire pointwise grading critiques with pseudo references and then revise these critiques via multi-path prompting to obtain informative evaluation data in different tasks and settings, including pointwise grading and pairwise comparison with / without references. After fine-tuning on these data, the resulting model CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines and even achieve comparable evaluation performance to GPT-4 in system-level correlations of pointwise grading. We also demonstrate that our generated critiques can act as scalable feedback to further improve the generation quality of strong LLMs like ChatGPT.
2023
pdf
bib
abs
COFFEE: Counterfactual Fairness for Personalized Text Generation in Explainable Recommendation
Nan Wang
|
Qifan Wang
|
Yi-Chia Wang
|
Maziar Sanjabi
|
Jingzhou Liu
|
Hamed Firooz
|
Hongning Wang
|
Shaoliang Nie
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
As language models become increasingly integrated into our digital lives, Personalized Text Generation (PTG) has emerged as a pivotal component with a wide range of applications. However, the bias inherent in user written text, often used for PTG model training, can inadvertently associate different levels of linguistic quality with users’ protected attributes. The model can inherit the bias and perpetuate inequality in generating text w.r.t. users’ protected attributes, leading to unfair treatment when serving users. In this work, we investigate fairness of PTG in the context of personalized explanation generation for recommendations. We first discuss the biases in generated explanations and their fairness implications. To promote fairness, we introduce a general framework to achieve measure-specific counterfactual fairness in explanation generation. Extensive experiments and human evaluations demonstrate the effectiveness of our method.
2019
pdf
bib
abs
Adversarial Domain Adaptation for Machine Reading Comprehension
Huazheng Wang
|
Zhe Gan
|
Xiaodong Liu
|
Jingjing Liu
|
Jianfeng Gao
|
Hongning Wang
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
In this paper, we focus on unsupervised domain adaptation for Machine Reading Comprehension (MRC), where the source domain has a large amount of labeled data, while only unlabeled passages are available in the target domain. To this end, we propose an Adversarial Domain Adaptation framework (AdaMRC), where (i) pseudo questions are first generated for unlabeled passages in the target domain, and then (ii) a domain classifier is incorporated into an MRC model to predict which domain a given passage-question pair comes from. The classifier and the passage-question encoder are jointly trained using adversarial learning to enforce domain-invariant representation learning. Comprehensive evaluations demonstrate that our approach (i) is generalizable to different MRC models and datasets, (ii) can be combined with pre-trained large-scale language models (such as ELMo and BERT), and (iii) can be extended to semi-supervised learning.
2016
pdf
bib
Modeling Social Norms Evolution for Personalized Sentiment Classification
Lin Gong
|
Mohammad Al Boni
|
Hongning Wang
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
2015
pdf
bib
Model Adaptation for Personalized Opinion Analysis
Mohammad Al Boni
|
Keira Zhou
|
Hongning Wang
|
Matthew S. Gerber
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)
2011
pdf
bib
Structural Topic Model for Latent Topical Structure Analysis
Hongning Wang
|
Duo Zhang
|
ChengXiang Zhai
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies
2010
pdf
bib
Exploiting Structured Ontology to Organize Scattered Online Opinions
Yue Lu
|
Huizhong Duan
|
Hongning Wang
|
ChengXiang Zhai
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)