Haifeng Qian
2024
BASS: Batched Attention-optimized Speculative Sampling
Haifeng Qian
|
Sujan Kumar Gonugondla
|
Sungsoo Ha
|
Mingyue Shang
|
Sanjay Krishna Gouda
|
Ramesh Nallapati
|
Sudipta Sengupta
|
Xiaofei Ma
|
Anoop Deoras
Findings of the Association for Computational Linguistics ACL 2024
Speculative decoding has emerged as a powerful method to improve latency and throughput in hosting large language models. However, most existing implementations focus on generating a single sequence. Real-world generative AI applications often require multiple responses and how to perform speculative decoding in a batched setting while preserving its latency benefits poses non-trivial challenges. This paper describes a system of batched speculative decoding that sets a new state of the art in multi-sequence generation latency and that demonstrates superior GPU utilization as well as quality of generations within a time budget. For example, for a 7.8B-size model on a single A100 GPU and with a batch size of 8, each sequence is generated at an average speed of 5.8ms per token, the overall throughput being 1.1K tokens per second. These results represent state-of-the-art latency and a 2.15× speed-up over optimized regular decoding. Within a time budget that regular decoding does not finish, our system is able to generate sequences with HumanEval Pass@First of 43% and Pass@All of 61%, far exceeding what’s feasible with single-sequence speculative decoding. Our peak GPU utilization during decoding reaches as high as 15.8%, more than 3× the highest of that of regular decoding and around 10× of single-sequence speculative decoding.
2023
ReCode: Robustness Evaluation of Code Generation Models
Shiqi Wang
|
Zheng Li
|
Haifeng Qian
|
Chenghao Yang
|
Zijian Wang
|
Mingyue Shang
|
Varun Kumar
|
Samson Tan
|
Baishakhi Ray
|
Parminder Bhatia
|
Ramesh Nallapati
|
Murali Krishna Ramanathan
|
Dan Roth
|
Bing Xiang
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Code generation models have achieved impressive performance. However, they tend to be brittle as slight edits to a prompt could lead to very different generations; these robustness properties, critical for user experience when deployed in real-life applications, are not well understood. Most existing works on robustness in text or code tasks have focused on classification, while robustness in generation tasks is an uncharted area and to date there is no comprehensive benchmark for robustness in code generation. In this paper, we propose ReCode, a comprehensive robustness evaluation benchmark for code generation models. We customize over 30 transformations specifically for code on docstrings, function and variable names, code syntax, and code format. They are carefully designed to be natural in real-life coding practice, preserve the original semantic meaning, and thus provide multifaceted assessments of a model’s robustness performance. With human annotators, we verified that over 90% of the perturbed prompts do not alter the semantic meaning of the original prompt. In addition, we define robustness metrics for code generation models considering the worst-case behavior under each type of perturbation, taking advantage of the fact that executing the generated code can serve as objective evaluation. We demonstrate ReCode on SOTA models using HumanEval, MBPP, as well as function completion tasks derived from them. Interesting observations include: better robustness for CodeGen over InCoder and GPT-J; models are most sensitive to syntax perturbations; more challenging robustness evaluation on MBPP over HumanEval.
Search
Co-authors
- Mingyue Shang 2
- Ramesh Nallapati 2
- Sujan Kumar Gonugondla 1
- Sungsoo Ha 1
- Sanjay Krishna Gouda 1
- show all...