2024
BASS: Batched Attention-optimized Speculative Sampling
Haifeng Qian | Sujan Kumar Gonugondla | Sungsoo Ha | Mingyue Shang | Sanjay Krishna Gouda | Ramesh Nallapati | Sudipta Sengupta | Xiaofei Ma | Anoop Deoras
Findings of the Association for Computational Linguistics: ACL 2024
Speculative decoding has emerged as a powerful method to improve latency and throughput in hosting large language models. However, most existing implementations focus on generating a single sequence. Real-world generative AI applications often require multiple responses, and performing speculative decoding in a batched setting while preserving its latency benefits poses non-trivial challenges. This paper describes a system of batched speculative decoding that sets a new state of the art in multi-sequence generation latency and demonstrates superior GPU utilization as well as quality of generations within a time budget. For example, for a 7.8B-size model on a single A100 GPU and with a batch size of 8, each sequence is generated at an average speed of 5.8 ms per token, for an overall throughput of 1.1K tokens per second. These results represent state-of-the-art latency and a 2.15× speed-up over optimized regular decoding. Within a time budget in which regular decoding does not finish, our system generates sequences with HumanEval Pass@First of 43% and Pass@All of 61%, far exceeding what is feasible with single-sequence speculative decoding. Our peak GPU utilization during decoding reaches as high as 15.8%, more than 3× the peak of regular decoding and around 10× that of single-sequence speculative decoding.
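The step that BASS batches, the draft-then-verify loop of speculative sampling, can be pictured with a small sketch. The array shapes, acceptance rule, and function names below describe standard speculative sampling and are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def verify_draft(p_target, p_draft, draft_tokens):
    """Accept/reject drafted tokens for one sequence (standard speculative sampling).

    p_target: [k+1, vocab] target-model probabilities at each drafted position
              plus the position after the draft; p_draft: [k, vocab] draft-model
              probabilities; draft_tokens: [k] drafted token ids.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        # Accept the drafted token with probability min(1, p_target / p_draft).
        if rng.random() < min(1.0, p_target[i, tok] / max(p_draft[i, tok], 1e-9)):
            out.append(int(tok))
        else:
            # On rejection, resample from the residual distribution and stop.
            residual = np.clip(p_target[i] - p_draft[i], 0.0, None)
            residual /= residual.sum()
            out.append(int(rng.choice(len(residual), p=residual)))
            return out
    # Every drafted token accepted: take one bonus token from the target model.
    out.append(int(rng.choice(p_target.shape[1], p=p_target[-1])))
    return out

def batched_verify(p_target_b, p_draft_b, drafts_b):
    """Verify a whole batch; sequences accept different numbers of tokens per
    step, which is the ragged-length problem a batched system must handle."""
    return [verify_draft(pt, pd, d)
            for pt, pd, d in zip(p_target_b, p_draft_b, drafts_b)]
```

In BASS the draft and verification passes themselves also run batched on the GPU, which is where the attention-optimized kernels in the paper's title come in.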
Token Alignment via Character Matching for Subword Completion
Ben Athiwaratkun | Shiqi Wang | Mingyue Shang | Yuchen Tian | Zijian Wang | Sujan Kumar Gonugondla | Sanjay Krishna Gouda | Robert Kwiatkowski | Ramesh Nallapati | Parminder Bhatia | Bing Xiang
Findings of the Association for Computational Linguistics: ACL 2024
Generative models, widely utilized in various applications, can often struggle with prompts corresponding to partial tokens. This struggle stems from tokenization, where partial tokens fall out of distribution during inference, leading to incorrect or nonsensical outputs. This paper examines a technique to alleviate this tokenization artifact in text completion with generative models while maintaining performance in regular non-subword cases. The method, termed token alignment, involves backtracking to the last complete tokens and ensuring that the model's generation aligns with the remaining characters of the prompt. This approach shows marked improvement across many partial-token scenarios, including nuanced cases like space-prefix and partial indentation, with only a minor increase in inference time. The technique and analysis detailed in this paper contribute to the continuous advancement of generative models in handling partial inputs, with relevance for applications like code and text completion.
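A minimal sketch of the backtrack-and-match idea under greedy decoding: re-tokenize the prompt without its last tokens, then mask the next-token distribution so only tokens consistent with the dropped characters can be generated. The `tokenizer.encode`/`tokenizer.decode` interface and the helper names are assumptions for illustration, not the paper's exact algorithm.

```python
import numpy as np

def backtrack_prompt(tokenizer, prompt: str, n_back: int = 1):
    """Drop the last n_back prompt tokens (they may be partial subwords) and
    keep the dropped characters as a constraint string for alignment."""
    ids = tokenizer.encode(prompt)
    kept, dropped = ids[:-n_back], ids[-n_back:]
    return kept, tokenizer.decode(dropped)

def align_logits(logits: np.ndarray, tokenizer, tail: str) -> np.ndarray:
    """Character matching: a candidate token is allowed only if its surface
    form is a prefix of the remaining tail, or extends past the whole tail."""
    masked = np.full_like(logits, -np.inf)
    for tok_id in range(logits.shape[0]):
        piece = tokenizer.decode([tok_id])
        if piece and (tail.startswith(piece) or piece.startswith(tail)):
            masked[tok_id] = logits[tok_id]
    return masked
```

Once the generated characters have consumed the whole tail, decoding proceeds unconstrained, which is why regular non-subword prompts are unaffected.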
2023
ReCode: Robustness Evaluation of Code Generation Models
Shiqi Wang | Zheng Li | Haifeng Qian | Chenghao Yang | Zijian Wang | Mingyue Shang | Varun Kumar | Samson Tan | Baishakhi Ray | Parminder Bhatia | Ramesh Nallapati | Murali Krishna Ramanathan | Dan Roth | Bing Xiang
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Code generation models have achieved impressive performance. However, they tend to be brittle as slight edits to a prompt could lead to very different generations; these robustness properties, critical for user experience when deployed in real-life applications, are not well understood. Most existing works on robustness in text or code tasks have focused on classification, while robustness in generation tasks is an uncharted area and to date there is no comprehensive benchmark for robustness in code generation. In this paper, we propose ReCode, a comprehensive robustness evaluation benchmark for code generation models. We customize over 30 transformations specifically for code on docstrings, function and variable names, code syntax, and code format. They are carefully designed to be natural in real-life coding practice, preserve the original semantic meaning, and thus provide multifaceted assessments of a model’s robustness performance. With human annotators, we verified that over 90% of the perturbed prompts do not alter the semantic meaning of the original prompt. In addition, we define robustness metrics for code generation models considering the worst-case behavior under each type of perturbation, taking advantage of the fact that executing the generated code can serve as objective evaluation. We demonstrate ReCode on SOTA models using HumanEval, MBPP, as well as function completion tasks derived from them. Interesting observations include: better robustness for CodeGen over InCoder and GPT-J; models are most sensitive to syntax perturbations; more challenging robustness evaluation on MBPP over HumanEval.
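The worst-case robustness metric can be pictured with a short sketch: a problem counts as robustly solved under a perturbation type only if the generated code passes its tests on the original prompt and on every perturbed variant. The data layout and callables below are assumptions for illustration, not ReCode's actual interfaces.

```python
from typing import Callable, Dict, List

def worst_case_pass_rate(
    problems: List[dict],
    perturbations: Dict[str, Callable[[str], List[str]]],
    passes_tests: Callable[[str, dict], bool],
) -> Dict[str, float]:
    """Per-perturbation worst-case pass rate.

    problems: items containing at least a "prompt" and executable unit tests;
    perturbations: name -> function returning perturbed prompt variants;
    passes_tests(prompt, problem): generate code for the prompt, execute the
    problem's tests, and report success.
    """
    rates = {}
    for name, perturb in perturbations.items():
        solved = 0
        for problem in problems:
            variants = [problem["prompt"]] + perturb(problem["prompt"])
            # Worst case: all variants must pass for the problem to count.
            if all(passes_tests(v, problem) for v in variants):
                solved += 1
        rates[name] = solved / max(len(problems), 1)
    return rates
```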
Few-Shot Data-to-Text Generation via Unified Representation and Multi-Source Learning
Alexander Hanbo Li | Mingyue Shang | Evangelia Spiliopoulou | Jie Ma | Patrick Ng | Zhiguo Wang | Bonan Min | William Yang Wang | Kathleen McKeown | Vittorio Castelli | Dan Roth | Bing Xiang
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
In this paper, we present a novel approach for data-to-text generation that addresses the limitations of current methods that primarily focus on specific types of structured data. Our proposed method aims to improve performance in multi-task training, zero-shot and few-shot scenarios by providing a unified representation that can handle various forms of structured data such as tables, knowledge graph triples, and meaning representations. We demonstrate that our proposed approach can effectively adapt to new structured forms, and can improve performance in comparison to current methods. For example, our method resulted in a 66% improvement in zero-shot BLEU scores when transferring models trained on table inputs to a knowledge graph dataset. Our proposed method is an important step towards a more general data-to-text generation framework.
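One way to picture a unified representation is to linearize heterogeneous structured inputs (tables, knowledge-graph triples, meaning representations) into one shared textual format before seq2seq training. The bracketed markers below are an illustrative guess at such a format, not the paper's exact scheme.

```python
def linearize_triples(triples):
    """Knowledge-graph triples -> unified text."""
    return " ".join(f"[head] {h} [rel] {r} [tail] {t}" for h, r, t in triples)

def linearize_table(header, rows):
    """Table -> unified text: every cell becomes a (column, value) pair."""
    cells = []
    for row in rows:
        cells.extend(f"[col] {c} [val] {v}" for c, v in zip(header, row))
    return " ".join(cells)

# Both sources land in the same textual space, so one model can be trained
# jointly and transferred zero-shot across structured-data formats.
kg_input = linearize_triples([("Alan_Turing", "birthPlace", "London")])
table_input = linearize_table(["name", "birthPlace"], [["Alan Turing", "London"]])
```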
2021
Entity Resolution in Open-domain Conversations
Mingyue Shang | Tong Wang | Mihail Eric | Jiangning Chen | Jiyang Wang | Matthew Welch | Tiantong Deng | Akshay Grewal | Han Wang | Yue Liu | Yang Liu | Dilek Hakkani-Tur
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers
In recent years, incorporating external knowledge for response generation in open-domain conversation systems has attracted great interest. To improve the relevancy of retrieved knowledge, we propose a neural entity linking (NEL) approach. Unlike formal documents such as news, conversational utterances are informal and multi-turn, which makes entity disambiguation more challenging. Therefore, we present a context-aware named entity recognition (NER) model and an entity resolution (ER) model that utilize dialogue context information. We conduct NEL experiments on three open-domain conversation datasets and validate that incorporating context information improves the performance of the NER and ER models. The end-to-end NEL approach outperforms the baseline by 62.8% relative in F1. Furthermore, we verify that using external knowledge retrieved via NEL benefits the neural response generation model.
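A toy sketch of what context-aware linking means in this setting: mentions detected in the latest turn are resolved against a knowledge base while earlier dialogue turns supply disambiguating context. The similarity scorer and KB layout below are stand-ins for the paper's neural NER and ER models, used only to show where the context enters the decision.

```python
from difflib import SequenceMatcher

def link_entities(turns, mentions, kb):
    """Resolve mentions from the last turn against a KB using dialogue context.

    turns: utterance strings, oldest first; mentions: spans found by an NER
    model in turns[-1]; kb: {entity_id: description string}.
    """
    context_words = set(" ".join(turns).lower().split())
    links = {}
    for mention in mentions:
        def score(entity_id):
            desc = kb[entity_id].lower()
            surface = SequenceMatcher(None, mention.lower(), desc).ratio()
            overlap = len(context_words & set(desc.split()))
            return surface + 0.1 * overlap   # context breaks surface-form ties
        links[mention] = max(kb, key=score)
    return links
```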
2019
Who Is Speaking to Whom? Learning to Identify Utterance Addressee in Multi-Party Conversations
Ran Le | Wenpeng Hu | Mingyue Shang | Zhenjun You | Lidong Bing | Dongyan Zhao | Rui Yan
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
Previous research on dialogue systems generally focuses on the conversation between two participants, yet multi-party conversations which involve more than two participants within one session bring up a more complicated but realistic scenario. In real multi-party conversations, we can observe who is speaking, but the addressee information is not always explicit. In this paper, we aim to tackle the challenge of identifying all the missing addressees in a conversation session. To this end, we introduce a novel who-to-whom (W2W) model which models users and utterances in the session jointly in an interactive way. We conduct experiments on the benchmark Ubuntu Multi-Party Conversation Corpus and the experimental results demonstrate that our model outperforms baselines with consistent improvements.
Semi-supervised Text Style Transfer: Cross Projection in Latent Space
Mingyue Shang | Piji Li | Zhenxin Fu | Lidong Bing | Dongyan Zhao | Shuming Shi | Rui Yan
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
The text style transfer task requires the model to transfer a sentence of one style to another style while retaining its original content meaning, a challenging problem that has long suffered from a shortage of parallel data. In this paper, we propose a semi-supervised text style transfer model that combines small-scale parallel data with large-scale nonparallel data. With these two types of training data, we introduce a projection function between the latent spaces of the different styles and design two constraints to train it. We also introduce two other simple but effective semi-supervised methods for comparison. To evaluate the performance of the proposed methods, we build and release a novel style transfer dataset that converts sentences between the style of ancient Chinese poetry and modern Chinese.
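A toy PyTorch sketch of the cross-projection idea: each style has its own encoder, a learned projection maps latent codes of one style onto the other style's latent space, and the small parallel set supervises that projection (nonparallel data would add reconstruction or cycle-style terms). The architecture and loss below are illustrative assumptions, not the paper's exact model.

```python
import torch
import torch.nn as nn

class CrossProjection(nn.Module):
    """Toy latent cross-projection between two styles A and B."""

    def __init__(self, vocab=1000, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.enc_a = nn.GRU(dim, dim, batch_first=True)
        self.enc_b = nn.GRU(dim, dim, batch_first=True)
        self.proj_ab = nn.Linear(dim, dim)   # latent(A) -> latent(B)
        self.proj_ba = nn.Linear(dim, dim)   # latent(B) -> latent(A)

    def encode(self, tokens, style):
        # tokens: [batch, seq] token ids; returns a [batch, dim] latent code.
        _, h = (self.enc_a if style == "a" else self.enc_b)(self.embed(tokens))
        return h[-1]

    def projection_losses(self, tok_a, tok_b):
        """On parallel pairs, the projected latent of one style should match
        the encoded latent of the other style."""
        za, zb = self.encode(tok_a, "a"), self.encode(tok_b, "b")
        return (nn.functional.mse_loss(self.proj_ab(za), zb)
                + nn.functional.mse_loss(self.proj_ba(zb), za))
```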