2024
LLM-QAT: Data-Free Quantization Aware Training for Large Language Models
Zechun Liu | Barlas Oguz | Changsheng Zhao | Ernie Chang | Pierre Stock | Yashar Mehdad | Yangyang Shi | Raghuraman Krishnamoorthi | Vikas Chandra
Findings of the Association for Computational Linguistics: ACL 2024
Several post-training quantization methods have been applied to large language models (LLMs), and have been shown to perform well down to 8 bits. We find that these methods break down at lower bit precision, and investigate quantization-aware training for LLMs (LLM-QAT) to push quantization levels even further. We propose a data-free distillation method that leverages generations produced by the pre-trained model, which better preserves the original output distribution and allows quantizing any generative model independent of its training data, similar to post-training quantization methods. In addition to quantizing weights and activations, we also quantize the KV cache, which is critical for increasing throughput and supporting long sequence dependencies at current model sizes. We experiment with LLaMA models of sizes 7B, 13B, and 30B, at quantization levels down to 4 bits. We observe large improvements over training-free methods, especially in the low-bit settings.
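To make the quantization-aware training idea concrete, below is a minimal sketch of a symmetric MinMax "fake" quantizer with a straight-through estimator, of the general kind such methods apply to weights, activations, and cached keys/values during training. The function and variable names are illustrative assumptions, not the paper's API.

```python
# Minimal fake-quantization sketch (assumed names, not the LLM-QAT codebase).
import torch

def fake_quantize(x: torch.Tensor, num_bits: int = 4, per_channel_dim: int | None = None) -> torch.Tensor:
    """Quantize-dequantize x to num_bits with symmetric MinMax scaling."""
    qmax = 2 ** (num_bits - 1) - 1
    if per_channel_dim is not None:
        # One scale per channel (e.g., per output row of a weight matrix).
        dims = [d for d in range(x.dim()) if d != per_channel_dim]
        max_abs = x.abs().amax(dim=dims, keepdim=True)
    else:
        max_abs = x.abs().max()
    scale = max_abs.clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale
    # Straight-through estimator: forward pass uses quantized values,
    # backward pass treats quantization as the identity.
    return x + (q - x).detach()

# During QAT the same operator can wrap weights, activations, and the KV cache.
w_q = fake_quantize(torch.randn(4096, 4096), num_bits=4, per_channel_dim=0)
```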
Mixture-of-Supernets: Improving Weight-Sharing Supernet Training with Architecture-Routed Mixture-of-Experts
Ganesh Jawahar | Haichuan Yang | Yunyang Xiong | Zechun Liu | Dilin Wang | Fei Sun | Meng Li | Aasish Pappu | Barlas Oguz | Muhammad Abdul-Mageed | Laks Lakshmanan | Raghuraman Krishnamoorthi | Vikas Chandra
Findings of the Association for Computational Linguistics: ACL 2024
Weight-sharing supernets are crucial for performance estimation in cutting-edge neural architecture search (NAS) frameworks. Despite their ability to generate diverse subnetworks without retraining, the quality of these subnetworks is not guaranteed due to weight sharing. In NLP tasks like machine translation and pre-trained language modeling, there is a significant performance gap between the supernet and training from scratch for the same model architecture, necessitating retraining after the optimal architecture is identified. This study introduces a solution called mixture-of-supernets, a generalized supernet formulation leveraging mixture-of-experts (MoE) to enhance supernet model expressiveness with minimal training overhead. Unlike conventional supernets, this method employs an architecture-based routing mechanism, enabling indirect sharing of model weights among subnetworks. This customization of weights for specific architectures, learned through gradient descent, minimizes retraining time, significantly enhancing training efficiency in NLP. The proposed method attains state-of-the-art (SoTA) performance in NAS for fast machine translation models, exhibiting a superior latency-BLEU tradeoff compared to HAT, the SoTA NAS framework for machine translation. Furthermore, it excels in NAS for building memory-efficient task-agnostic BERT models, surpassing NAS-BERT and AutoDistil across various model sizes. The code can be found at: https://github.com/UBC-NLP/MoS.
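The core idea of architecture-routed weight mixing can be sketched as a linear layer whose weights are a learned, architecture-conditioned combination of expert weight matrices, so each sampled subnetwork receives indirectly shared weights customized to its architecture. The router design, class, and argument names below are illustrative assumptions, not the released implementation.

```python
# Sketch of an architecture-routed MoE linear layer (assumed design details).
import torch
import torch.nn as nn

class ArchRoutedMoELinear(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, num_experts: int = 2, arch_dim: int = 8):
        super().__init__()
        # One full-size weight/bias per expert; subnetworks slice the mixed result.
        self.experts = nn.Parameter(torch.randn(num_experts, out_dim, in_dim) * 0.02)
        self.bias = nn.Parameter(torch.zeros(num_experts, out_dim))
        # Router maps an architecture encoding (e.g., chosen widths) to mixture weights.
        self.router = nn.Sequential(nn.Linear(arch_dim, num_experts), nn.Softmax(dim=-1))

    def forward(self, x: torch.Tensor, arch_code: torch.Tensor, out_width: int, in_width: int):
        alpha = self.router(arch_code)                      # mixture coefficients
        w = torch.einsum("e,eoi->oi", alpha, self.experts)  # architecture-specific weight
        b = torch.einsum("e,eo->o", alpha, self.bias)
        # Extract the subnetwork's slice of the mixed weight.
        return x[..., :in_width] @ w[:out_width, :in_width].t() + b[:out_width]

# Gradient descent trains both the experts and the router, so weights are
# effectively specialized per sampled architecture while still being shared.
layer = ArchRoutedMoELinear(512, 512)
y = layer(torch.randn(4, 128, 512), torch.randn(8), out_width=384, in_width=512)
```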
2023
Revisiting Sample Size Determination in Natural Language Understanding
Ernie Chang | Muhammad Hassan Rashid | Pin-Jie Lin | Changsheng Zhao | Vera Demberg | Yangyang Shi | Vikas Chandra
Findings of the Association for Computational Linguistics: ACL 2023
Knowing exactly how many data points need to be labeled to achieve a certain model performance is a hugely beneficial step towards reducing the overall budget for annotation. It pertains to both active learning and traditional data annotation, and is particularly beneficial for low-resource scenarios. Nevertheless, it remains a largely under-explored area of research in NLP. We therefore explored various techniques for estimating the training sample size necessary to achieve a targeted performance value. We derived a simple yet effective approach to predict the maximum achievable model performance based on a small number of training samples, which serves as an early indicator during data annotation for data quality and sample size determination. We performed ablation studies on four language understanding tasks, and showed that the proposed approach allows us to forecast model performance within a small margin of mean absolute error (~0.9%) with only 10% of the data.
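The general flavor of this kind of estimator is learning-curve extrapolation: fit a saturating curve to scores measured on a few small subsets, then read off the predicted ceiling and the sample size needed for a target score. The functional form and data points below are illustrative assumptions, not the paper's exact estimator or results.

```python
# Learning-curve extrapolation sketch (illustrative numbers and curve form).
import numpy as np
from scipy.optimize import curve_fit

# Accuracy observed when training on increasing numbers of labeled examples.
sizes = np.array([100, 250, 500, 1000, 2000], dtype=float)
accs = np.array([0.62, 0.70, 0.76, 0.80, 0.83])

def inv_power_law(n, a, b, c):
    # a = asymptotic (maximum achievable) performance; b, c shape the curve.
    return a - b * n ** (-c)

(a, b, c), _ = curve_fit(inv_power_law, sizes, accs, p0=[0.9, 1.0, 0.5], maxfev=10000)
print(f"predicted performance ceiling: {a:.3f}")

# Smallest n whose predicted accuracy reaches, say, 95% of the ceiling.
target = 0.95 * a
n_grid = np.arange(100, 50_001, 100, dtype=float)
needed = n_grid[inv_power_law(n_grid, a, b, c) >= target][0]
print(f"estimated samples needed: {int(needed)}")
```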
Towards Zero-Shot Multilingual Transfer for Code-Switched Responses
Ting-Wei Wu | Changsheng Zhao | Ernie Chang | Yangyang Shi | Pierce Chuang | Vikas Chandra | Biing Juang
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent task-oriented dialog systems have had great success in building English-based personal assistants, but extending these systems to a global audience is challenging due to the need for annotated data in the target language. An alternative approach is to leverage existing data in a high-resource language to enable cross-lingual transfer in low-resource language models. However, this type of transfer has not been widely explored in natural language response generation. In this research, we investigate the use of state-of-the-art multilingual models such as mBART and T5 to facilitate zero-shot and few-shot transfer of code-switched responses. We propose a new adapter-based framework that allows for efficient transfer by learning task-specific representations and encapsulating source and target language representations. Our framework is able to successfully transfer language knowledge even when the target language corpus is limited. We present both quantitative and qualitative analyses to evaluate the effectiveness of our approach.
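For readers unfamiliar with adapter-based transfer, below is a minimal sketch of the standard bottleneck adapter that frameworks of this kind build on: a small trainable module inserted into a frozen multilingual backbone (e.g., mBART or T5), so only the adapters are updated for the task and for the source/target languages. The class and argument names are illustrative, not the paper's code.

```python
# Bottleneck adapter sketch (assumed module design, not the paper's framework).
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim)
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.ReLU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen backbone's behavior recoverable.
        return hidden_states + self.up(self.act(self.down(self.norm(hidden_states))))

# Typical use: freeze the multilingual encoder-decoder, attach adapters per layer
# (task adapters plus language-specific ones), train them on the high-resource
# language, then swap/compose them for zero- or few-shot transfer.
adapter = BottleneckAdapter(hidden_dim=1024)
out = adapter(torch.randn(2, 16, 1024))
```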