2024
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track)
Yi Yang | Aida Davani | Avi Sil | Anoop Kumar
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track)
Leveraging LLMs for Dialogue Quality Measurement
Jinghan Jia | Abi Komma | Timothy Leffel | Xujun Peng | Ajay Nagesh | Tamer Soliman | Aram Galstyan | Anoop Kumar
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track)
In task-oriented conversational AI evaluation, unsupervised methods correlate poorly with human judgments, and supervised approaches lack generalization. Recent advances in large language models (LLMs) show robust zero- and few-shot capabilities across NLP tasks. Our paper explores using LLMs for automated dialogue quality evaluation, experimenting with various configurations on public and proprietary datasets. Manipulating factors such as model size, in-context examples, and selection techniques, we examine “chain-of-thought” (CoT) reasoning and label extraction procedures. Our results show that (1) larger models yield more accurate dialogue labels; (2) algorithmic selection of in-context examples outperforms random selection; (3) CoT reasoning, where an LLM is asked to provide justifications before outputting final labels, improves performance; and (4) fine-tuned LLMs outperform out-of-the-box ones. In addition, we find that suitably tuned LLMs exhibit high accuracy in dialogue evaluation compared to human judgments.
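A minimal sketch of the evaluation setup the abstract describes, assuming embedding-similarity selection of in-context examples and a "reason first, label last" CoT prompt; the function names, prompt format, and GOOD/BAD label set are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of LLM-based dialogue quality scoring with
# similarity-based in-context example selection and CoT prompting.
import numpy as np

def select_examples(query_vec, pool_vecs, pool_examples, k=4):
    """Pick the k labeled dialogues whose embeddings are closest to the query."""
    sims = pool_vecs @ query_vec / (
        np.linalg.norm(pool_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    top = np.argsort(-sims)[:k]
    return [pool_examples[i] for i in top]

def build_prompt(dialogue, examples):
    """Assemble few-shot exemplars, then ask for reasoning before the label."""
    shots = "\n\n".join(
        f"Dialogue:\n{ex['text']}\nReasoning: {ex['rationale']}\nLabel: {ex['label']}"
        for ex in examples)
    return (f"{shots}\n\nDialogue:\n{dialogue}\n"
            "First explain your reasoning, then end with 'Label: GOOD' or 'Label: BAD'.")

def extract_label(completion):
    """Keep only the last 'Label:' line so the CoT text before it is ignored."""
    for line in reversed(completion.strip().splitlines()):
        if line.lower().startswith("label:"):
            return line.split(":", 1)[1].strip().upper()
    return "UNKNOWN"
```

The LLM call itself is left abstract; the point of extract_label is that the label is parsed from the tail of the response, after the model's justification.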
Correcting Language Model Outputs by Editing Salient Layers
Kshitij Mishra | Tamer Soliman | Anil Ramakrishna | Aram Galstyan | Anoop Kumar
Findings of the Association for Computational Linguistics: EACL 2024
Large language models can accumulate incorrect or outdated knowledge as the real world evolves. Compared to typical solutions such as retraining and retrieval-augmented generation, model editing offers an effective yet low-cost way to address this issue. However, existing model editing algorithms rely on manual selection of edit layers, which requires prior domain knowledge or expensive architecture-specific empirical layer-selection methods such as causal tracing. In this work, we propose SaLEM (Salient Layers Editing Model), an efficient solution for data-driven layer selection in the model editing task. Our solution uses layer-wise saliency maps for layer selection and matches the accuracy of prior approaches with only 1/3 of their edits, enabling efficient updates to the parametric knowledge in large language models.
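A hedged sketch of the kind of layer-wise saliency signal described above, assuming a PyTorch model that exposes its transformer blocks as model.layers; scoring each block by the mean gradient magnitude of the edit loss is an illustrative choice, not necessarily SaLEM's exact criterion.

```python
import torch

def rank_layers_by_saliency(model, loss_fn, edit_batch):
    """Score each transformer block by the average gradient magnitude of the
    edit loss w.r.t. its parameters; higher score = more salient for editing."""
    model.zero_grad()
    loss = loss_fn(model, edit_batch)   # loss on the facts we want to change
    loss.backward()
    scores = []
    for idx, layer in enumerate(model.layers):   # assumes the block list is exposed this way
        grads = [p.grad.abs().mean() for p in layer.parameters() if p.grad is not None]
        score = torch.stack(grads).mean().item() if grads else 0.0
        scores.append((idx, score))
    return sorted(scores, key=lambda s: s[1], reverse=True)

# e.g. edit only the single most salient block:
# edit_layers = [idx for idx, _ in rank_layers_by_saliency(model, loss_fn, batch)[:1]]
```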
Prompt Perturbation Consistency Learning for Robust Language Models
Yao Qiang | Subhrangshu Nandi | Ninareh Mehrabi | Greg Ver Steeg | Anoop Kumar | Anna Rumshisky | Aram Galstyan
Findings of the Association for Computational Linguistics: EACL 2024
Large language models (LLMs) have demonstrated impressive performance on a number of natural language processing tasks, such as question answering and text summarization. However, their performance on sequence labeling tasks such as intent classification and slot filling (IC-SF), a central component of personal assistant systems, lags significantly behind discriminative models. Furthermore, there is a lack of substantive research on the robustness of LLMs to various perturbations in the input prompts. The contributions of this paper are three-fold. First, we show that fine-tuning sufficiently large LLMs can produce IC-SF performance comparable to discriminative models. Next, we systematically analyze the performance deterioration of those fine-tuned models under three distinct yet related types of input perturbations: oronyms, synonyms, and paraphrasing. Finally, we propose an efficient mitigation approach, Prompt Perturbation Consistency Learning (PPCL), which works by regularizing the divergence between losses from clean and perturbed samples. Our experiments show that PPCL can recover, on average, 59% and 69% of the performance drop for IC and SF tasks, respectively. Furthermore, PPCL beats the data-augmentation approach while using ten times fewer augmented data samples.
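A minimal sketch of the consistency-learning objective the abstract describes: the task loss is computed on both the clean and the perturbed prompt, and a regularizer penalizes divergence between the two predictive distributions. The symmetric-KL form and the weighting term lam are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def ppcl_style_loss(clean_logits, perturbed_logits, labels, lam=1.0):
    """Task loss on clean and perturbed inputs plus a consistency regularizer
    that pulls the two predictive distributions together."""
    task = F.cross_entropy(clean_logits, labels) + F.cross_entropy(perturbed_logits, labels)
    p = F.log_softmax(clean_logits, dim=-1)
    q = F.log_softmax(perturbed_logits, dim=-1)
    # symmetric KL between clean and perturbed predictions (one possible choice)
    consistency = 0.5 * (
        F.kl_div(q, p, log_target=True, reduction="batchmean")
        + F.kl_div(p, q, log_target=True, reduction="batchmean"))
    return task + lam * consistency
```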
Agenda-Driven Question Generation: A Case Study in the Courtroom Domain
Yi Fung | Anoop Kumar | Aram Galstyan | Heng Ji | Prem Natarajan
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
This paper introduces a novel problem of automated question generation for courtroom examinations, CourtQG. While question generation has been studied in domains such as educational testing and product description, CourtQG poses several unique challenges owing to its non-cooperative and agenda-driven nature. Specifically, not only do the generated questions need to be relevant to the case and the underlying context, they also have to achieve certain objectives, such as challenging the opponent’s arguments and/or revealing potential inconsistencies in their answers. We propose to leverage large language models (LLMs) for CourtQG by fine-tuning them on two auxiliary tasks: agenda explanation (i.e., uncovering the underlying intents) and question type prediction. We additionally propose cold-start generation of questions from background documents without relying on examination history. We construct a dataset to evaluate our proposed method and show that it generates better questions, according to standard metrics, than several baselines.
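A small sketch, under assumed conventions, of how training instances for the main generation task and the two auxiliary tasks could be formatted for joint fine-tuning of a single model; the task-prefix names and field layout are hypothetical, not the dataset's actual schema.

```python
def format_courtqg_example(task, context, target):
    """Prefix each training instance with its task so one model can be fine-tuned
    jointly on question generation and the two auxiliary tasks."""
    assert task in {"generate_question", "explain_agenda", "predict_question_type"}
    return {"input": f"{task}: {context}", "target": target}
```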
Proceedings of the 4th Workshop on Trustworthy Natural Language Processing (TrustNLP 2024)
Anaelia Ovalle | Kai-Wei Chang | Yang Trista Cao | Ninareh Mehrabi | Jieyu Zhao | Aram Galstyan | Jwala Dhamala | Anoop Kumar | Rahul Gupta
Proceedings of the 4th Workshop on Trustworthy Natural Language Processing (TrustNLP 2024)
2023
Measuring and Mitigating Local Instability in Deep Neural Networks
Arghya Datta | Subhrangshu Nandi | Jingcheng Xu | Greg Ver Steeg | He Xie | Anoop Kumar | Aram Galstyan
Findings of the Association for Computational Linguistics: ACL 2023
Deep Neural Networks (DNNs) are becoming integral components of real-world services relied upon by millions of users. Unfortunately, architects of these systems can find it difficult to ensure reliable performance, as irrelevant details like random initialization can unexpectedly change the outputs of a trained system with potentially disastrous consequences. We formulate the model stability problem by studying how the predictions of a model change, even when it is retrained on the same data, as a consequence of stochasticity in the training process. For Natural Language Understanding (NLU) tasks, we find instability in predictions for a significant fraction of queries. We formulate principled metrics, such as per-sample “label entropy” across training runs or within a single training run, to quantify this phenomenon. Intriguingly, we find that unstable predictions do not appear at random but rather appear to be clustered in data-specific ways. We study data-agnostic regularization methods to improve stability and propose new data-centric methods that exploit our local stability estimates. We find that our localized, data-specific mitigation strategy dramatically outperforms data-agnostic methods and comes within 90% of the gold standard achieved by ensembling, at a fraction of the computational cost.
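An illustrative sketch of the per-sample "label entropy" metric mentioned above: given the labels one query receives across several retrained models, compute the entropy of the empirical label distribution, so 0 means perfectly stable and larger values flag unstable examples. The exact estimator used in the paper may differ.

```python
import math
from collections import Counter

def label_entropy(labels_across_runs):
    """Entropy (in bits) of the empirical distribution of predicted labels
    for one example across independent training runs."""
    counts = Counter(labels_across_runs)
    n = len(labels_across_runs)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# e.g. the same query predicted by 5 retrained models:
# label_entropy(["play_music"] * 5)                                  -> 0.0   (stable)
# label_entropy(["play_music", "play_radio"] * 2 + ["play_music"])   -> ~0.97 (unstable)
```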
Neural Architecture Search for Parameter-Efficient Fine-tuning of Large Pre-trained Language Models
Neal Lawton | Anoop Kumar | Govind Thattai | Aram Galstyan | Greg Ver Steeg
Findings of the Association for Computational Linguistics: ACL 2023
Parameter-efficient tuning (PET) methods fit pre-trained language models (PLMs) to downstream tasks by either computing a small compressed update for a subset of model parameters, or appending and fine-tuning a small number of new model parameters to the pre-trained network. Hand-designed PET architectures from the literature perform well in practice, but have the potential to be improved via automated neural architecture search (NAS). We propose an efficient NAS method for learning PET architectures via structured and unstructured pruning. We present experiments on GLUE demonstrating the effectiveness of our algorithm and discuss how PET architectural design choices affect performance in practice.
Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023)
Anaelia Ovalle | Kai-Wei Chang | Ninareh Mehrabi | Yada Pruksachatkun | Aram Galstyan | Jwala Dhamala | Apurv Verma | Trista Cao | Anoop Kumar | Rahul Gupta
Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023)
ParaAMR: A Large-Scale Syntactically Diverse Paraphrase Dataset by AMR Back-Translation
Kuan-Hao Huang | Varun Iyer | I-Hung Hsu | Anoop Kumar | Kai-Wei Chang | Aram Galstyan
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Paraphrase generation is a long-standing task in natural language processing (NLP). Supervised paraphrase generation models, which rely on human-annotated paraphrase pairs, are cost-inefficient and hard to scale up. On the other hand, automatically annotated paraphrase pairs (e.g., from machine back-translation) usually suffer from a lack of syntactic diversity: the generated paraphrase sentences are very similar to the source sentences in terms of syntax. In this work, we present ParaAMR, a large-scale syntactically diverse paraphrase dataset created by Abstract Meaning Representation (AMR) back-translation. Our quantitative analysis, qualitative examples, and human evaluation demonstrate that the paraphrases of ParaAMR are syntactically more diverse than those of existing large-scale paraphrase datasets while preserving good semantic similarity. In addition, we show that ParaAMR can be used to improve three NLP tasks: learning sentence embeddings, syntactically controlled paraphrase generation, and data augmentation for few-shot learning. Our results thus showcase the potential of ParaAMR for improving various NLP applications.
2022
Unsupervised Syntactically Controlled Paraphrase Generation with Abstract Meaning Representations
Kuan-Hao Huang | Varun Iyer | Anoop Kumar | Sriram Venkatapathy | Kai-Wei Chang | Aram Galstyan
Findings of the Association for Computational Linguistics: EMNLP 2022
Syntactically controlled paraphrase generation has become an emerging research direction in recent years. Most existing approaches require annotated paraphrase pairs for training and are thus costly to extend to new domains. Unsupervised approaches, on the other hand, do not need paraphrase pairs but suffer from relatively poor performance in terms of syntactic control and quality of generated paraphrases. In this paper, we demonstrate that leveraging Abstract Meaning Representations (AMR) can greatly improve the performance of unsupervised syntactically controlled paraphrase generation. Our proposed model, AMR-enhanced Paraphrase Generator (AMRPG), separately encodes the AMR graph and the constituency parse of the input sentence into two disentangled semantic and syntactic embeddings. A decoder is then trained to reconstruct the input sentence from the semantic and syntactic embeddings. Our experiments show that AMRPG generates more accurate syntactically controlled paraphrases, both quantitatively and qualitatively, than existing unsupervised approaches. We also demonstrate that the paraphrases generated by AMRPG can be used for data augmentation to improve the robustness of NLP models.
Temporal Generalization for Spoken Language Understanding
Judith Gaspers | Anoop Kumar | Greg Ver Steeg | Aram Galstyan
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Track
Spoken Language Understanding (SLU) models in industry applications are usually trained offline on historic data, but have to perform well on incoming user requests after deployment. Since the application data is not available at training time, this is formally similar to the domain generalization problem, where domains correspond to different temporal segments of the data, and the goal is to build a model that performs well on unseen domains, e.g., upcoming data. In this paper, we explore different strategies for achieving good temporal generalization, including instance weighting, temporal fine-tuning, learning temporal features and building a temporally-invariant model. Our results on data of large-scale SLU systems show that temporal information can be leveraged to improve temporal generalization for SLU models.
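As one concrete illustration of the instance-weighting strategy listed above, older training examples can be down-weighted relative to recent ones so the trained model leans toward the most recent data distribution; the exponential decay and the 90-day half-life below are illustrative assumptions, not the paper's exact scheme.

```python
import math

def temporal_instance_weights(timestamps, reference_time, half_life_days=90.0):
    """Exponentially down-weight training examples by age (in days) relative to
    a reference time; a 90-day half-life is an arbitrary illustrative choice."""
    weights = []
    for t in timestamps:
        age_days = (reference_time - t) / 86400.0   # timestamps in seconds
        weights.append(0.5 ** (age_days / half_life_days))
    return weights

# These weights would then scale each example's loss term during training,
# e.g. loss = sum(w_i * per_example_loss_i) / sum(w_i).
```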
Zero-Shot Cross-Lingual Sequence Tagging as Seq2Seq Generation for Joint Intent Classification and Slot Filling
Fei Wang | Kuan-Hao Huang | Anoop Kumar | Aram Galstyan | Greg Ver Steeg | Kai-Wei Chang
Proceedings of the Massively Multilingual Natural Language Understanding Workshop (MMNLU-22)
The joint intent classification and slot filling task seeks to detect the intent of an utterance and extract its semantic concepts. In the zero-shot cross-lingual setting, a model is trained on a source language and then transferred to other target languages through multilingual representations without additional training data. While prior studies show that pre-trained multilingual sequence-to-sequence (Seq2Seq) models can facilitate zero-shot transfer, there is little understanding of how to design the output template for the joint prediction tasks. In this paper, we examine three aspects of the output template: (1) label mapping, (2) task dependency, and (3) word order. Experiments on the MASSIVE dataset, which covers 51 languages, show that our output template significantly improves the performance of pre-trained cross-lingual language models.
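A hedged illustration of what a linearized output template for joint intent classification and slot filling might look like, with the intent generated before the slots (task dependency) and slot values kept in source word order; the exact template studied in the paper may differ.

```python
def linearize_ic_sf(intent, slots):
    """Turn an intent label and a list of (slot_name, value) pairs into a single
    target string a Seq2Seq model can generate: intent first, then slots in
    source word order."""
    parts = [f"intent: {intent}"]
    parts += [f"{name} = {value}" for name, value in slots]
    return " ; ".join(parts)

# linearize_ic_sf("play_music", [("artist", "coldplay"), ("song", "fix you")])
# -> "intent: play_music ; artist = coldplay ; song = fix you"
```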
2020
Evaluating the Effectiveness of Efficient Neural Architecture Search for Sentence-Pair Tasks
Ansel MacLaughlin | Jwala Dhamala | Anoop Kumar | Sriram Venkatapathy | Ragav Venkatesan | Rahul Gupta
Proceedings of the First Workshop on Insights from Negative Results in NLP
Neural Architecture Search (NAS) methods, which automatically learn entire neural model architectures or individual neural cell architectures, have recently achieved competitive or state-of-the-art (SOTA) performance on a variety of natural language processing and computer vision tasks, including language modeling, natural language inference, and image classification. In this work, we explore the applicability of a SOTA NAS algorithm, Efficient Neural Architecture Search (ENAS) (Pham et al., 2018), to two sentence-pair tasks, paraphrase detection and semantic textual similarity. We use ENAS to perform a micro-level search and learn a task-optimized RNN cell architecture as a drop-in replacement for an LSTM. We explore the effectiveness of ENAS through experiments on three datasets (MRPC, SICK, STS-B), with two different models (ESIM, BiLSTM-Max) and two sets of embeddings (GloVe, BERT). In contrast to prior work applying ENAS to NLP tasks, our results are mixed: we find that ENAS architectures sometimes, but not always, outperform LSTMs and perform similarly to random architecture search.