Yuchen Zhang


2024

What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?
Yan Zeng | Hanbo Zhang | Jiani Zheng | Jiangnan Xia | Guoqiang Wei | Yang Wei | Yuchen Zhang | Tao Kong | Ruihua Song
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Recent advancements in GPT-4V have displayed remarkable multi-modal capabilities in processing image inputs and following open-ended instructions. Despite these advancements, there is considerable scope for enhancing open-source multi-modal LLMs, especially in terms of multi-modal understanding accuracy and instruction-following proficiency. In this paper, we conduct a comprehensive study on training GPT4-style models. We introduce Lynx, a multi-modal LLM developed through a series of controlled experiments comparing various model variants. This process allowed us to identify and implement an optimal training strategy tailored for multi-modal LLMs. In addition to our model development, we propose a plug-and-play technique designed to augment the instruction-following capabilities of multi-modal LLMs. We have validated the performance of Lynx on multiple benchmarks. Results demonstrate that Lynx not only achieves strong image understanding accuracy but also excels in instruction-following tasks, paving the path for ongoing enhancements in multi-modal LLMs.

Are AI-Generated Text Detectors Robust to Adversarial Perturbations?
Guanhua Huang | Yuchen Zhang | Zhe Li | Yongjian You | Mingze Wang | Zhouwang Yang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The widespread use of large language models (LLMs) has sparked concerns about the potential misuse of AI-generated text, as these models can produce content that closely resembles human-generated text. Current detectors for AI-generated text (AIGT) lack robustness against adversarial perturbations, with even minor changes in characters or words causing a reversal in distinguishing between human-created and AI-generated text. This paper investigates the robustness of existing AIGT detection methods and introduces a novel detector, the Siamese Calibrated Reconstruction Network (SCRN). The SCRN employs a reconstruction network to add and remove noise from text, extracting a semantic representation that is robust to local perturbations. We also propose a siamese calibration technique to train the model to make equally confident predictions under different noise, which improves the model’s robustness against adversarial perturbations. Experiments on four publicly available datasets show that the SCRN outperforms all baseline methods, achieving 6.5%-18.25% absolute accuracy improvement over the best baseline method under adversarial attacks. Moreover, it exhibits superior generalizability in cross-domain, cross-genre, and mixed-source scenarios. The code is available at https://github.com/CarlanLark/Robust-AIGC-Detector.

2023

Rethinking Masked Language Modeling for Chinese Spelling Correction
Hongqiu Wu | Shaohua Zhang | Yuchen Zhang | Hai Zhao
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In this paper, we study Chinese Spelling Correction (CSC) as a joint decision made by two separate models: a language model and an error model. Through empirical analysis, we find that fine-tuning BERT tends to over-fit the error model while under-fitting the language model, resulting in poor generalization to out-of-distribution error patterns. Given that BERT is the backbone of most CSC models, this phenomenon has a significant negative impact. To address this issue, we are releasing a multi-domain benchmark LEMON, with higher quality and diversity than existing benchmarks, to allow a comprehensive assessment of the open domain generalization of CSC models. Then, we demonstrate that a very simple strategy – randomly masking 20% non-error tokens from the input sequence during fine-tuning – is sufficient for learning a much better language model without sacrificing the error model. This technique can be applied to any model architecture and achieves new state-of-the-art results on SIGHAN, ECSpell, and LEMON.
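
The masking strategy described in this abstract can be illustrated with a minimal sketch. It assumes the source and target sentences are aligned token-for-token, that a non-error token is one where source and target agree, and that each such token is masked independently with probability 0.2. The abstract does not say whether the 20% is applied per token or as an exact fraction, so that choice is an assumption here. The function name `mask_non_error_tokens` and the `[MASK]` string are hypothetical and not taken from the paper's code.

```python
import random


def mask_non_error_tokens(source, target, mask_rate=0.2,
                          mask_token="[MASK]", rng=None):
    """Return a copy of `source` with some non-error tokens masked.

    Hypothetical helper: a token counts as an error when it differs
    from the aligned target token. Error tokens are always kept, so
    the error signal stays visible to the model.
    """
    if len(source) != len(target):
        raise ValueError("source and target must be aligned token-for-token")
    rng = rng or random.Random()
    masked = []
    for src_tok, tgt_tok in zip(source, target):
        is_error = src_tok != tgt_tok
        # Assumption: independent per-token masking at `mask_rate`.
        if not is_error and rng.random() < mask_rate:
            masked.append(mask_token)
        else:
            masked.append(src_tok)
    return masked


if __name__ == "__main__":
    src = list("我今天很高心")  # misspelled final character
    tgt = list("我今天很高兴")
    print(mask_non_error_tokens(src, tgt, rng=random.Random(0)))
```

In this sketch the training target is left unchanged, so the model has to predict the original token at each masked position from context alone.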

An Exploratory Study on Model Compression for Text-to-SQL
Shuo Sun | Yuze Gao | Yuchen Zhang | Jian Su | Bin Chen | Yingzhan Lin | Shuqi Sun
Findings of the Association for Computational Linguistics: ACL 2023

Text-to-SQL translates user queries into SQL statements that can retrieve relevant answers from relational databases. Recent approaches to Text-to-SQL rely on pre-trained language models that are computationally expensive and technically challenging to deploy in real-world applications that require real-time or on-device processing capabilities. In this paper, we perform a focused study on the feasibility of applying recent model compression techniques to sketch-based and sequence-to-sequence Text-to-SQL models. Our results reveal that sketch-based Text-to-SQL models generally have higher inference efficiency and respond better to model compression than sequence-to-sequence models, making them ideal for real-world deployments, especially in use cases with simple SQL statements.

Battle of the Large Language Models: Dolly vs LLaMA vs Vicuna vs Guanaco vs Bard vs ChatGPT - A Text-to-SQL Parsing Comparison
Shuo Sun | Yuchen Zhang | Jiahuan Yan | Yuze Gao | Donovan Ong | Bin Chen | Jian Su
Findings of the Association for Computational Linguistics: EMNLP 2023

The success of ChatGPT has ignited an AI race, with researchers striving to develop new large language models (LLMs) that can match or surpass the language understanding and generation abilities of commercial ones. In recent times, a number of models have emerged, claiming performance near that of GPT-3.5 or GPT-4 through various instruction-tuning methods. As practitioners of Text-to-SQL parsing, we are grateful for their valuable contributions to open-source research. However, it is important to approach these claims with a sense of scrutiny and ascertain the actual effectiveness of these models. Therefore, we pit six popular large language models against each other, systematically evaluating their Text-to-SQL parsing capability on nine benchmark datasets with five different prompting strategies, covering both zero-shot and few-shot scenarios. Regrettably, the open-sourced models fell significantly short of the performance achieved by closed-source models like GPT-3.5, highlighting the need for further work to bridge the performance gap between these models.

Unsupervised Grammatical Error Correction Rivaling Supervised Methods
Hannan Cao | Liping Yuan | Yuchen Zhang | Hwee Tou Ng
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

State-of-the-art grammatical error correction (GEC) systems rely on parallel training data (ungrammatical sentences and their manually corrected counterparts), which are expensive to construct. In this paper, we employ the Break-It-Fix-It (BIFI) method to build an unsupervised GEC system. The BIFI framework generates parallel data from unlabeled text using a fixer to transform ungrammatical sentences into grammatical ones, and a critic to predict sentence grammaticality. We present an unsupervised approach to build the fixer and the critic, and an algorithm that allows them to iteratively improve each other. We evaluate our unsupervised GEC system on English and Chinese GEC. Empirical results show that our GEC system outperforms previous unsupervised GEC systems, and achieves performance comparable to supervised GEC systems without ensemble. Furthermore, when combined with labeled training data, our system achieves new state-of-the-art results on the CoNLL-2014 and NLPCC-2018 test sets.

ROBBIE: Robust Bias Evaluation of Large Generative Language Models
David Esiobu | Xiaoqing Tan | Saghar Hosseini | Megan Ung | Yuchen Zhang | Jude Fernandes | Jane Dwivedi-Yu | Eleonora Presani | Adina Williams | Eric Smith
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

As generative large language models (LLMs) grow more performant and prevalent, we must develop comprehensive enough tools to measure and improve their fairness. Different prompt-based datasets can be used to measure social bias across multiple text domains and demographic axes, meaning that testing LLMs on more datasets can potentially help us characterize their biases more fully, and better ensure equal and equitable treatment of marginalized demographic groups. In this work, our focus is two-fold: (1) Benchmarking: a comparison of 6 different prompt-based bias and toxicity metrics across 12 demographic axes and 5 families of generative LLMs. Out of those 6 metrics, AdvPromptSet and HolisticBiasR are novel datasets proposed in the paper. The comparison of those benchmarks gives us insights about the bias and toxicity of the compared models. Therefore, we explore the frequency of demographic terms in common LLM pre-training corpora and how this may relate to model biases. (2) Mitigation: we conduct a comprehensive study of how well 3 bias/toxicity mitigation techniques perform across our suite of measurements. ROBBIE aims to provide insights for practitioners while deploying a model, emphasizing the need to not only measure potential harms, but also understand how they arise by characterizing the data, mitigate harms once found, and balance any trade-offs. We open-source our analysis code in hopes of encouraging broader measurements of bias in future LLMs.

Mixed-domain Language Modeling for Processing Long Legal Documents
Wenyue Hua | Yuchen Zhang | Zhe Chen | Josie Li | Melanie Weber
Proceedings of the Natural Legal Language Processing Workshop 2023

The application of Natural Language Processing (NLP) to specialized domains, such as the law, has recently received a surge of interest. As many legal services rely on processing and analyzing large collections of documents, automating such tasks with NLP tools such as language models emerges as a key challenge since legal documents may contain specialized vocabulary from other domains, such as medical terminology in personal injury text. However, most language models are general-purpose models, which either have limited reasoning capabilities on highly specialized legal terminology and syntax, such as BERT or RoBERTa, or are expensive to run and tune, such as GPT-3.5 and Claude. Thus, in this paper, we propose a specialized language model for personal injury text, LEGALRELECTRA, which is trained on mixed-domain legal and medical corpora. We show that as a small language model, our model improves over general-domain and single-domain medical and legal language models when processing mixed-domain (personal injury) text. Our training architecture implements the ELECTRA framework but utilizes Reformer instead of BERT for its generator and discriminator. We show that this improves the model’s performance on processing long passages and results in better long-range text comprehension.

2021

Value-Agnostic Conversational Semantic Parsing
Emmanouil Antonios Platanios | Adam Pauls | Subhro Roy | Yuchen Zhang | Alexander Kyte | Alan Guo | Sam Thomson | Jayant Krishnamurthy | Jason Wolfe | Jacob Andreas | Dan Klein
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Conversational semantic parsers map user utterances to executable programs given dialogue histories composed of previous utterances, programs, and system responses. Existing parsers typically condition on rich representations of history that include the complete set of values and computations previously discussed. We propose a model that abstracts over values to focus prediction on type- and function-level context. This approach provides a compact encoding of dialogue histories and predicted programs, improving generalization and computational efficiency. Our model incorporates several other components, including an atomic span copy operation and structural enforcement of well-formedness constraints on predicted programs, that are particularly advantageous in the low-data regime. Trained on the SMCalFlow and TreeDST datasets, our model outperforms prior work by 7.3% and 10.6% respectively in terms of absolute accuracy. Trained on only a thousand examples from each dataset, it outperforms strong baselines by 12.4% and 6.4%. These results indicate that simple representations are key to effective generalization in conversational semantic parsing.

2020

Task-Oriented Dialogue as Dataflow Synthesis
Jacob Andreas | John Bufe | David Burkett | Charles Chen | Josh Clausman | Jean Crawford | Kate Crim | Jordan DeLoach | Leah Dorner | Jason Eisner | Hao Fang | Alan Guo | David Hall | Kristin Hayes | Kellie Hill | Diana Ho | Wendy Iwaszuk | Smriti Jha | Dan Klein | Jayant Krishnamurthy | Theo Lanman | Percy Liang | Christopher H. Lin | Ilya Lintsbakh | Andy McGovern | Aleksandr Nisnevich | Adam Pauls | Dmitrij Petters | Brent Read | Dan Roth | Subhro Roy | Jesse Rusak | Beth Short | Div Slomin | Ben Snyder | Stephon Striplin | Yu Su | Zachary Tellman | Sam Thomson | Andrei Vorobev | Izabela Witoszko | Jason Wolfe | Abby Wray | Yuchen Zhang | Alexander Zotov
Transactions of the Association for Computational Linguistics, Volume 8

We describe an approach to task-oriented dialogue in which dialogue state is represented as a dataflow graph. A dialogue agent maps each user utterance to a program that extends this graph. Programs include metacomputation operators for reference and revision that reuse dataflow fragments from previous turns. Our graph-based state enables the expression and manipulation of complex user intents, and explicit metacomputation makes these intents easier for learned models to predict. We introduce a new dataset, SMCalFlow, featuring complex dialogues about events, weather, places, and people. Experiments show that dataflow graphs and metacomputation substantially improve representability and predictability in these natural dialogues. Additional experiments on the MultiWOZ dataset show that our dataflow representation enables an otherwise off-the-shelf sequence-to-sequence model to match the best existing task-specific state tracking model. The SMCalFlow dataset, code for replicating experiments, and a public leaderboard are available at https://www.microsoft.com/en-us/research/project/dataflow-based-dialogue-semantic-machines.

2019

Acquiring Structured Temporal Representation via Crowdsourcing: A Feasibility Study
Yuchen Zhang | Nianwen Xue
Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019)

Temporal Dependency Trees are a structured temporal representation that represents temporal relations among time expressions and events in a text as a dependency tree structure. Compared to traditional pair-wise temporal relation representations, temporal dependency trees facilitate efficient annotations, higher inter-annotator agreement, and efficient computations. However, annotations on temporal dependency trees so far have only been done by expert annotators, which is costly and time-consuming. In this paper, we introduce a method to crowdsource temporal dependency tree annotations, and show that this representation is intuitive and can be collected with high accuracy and agreement through crowdsourcing. We produce a corpus of temporal dependency trees, and present a baseline temporal dependency parser, trained and evaluated on this new corpus.

2018

Structured Interpretation of Temporal Relations
Yuchen Zhang | Nianwen Xue
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Neural Ranking Models for Temporal Dependency Structure Parsing
Yuchen Zhang | Nianwen Xue
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We design and build the first neural temporal dependency parser. It utilizes a neural ranking model with minimal feature engineering, and parses time expressions and events in a text into a temporal dependency tree structure. We evaluate our parser on two domains: news reports and narrative stories. In a parsing-only evaluation setup where gold time expressions and events are provided, our parser reaches 0.81 and 0.70 f-score on unlabeled and labeled parsing respectively, a result that is very competitive against alternative approaches. In an end-to-end evaluation setup where time expressions and events are automatically recognized, our parser beats two strong baselines on both data domains. Our experimental results and discussions shed light on the nature of temporal dependency structures in different domains and provide insights that we believe will be valuable to future research in this area.

2017

Macro Grammars and Holistic Triggering for Efficient Semantic Parsing
Yuchen Zhang | Panupong Pasupat | Percy Liang
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

To learn a semantic parser from denotations, a learning algorithm must search over a combinatorially large space of logical forms for ones consistent with the annotated denotations. We propose a new online learning algorithm that searches faster as training progresses. The two key ideas are using macro grammars to cache the abstract patterns of useful logical forms found thus far, and holistic triggering to efficiently retrieve the most relevant patterns based on sentence similarity. On the WikiTableQuestions dataset, we first expand the search space of an existing model to improve the state-of-the-art accuracy from 38.7% to 42.7%, and then use macro grammars and holistic triggering to achieve an 11x speedup and an accuracy of 43.7%.

2014

Buy one get one free: Distant annotation of Chinese tense, event type and modality
Nianwen Xue | Yuchen Zhang
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We describe a “distant annotation” method where we mark up the semantic tense, event type, and modality of Chinese events via a word-aligned parallel corpus. We first map Chinese verbs to their English counterparts via word alignment, and then annotate the resulting English text spans with coarse-grained categories for semantic tense, event type, and modality that we believe apply to both English and Chinese. Because English has richer morpho-syntactic indicators for semantic tense, event type and modality than Chinese, our intuition is that this distant annotation approach will yield more consistent annotation than if we annotate the Chinese side directly. We report experimental results that show stable annotation agreement statistics and that event type and modality have significant influence on tense prediction. We also report the size of the annotated corpus that we have obtained, and how different domains impact annotation consistency.

Automatic Inference of the Tense of Chinese Events Using Implicit Linguistic Information
Yuchen Zhang | Nianwen Xue
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

2013

Distant annotation of Chinese tense and modality
Nianwen Xue | Yuchen Zhang | Yaqin Yang
Proceedings of the IWCS 2013 Workshop on Annotation of Modal Meanings in Natural Language (WAMM)

Towards Robust Linguistic Analysis using OntoNotes
Sameer Pradhan | Alessandro Moschitti | Nianwen Xue | Hwee Tou Ng | Anders Björkelund | Olga Uryupina | Yuchen Zhang | Zhi Zhong
Proceedings of the Seventeenth Conference on Computational Natural Language Learning

2012

CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted Coreference in OntoNotes
Sameer Pradhan | Alessandro Moschitti | Nianwen Xue | Olga Uryupina | Yuchen Zhang
Joint Conference on EMNLP and CoNLL - Shared Task