Jesse Dodge


2024

Language Models Hallucinate, but May Excel at Fact Verification
Jian Guan | Jesse Dodge | David Wadden | Minlie Huang | Hao Peng
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Recent progress in natural language processing (NLP) owes much to remarkable advances in large language models (LLMs). Nevertheless, LLMs frequently “hallucinate,” resulting in non-factual outputs. Our carefully-designed human evaluation substantiates the serious hallucination issue, revealing that even GPT-3.5 produces factual outputs less than 25% of the time. This underscores the importance of fact verifiers in order to measure and incentivize progress. Our systematic investigation affirms that LLMs can be repurposed as effective fact verifiers with strong correlations with human judgments. Surprisingly, FLAN-T5-11B, the least factual generator in our study, performs the best as a fact verifier, even outperforming more capable LLMs like GPT-3.5 and ChatGPT. Delving deeper, we analyze the reliance of these LLMs on high-quality evidence, as well as their deficiencies in robustness and generalization ability. Our study presents insights for developing trustworthy generation models.

AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters
Li Lucy | Suchin Gururangan | Luca Soldaini | Emma Strubell | David Bamman | Lauren Klein | Jesse Dodge
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large language models’ (LLMs) abilities are drawn from their pretraining data, and model development begins with data curation. However, decisions around what data is retained or removed during this initial stage are under-scrutinized. In our work, we ground web text, which is a popular pretraining data source, to its social and geographic contexts. We create a new dataset of 10.3 million self-descriptions of website creators, and extract information about who they are and where they are from: their topical interests, social roles, and geographic affiliations. Then, we conduct the first study investigating how ten “quality” and English language identification (langID) filters affect webpages that vary along these social dimensions. Our experiments illuminate a range of implicit preferences in data curation: we show that some quality classifiers act like topical domain filters, and langID can overlook English content from some regions of the world. Overall, we hope that our work will encourage a new line of research on pretraining data curation practices and its social implications.

Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
Luca Soldaini | Rodney Kinney | Akshita Bhagia | Dustin Schwenk | David Atkinson | Russell Authur | Ben Bogin | Khyathi Chandu | Jennifer Dumas | Yanai Elazar | Valentin Hofmann | Ananya Jha | Sachin Kumar | Li Lucy | Xinxi Lyu | Nathan Lambert | Ian Magnusson | Jacob Morrison | Niklas Muennighoff | Aakanksha Naik | Crystal Nam | Matthew Peters | Abhilasha Ravichander | Kyle Richardson | Zejiang Shen | Emma Strubell | Nishant Subramani | Oyvind Tafjord | Evan Walsh | Luke Zettlemoyer | Noah Smith | Hannaneh Hajishirzi | Iz Beltagy | Dirk Groeneveld | Jesse Dodge | Kyle Lo
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Information about pretraining corpora used to train the current best-performing language models is seldom discussed: commercial models rarely detail their data, and even open models are often released without accompanying training data or recipes to reproduce them. As a result, it is challenging to conduct and advance scientific research on language modeling, such as understanding how training data impacts model capabilities and limitations. To facilitate scientific research on language model pretraining, we curate and release Dolma, a three-trillion-token English corpus, built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials. We extensively document Dolma, including its design principles, details about its construction, and a summary of its contents. We present analyses and experimental results on intermediate states of Dolma to share what we have learned about important data curation practices. Finally, we open-source our data curation toolkit to enable reproduction of our work as well as support further research in large-scale data curation.

OLMo: Accelerating the Science of Language Models
Dirk Groeneveld | Iz Beltagy | Evan Walsh | Akshita Bhagia | Rodney Kinney | Oyvind Tafjord | Ananya Jha | Hamish Ivison | Ian Magnusson | Yizhong Wang | Shane Arora | David Atkinson | Russell Authur | Khyathi Chandu | Arman Cohan | Jennifer Dumas | Yanai Elazar | Yuling Gu | Jack Hessel | Tushar Khot | William Merrill | Jacob Morrison | Niklas Muennighoff | Aakanksha Naik | Crystal Nam | Matthew Peters | Valentina Pyatkin | Abhilasha Ravichander | Dustin Schwenk | Saurabh Shah | William Smith | Emma Strubell | Nishant Subramani | Mitchell Wortsman | Pradeep Dasigi | Nathan Lambert | Kyle Richardson | Luke Zettlemoyer | Jesse Dodge | Kyle Lo | Luca Soldaini | Noah Smith | Hannaneh Hajishirzi
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Language models (LMs) have become ubiquitous in both NLP research and in commercial product offerings. As their commercial importance has surged, the most powerful models have become closed off, gated behind proprietary interfaces, with important details of their training data, architectures, and development undisclosed. Given the importance of these details in scientifically studying these models, including their biases and potential risks, we believe it is essential for the research community to have access to powerful, truly open LMs. To this end, we have built OLMo, a competitive, truly Open Language Model, to enable the scientific study of language models. Unlike most prior efforts that have only released model weights and inference code, we release OLMo alongside open training data and training and evaluation code. We hope this release will empower the open research community and inspire a new wave of innovation.

2023

AdapterSoup: Weight Averaging to Improve Generalization of Pretrained Language Models
Alexandra Chronopoulou | Matthew Peters | Alexander Fraser | Jesse Dodge
Findings of the Association for Computational Linguistics: EACL 2023

Pretrained language models (PLMs) are trained on massive corpora, but often need to specialize to specific domains. A parameter-efficient adaptation method suggests training an adapter for each domain on the task of language modeling. This leads to good in-domain scores but can be impractical for domain- or resource-restricted settings. A solution is to use a related-domain adapter for the novel domain at test time. In this paper, we introduce AdapterSoup, an approach that performs weight-space averaging of adapters trained on different domains. Our approach is embarrassingly parallel: first, we train a set of domain-specific adapters; then, for each novel domain, we determine which adapters should be averaged at test time. We present extensive experiments showing that AdapterSoup consistently improves performance to new domains without extra training. We also explore weight averaging of adapters trained on the same domain with different hyper-parameters, and show that it preserves the performance of a PLM on new domains while obtaining strong in-domain results. We explore various approaches for choosing which adapters to combine, such as text clustering and semantic similarity. We find that using clustering leads to the most competitive results on novel domains.
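
The weight-space averaging described in this abstract can be illustrated with a minimal sketch. It assumes each adapter is a plain dictionary mapping a parameter name to a flat list of floats and that all adapters share the same shapes; the function name `average_adapters` and the parameter names are hypothetical, and the step that selects which adapters to average (e.g., by text clustering) is not shown. This is not the authors' released code.

```python
def average_adapters(adapters):
    """Uniformly average several adapters in weight space.

    Assumption: each adapter is a dict {parameter_name: list_of_floats}
    and all adapters have identical keys and list lengths.
    """
    if not adapters:
        raise ValueError("need at least one adapter to average")
    reference_keys = set(adapters[0].keys())
    for adapter in adapters:
        if set(adapter.keys()) != reference_keys:
            raise ValueError("adapters must share the same parameter names")

    averaged = {}
    for name in adapters[0]:
        # Group the i-th weight of every adapter together, then take the mean.
        columns = zip(*(adapter[name] for adapter in adapters))
        averaged[name] = [sum(column) / len(adapters) for column in columns]
    return averaged


# Two toy adapters with hypothetical parameter names.
adapter_a = {"down": [1.0, 3.0], "up": [0.0, 2.0]}
adapter_b = {"down": [3.0, 5.0], "up": [2.0, 4.0]}
soup = average_adapters([adapter_a, adapter_b])
```

Because the result has the same shape as a single adapter, it can be used at test time without extra training, which is the property the abstract emphasizes.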

Words as Gatekeepers: Measuring Discipline-specific Terms and Meanings in Scholarly Publications
Li Lucy | Jesse Dodge | David Bamman | Katherine Keith
Findings of the Association for Computational Linguistics: ACL 2023

Scholarly text is often laden with jargon, or specialized language that can facilitate efficient in-group communication within fields but hinder understanding for out-groups. In this work, we develop and validate an interpretable approach for measuring scholarly jargon from text. Expanding the scope of prior work which focuses on word types, we use word sense induction to also identify words that are widespread but overloaded with different meanings across fields. We then estimate the prevalence of these discipline-specific words and senses across hundreds of subfields, and show that word senses provide a complementary, yet unique view of jargon alongside word types. We demonstrate the utility of our metrics for science of science and computational sociolinguistics by highlighting two key social implications. First, though most fields reduce their use of jargon when writing for general-purpose venues, some fields (e.g., biological sciences) do so less than others. Second, the direction of correlation between jargon and citation rates varies among fields, but jargon is nearly always negatively correlated with interdisciplinary impact. Broadly, our findings suggest that though multidisciplinary venues intend to cater to more general audiences, some fields’ writing norms may act as barriers rather than bridges, and thus impede the dispersion of scholarly ideas.

Stubborn Lexical Bias in Data and Models
Sofia Serrano | Jesse Dodge | Noah A. Smith
Findings of the Association for Computational Linguistics: ACL 2023

In NLP, recent work has seen increased focus on spurious correlations between various features and labels in training data, and how these influence model behavior. However, the presence and effect of such correlations are typically examined feature by feature. We investigate the cumulative impact on a model of many such intersecting features. Using a new statistical method, we examine whether such spurious patterns in data appear in models trained on the data. We select two tasks, natural language inference and duplicate-question detection, for which any unigram feature on its own should ideally be uninformative, which gives us a large pool of automatically extracted features with which to experiment. The large size of this pool allows us to investigate the intersection of features spuriously associated with (potentially different) labels. We then apply an optimization approach to *reweight* the training data, reducing thousands of spurious correlations, and examine how doing so affects models trained on the reweighted data. Surprisingly, though this method can successfully reduce lexical biases in the training data, we still find strong evidence of corresponding bias in the trained models, including worsened bias for slightly more complex features (bigrams). We close with discussion about the implications of our results on what it means to “debias” training data, and how issues of data quality can affect model bias.

Reproducibility in NLP: What Have We Learned from the Checklist?
Ian Magnusson | Noah A. Smith | Jesse Dodge
Findings of the Association for Computational Linguistics: ACL 2023

Scientific progress in NLP rests on the reproducibility of researchers’ claims. The *CL conferences created the NLP Reproducibility Checklist in 2020 to be completed by authors at submission to remind them of key information to include. We provide the first analysis of the Checklist by examining 10,405 anonymous responses to it. First, we find evidence of an increase in reporting of information on efficiency, validation performance, summary statistics, and hyperparameters after the Checklist’s introduction. Further, we show acceptance rate grows for submissions with more Yes responses. We find that the 44% of submissions that gather new data are 5% less likely to be accepted than those that did not; the average reviewer-rated reproducibility of these submissions is also 2% lower relative to the rest. We find that only 46% of submissions claim to open-source their code, though submissions that do have an 8% higher reproducibility score relative to those that do not, the most for any item. We discuss what can be inferred about the state of reproducibility in NLP, and provide a set of recommendations for future conferences, including: a) allowing submitting code and appendices one week after the deadline, and b) measuring dataset reproducibility by a checklist of data collection practices.

Efficient Methods for Natural Language Processing: A Survey
Marcos Treviso | Ji-Ung Lee | Tianchu Ji | Betty van Aken | Qingqing Cao | Manuel R. Ciosici | Michael Hassid | Kenneth Heafield | Sara Hooker | Colin Raffel | Pedro H. Martins | André F. T. Martins | Jessica Zosa Forde | Peter Milder | Edwin Simpson | Noam Slonim | Jesse Dodge | Emma Strubell | Niranjan Balasubramanian | Leon Derczynski | Iryna Gurevych | Roy Schwartz
Transactions of the Association for Computational Linguistics, Volume 11

Recent work in natural language processing (NLP) has yielded appealing results from scaling model parameters and training data; however, using only scale to improve performance means that resource consumption also grows. Such resources include data, time, storage, or energy, all of which are naturally limited and unevenly distributed. This motivates research into efficient methods that require fewer resources to achieve similar results. This survey synthesizes and relates current methods and findings in efficient NLP. We aim to provide both guidance for conducting NLP under limited resources, and point towards promising research directions for developing more efficient methods.

Detecting Personal Information in Training Corpora: an Analysis
Nishant Subramani | Sasha Luccioni | Jesse Dodge | Margaret Mitchell
Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023)

Large language models are trained on increasing quantities of unstructured text, the largest sources of which are scraped from the Web. These Web scrapes are mainly composed of heterogeneous collections of text from multiple domains with minimal documentation. While some work has been done to identify and remove toxic, biased, or sexual language, the topic of personal information (PI) in textual data used for training Natural Language Processing (NLP) models is relatively under-explored. In this work, we draw from definitions of PI across multiple countries to define the first PI taxonomy of its kind, categorized by type and risk level. We then conduct a case study on the Colossal Clean Crawled Corpus (C4) and the Pile, to detect some of the highest-risk personal information, such as email addresses and credit card numbers, and examine the differences between automatic and regular expression-based approaches for their detection. We identify shortcomings in modern approaches for PI detection, and propose a reframing of the problem that is informed by global perspectives and the goals in personal information detection.

2022

Towards Reproducible Machine Learning Research in Natural Language Processing
Ana Lucic | Maurits Bleeker | Samarth Bhargav | Jessica Forde | Koustuv Sinha | Jesse Dodge | Sasha Luccioni | Robert Stojnic
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts

While recent progress in the field of ML has been significant, the reproducibility of these cutting-edge results is often lacking, with many submissions lacking the necessary information in order to ensure subsequent reproducibility. Despite proposals such as the Reproducibility Checklist and reproducibility criteria at several major conferences, the reflex for carrying out research with reproducibility in mind is lacking in the broader ML community. We propose this tutorial as a gentle introduction to ensuring reproducible research in ML, with a specific emphasis on computational linguistics and NLP. We also provide a framework for using reproducibility as a teaching tool in university-level computer science programs.

Efficient Hierarchical Domain Adaptation for Pretrained Language Models
Alexandra Chronopoulou | Matthew Peters | Jesse Dodge
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

The remarkable success of large language models has been driven by dense models trained on massive unlabeled, unstructured corpora. These corpora typically contain text from diverse, heterogeneous sources, but information about the source of the text is rarely used during training. Transferring their knowledge to a target domain is typically done by continuing training in-domain. In this paper, we introduce a method to permit domain adaptation to many diverse domains using a computationally efficient adapter approach. Our method is based on the observation that textual domains are partially overlapping, and we represent domains as a hierarchical tree structure where each node in the tree is associated with a set of adapter weights. When combined with a frozen pretrained language model, this approach enables parameter sharing among related domains, while avoiding negative interference between unrelated ones. Experimental results with GPT-2 and a large fraction of the 100 most represented websites in C4 show across-the-board improvements in-domain. We additionally provide an inference time algorithm for a held-out domain and show that averaging over multiple paths through the tree enables further gains in generalization, while adding only a marginal cost to inference.

Findings of the WMT’22 Shared Task on Large-Scale Machine Translation Evaluation for African Languages
David Adelani | Md Mahfuz Ibn Alam | Antonios Anastasopoulos | Akshita Bhagia | Marta R. Costa-jussà | Jesse Dodge | Fahim Faisal | Christian Federmann | Natalia Fedorova | Francisco Guzmán | Sergey Koshelev | Jean Maillard | Vukosi Marivate | Jonathan Mbuya | Alexandre Mourachko | Safiyyah Saleem | Holger Schwenk | Guillaume Wenzek
Proceedings of the Seventh Conference on Machine Translation (WMT)

We present the results of the WMT’22 Shared Task on Large-Scale Machine Translation Evaluation for African Languages. The shared task included both a data and a systems track, along with additional innovations, such as a focus on African languages and extensive human evaluation of submitted systems. We received 14 system submissions from 8 teams, as well as 6 data track contributions. We report large progress in the quality of translation for African languages since the last iteration of this shared task: there is an increase of about 7.5 BLEU points across 72 language pairs, and the average BLEU scores went from 15.09 to 22.60.

2021

Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
Jesse Dodge | Maarten Sap | Ana Marasović | William Agnew | Gabriel Ilharco | Dirk Groeneveld | Margaret Mitchell | Matt Gardner
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Large language models have led to remarkable progress on many NLP tasks, and researchers are turning to ever-larger text corpora to train them. Some of the largest corpora available are made by scraping significant portions of the internet, and are frequently introduced with only minimal documentation. In this work we provide some of the first documentation for the Colossal Clean Crawled Corpus (C4; Raffel et al., 2020), a dataset created by applying a set of filters to a single snapshot of Common Crawl. We begin by investigating where the data came from, and find a significant amount of text from unexpected sources like patents and US military websites. Then we explore the content of the text itself, and find machine-generated text (e.g., from machine translation systems) and evaluation examples from other benchmark NLP datasets. To understand the impact of the filters applied to create this dataset, we evaluate the text that was removed, and show that blocklist filtering disproportionately removes text from and about minority individuals. Finally, we conclude with some recommendations for how to create and document web-scale datasets from a scrape of the internet.

Competency Problems: On Finding and Removing Artifacts in Language Data
Matt Gardner | William Merrill | Jesse Dodge | Matthew Peters | Alexis Ross | Sameer Singh | Noah A. Smith
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Much recent work in NLP has documented dataset artifacts, bias, and spurious correlations between input features and output labels. However, how to tell which features have “spurious” instead of legitimate correlations is typically left unspecified. In this work we argue that for complex language understanding tasks, all simple feature correlations are spurious, and we formalize this notion into a class of problems which we call competency problems. For example, the word “amazing” on its own should not give information about a sentiment label independent of the context in which it appears, which could include negation, metaphor, sarcasm, etc. We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account, showing that realistic datasets will increasingly deviate from competency problems as dataset size increases. This analysis gives us a simple statistical test for dataset artifacts, which we use to show more subtle biases than were described in prior work, including demonstrating that models are inappropriately affected by these less extreme biases. Our theoretical treatment of this problem also allows us to analyze proposed solutions, such as making local edits to dataset instances, and to give recommendations for future data collection and model design efforts that target competency problems.

Expected Validation Performance and Estimation of a Random Variable’s Maximum
Jesse Dodge | Suchin Gururangan | Dallas Card | Roy Schwartz | Noah A. Smith
Findings of the Association for Computational Linguistics: EMNLP 2021

Research in NLP is often supported by experimental results, and improved reporting of such results can lead to better understanding and more reproducible science. In this paper we analyze three statistical estimators for expected validation performance, a tool used for reporting performance (e.g., accuracy) as a function of computational budget (e.g., number of hyperparameter tuning experiments). Where previous work analyzing such estimators focused on the bias, we also examine the variance and mean squared error (MSE). In both synthetic and realistic scenarios, we evaluate three estimators and find the unbiased estimator has the highest variance, and the estimator with the smallest variance has the largest bias; the estimator with the smallest MSE strikes a balance between bias and variance, displaying a classic bias-variance tradeoff. We use expected validation performance to compare between different models, and analyze how frequently each estimator leads to drawing incorrect conclusions about which of two models performs best. We find that the two biased estimators lead to the fewest incorrect conclusions, which hints at the importance of minimizing variance and MSE.

2020

The Right Tool for the Job: Matching Model and Instance Complexities
Roy Schwartz | Gabriel Stanovsky | Swabha Swayamdipta | Jesse Dodge | Noah A. Smith
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

As NLP models become larger, executing a trained model requires significant computational resources incurring monetary and environmental costs. To better respect a given inference budget, we propose a modification to contextual representation fine-tuning which, during inference, allows for an early (and fast) “exit” from neural network calculations for simple instances, and late (and accurate) exit for hard instances. To achieve this, we add classifiers to different layers of BERT and use their calibrated confidence scores to make early exit decisions. We test our proposed modification on five different datasets in two tasks: three text classification datasets and two natural language inference benchmarks. Our method presents a favorable speed/accuracy tradeoff in almost all cases, producing models which are up to five times faster than the state of the art, while preserving their accuracy. Our method also requires almost no additional training resources (in either time or parameters) compared to the baseline BERT model. Finally, our method alleviates the need for costly retraining of multiple models at different levels of efficiency; we allow users to control the inference speed/accuracy tradeoff using a single trained model, by setting a single variable at inference time. We publicly release our code.
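
The early-exit decision described in this abstract can be sketched as follows. The sketch assumes the per-layer classifier outputs are already available as lists of calibrated class probabilities, ordered from the shallowest to the deepest layer; the function name `early_exit_prediction` and the `threshold` parameter are hypothetical stand-ins for the single inference-time variable the abstract mentions, and the sketch is not the authors' released code.

```python
def early_exit_prediction(layer_probs, threshold):
    """Return (predicted_class, exit_depth) for one instance.

    Assumption: layer_probs[i] holds the calibrated class probabilities
    produced by the classifier attached to the (i+1)-th exit point.
    """
    if not layer_probs:
        raise ValueError("need at least one classifier output")
    for depth, probs in enumerate(layer_probs, start=1):
        confidence = max(probs)
        # Exit as soon as a classifier is confident enough.
        if confidence >= threshold:
            return probs.index(confidence), depth
    # No classifier was confident enough: fall back to the deepest one.
    final_probs = layer_probs[-1]
    return final_probs.index(max(final_probs)), len(layer_probs)


# Toy outputs from three exit points for a two-class task.
toy_outputs = [[0.6, 0.4], [0.2, 0.8], [0.1, 0.9]]
fast = early_exit_prediction(toy_outputs, threshold=0.5)
balanced = early_exit_prediction(toy_outputs, threshold=0.75)
careful = early_exit_prediction(toy_outputs, threshold=0.95)
```

Lowering the threshold makes more instances exit at shallow layers (faster, possibly less accurate); raising it pushes instances toward the final layer.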

2019

RNN Architecture Learning with Sparse Regularization
Jesse Dodge | Roy Schwartz | Hao Peng | Noah A. Smith
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Neural models for NLP typically use large numbers of parameters to reach state-of-the-art performance, which can lead to excessive memory usage and increased runtime. We present a structure learning method for learning sparse, parameter-efficient NLP models. Our method applies group lasso to rational RNNs (Peng et al., 2018), a family of models that is closely connected to weighted finite-state automata (WFSAs). We take advantage of rational RNNs’ natural grouping of the weights, so the group lasso penalty directly removes WFSA states, substantially reducing the number of parameters in the model. Our experiments on a number of sentiment analysis datasets, using both GloVe and BERT embeddings, show that our approach learns neural structures which have fewer parameters without sacrificing performance relative to parameter-rich baselines. Our method also highlights the interpretable properties of rational RNNs. We show that sparsifying such models makes them easier to visualize, and we present models that rely exclusively on as few as three WFSAs after pruning more than 90% of the weights. We publicly release our code.
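
As a rough illustration of the group lasso penalty this abstract refers to, the sketch below computes the standard form of the penalty: a regularization strength times the sum of the L2 norms of each weight group. It assumes the weights have already been partitioned into groups (in the paper, the grouping follows WFSA states; here the groups are just hypothetical lists of floats), and the names `group_lasso_penalty` and `strength` are illustrative.

```python
import math


def group_lasso_penalty(groups, strength):
    """Sum of per-group L2 norms, scaled by a regularization strength.

    Assumption: `groups` is a list of weight groups, each a list of floats.
    A group whose weights are all zero contributes nothing, which is what
    lets this penalty remove whole groups rather than single weights.
    """
    total = 0.0
    for group in groups:
        total += math.sqrt(sum(weight * weight for weight in group))
    return strength * total


# One active group (norm 5.0) and one group that has been driven to zero.
penalty = group_lasso_penalty([[3.0, 4.0], [0.0, 0.0]], strength=0.1)
```

Because the norm is taken per group rather than per weight, minimizing this term tends to zero out entire groups at once, which matches the abstract's description of the penalty removing whole states.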

Show Your Work: Improved Reporting of Experimental Results
Jesse Dodge | Suchin Gururangan | Dallas Card | Roy Schwartz | Noah A. Smith
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Research in natural language processing proceeds, in part, by demonstrating that new models achieve superior performance (e.g., accuracy) on held-out test data, compared to previous results. In this paper, we demonstrate that test-set performance scores alone are insufficient for drawing accurate conclusions about which model performs best. We argue for reporting additional details, especially performance on validation data obtained during model development. We present a novel technique for doing so: expected validation performance of the best-found model as a function of computation budget (i.e., the number of hyperparameter search trials or the overall training time). Using our approach, we find multiple recent model comparisons where authors would have reached a different conclusion if they had used more (or less) computation. Our approach also allows us to estimate the amount of computation required to obtain a given accuracy; applying it to several recently published results yields massive variation across papers, from hours to weeks. We conclude with a set of best practices for reporting experimental results which allow for robust future comparisons, and provide code to allow researchers to use our technique.
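
The idea of expected validation performance as a function of budget can be illustrated with a short sketch. It assumes a plug-in estimator: the n hyperparameter trials are treated as independent draws, with replacement, from the empirical distribution of the observed validation scores, so the probability that the best of n draws is at most the i-th smallest of N scores is (i/N)^n. The function name and the toy scores are hypothetical, and this is not the authors' released code.

```python
def expected_max_performance(scores, n):
    """Estimate the expected best validation score among n trials.

    Assumption: trials are i.i.d. draws (with replacement) from the
    empirical distribution of `scores`.
    """
    if n < 1:
        raise ValueError("budget n must be at least 1")
    if not scores:
        raise ValueError("need at least one observed score")
    ordered = sorted(scores)
    total = len(ordered)
    expectation = 0.0
    for rank, value in enumerate(ordered, start=1):
        # P(max of n draws equals this order statistic).
        prob = (rank / total) ** n - ((rank - 1) / total) ** n
        expectation += value * prob
    return expectation


# Hypothetical validation accuracies from three trials.
observed = [0.5, 0.7, 0.9]
budget_one = expected_max_performance(observed, 1)
budget_two = expected_max_performance(observed, 2)
```

With a budget of one trial the estimate equals the mean of the observed scores, and it rises toward the best observed score as the budget grows, which is what makes the resulting curve useful for comparing models at matched computation.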

2016

Key-Value Memory Networks for Directly Reading Documents
Alexander Miller | Adam Fisch | Jesse Dodge | Amir-Hossein Karimi | Antoine Bordes | Jason Weston
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

2015

Retrofitting Word Vectors to Semantic Lexicons
Manaal Faruqui | Jesse Dodge | Sujay Kumar Jauhar | Chris Dyer | Eduard Hovy | Noah A. Smith
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2014

Context-dependent Semantic Parsing for Time Expressions
Kenton Lee | Yoav Artzi | Jesse Dodge | Luke Zettlemoyer
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

CMU: Arc-Factored, Discriminative Semantic Dependency Parsing
Sam Thomson | Brendan O’Connor | Jeffrey Flanigan | David Bamman | Jesse Dodge | Swabha Swayamdipta | Nathan Schneider | Chris Dyer | Noah A. Smith
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

2012

Detecting Visual Text
Jesse Dodge | Amit Goyal | Xufeng Han | Alyssa Mensch | Margaret Mitchell | Karl Stratos | Kota Yamaguchi | Yejin Choi | Hal Daumé III | Alex Berg | Tamara Berg
Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Midge: Generating Image Descriptions From Computer Vision Detections
Margaret Mitchell | Jesse Dodge | Amit Goyal | Kota Yamaguchi | Karl Stratos | Xufeng Han | Alyssa Mensch | Alex Berg | Tamara Berg | Hal Daumé III
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics
