Annie Louis


2024

pdf bib
Little Red Riding Hood Goes around the Globe: Crosslingual Story Planning and Generation with Large Language Models
Evgeniia Razumovskaia | Joshua Maynez | Annie Louis | Mirella Lapata | Shashi Narayan
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Previous work has demonstrated the effectiveness of planning for story generation exclusively in a monolingual setting focusing primarily on English. We consider whether planning brings advantages to automatic story generation across languages. We propose a new task of crosslingual story generation with planning and present a new dataset for this task. We conduct a comprehensive study of different plans and generate stories in several languages, by leveraging the creative and reasoning capabilities of large pretrained language models. Our results demonstrate that plans which structure stories into three acts lead to more coherent and interesting narratives, while allowing to explicitly control their content and structure.

pdf bib
A synthetic data approach for domain generalization of NLI models
Mohammad Javad Hosseini | Andrey Petrov | Alex Fabrikant | Annie Louis
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Natural Language Inference (NLI) remains an important benchmark task for LLMs. NLI datasets are a springboard for transfer learning to other semantic tasks, and NLI models are standard tools for identifying the faithfulness of model-generated text. There are several large scale NLI datasets today, and models have improved greatly by hill-climbing on these collections. Yet their realistic performance on out-of-distribution/domain data is less well-understood. We explore the opportunity for synthetic high-quality datasets to adapt NLI models for zero-shot use in downstream applications across new and unseen text domains. We demonstrate a new approach for generating NLI data in diverse domains and lengths, so far not covered by existing training sets. The resulting examples have meaningful premises, the hypotheses are formed in creative ways rather than simple edits to a few premise tokens, and the labels have high accuracy. We show that models trained on this data (685K synthetic examples) have the best generalization to completely new downstream test settings. On the TRUE benchmark, a T5-small model trained with our data improves around 7% on average compared to training on the best alternative dataset. The improvements are more pronounced for smaller models, while still meaningful on a T5 XXL model. We also demonstrate gains on test sets when in-domain training data is augmented with our domain-general synthetic data.

2023

pdf bib
OpineSum: Entailment-based self-training for abstractive opinion summarization
Annie Louis | Joshua Maynez
Findings of the Association for Computational Linguistics: ACL 2023

A typical product or place often has hundreds of reviews, and summarization of these texts is an important and challenging problem. Recent progress on abstractive summarization in domains such as news has been driven by supervised systems trained on hundreds of thousands of news articles paired with human-written summaries. However for opinion texts, such large scale datasets are rarely available. Unsupervised methods, self-training, and few-shot learning approaches bridge that gap. In this work, we present a novel self-training approach, OpineSum for abstractive opinion summarization. The self-training summaries in this approach are built automatically using a novel application of textual entailment and capture the consensus of opinions across the various reviews for an item. This method can be used to obtain silver-standard summaries on a large scale and train both unsupervised and few-shot abstractive summarization systems. OpineSum outperforms strong peer systems in both settings.

pdf bib
Conditional Generation with a Question-Answering Blueprint
Shashi Narayan | Joshua Maynez | Reinald Kim Amplayo | Kuzman Ganchev | Annie Louis | Fantine Huot | Anders Sandholm | Dipanjan Das | Mirella Lapata
Transactions of the Association for Computational Linguistics, Volume 11

The ability to convey relevant and faithful information is critical for many tasks in conditional generation and yet remains elusive for neural seq-to-seq models whose outputs often reveal hallucinations and fail to correctly cover important details. In this work, we advocate planning as a useful intermediate representation for rendering conditional generation less opaque and more grounded. We propose a new conceptualization of text plans as a sequence of question-answer (QA) pairs and enhance existing datasets (e.g., for summarization) with a QA blueprint operating as a proxy for content selection (i.e., what to say) and planning (i.e., in what order). We obtain blueprints automatically by exploiting state-of-the-art question generation technology and convert input-output pairs into input-blueprint-output tuples. We develop Transformer-based models, each varying in how they incorporate the blueprint in the generated output (e.g., as a global plan or iteratively). Evaluation across metrics and datasets demonstrates that blueprint models are more factual than alternatives which do not resort to planning and allow tighter control of the generation output.

pdf bib
LAIT: Efficient Multi-Segment Encoding in Transformers with Layer-Adjustable Interaction
Jeremiah Milbauer | Annie Louis | Mohammad Javad Hosseini | Alex Fabrikant | Donald Metzler | Tal Schuster
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Transformer encoders contextualize token representations by attending to all other tokens at each layer, leading to quadratic increase in compute effort with the input length. In practice, however, the input text of many NLP tasks can be seen as a sequence of related segments (e.g., the sequence of sentences within a passage, or the hypothesis and premise in NLI). While attending across these segments is highly beneficial for many tasks, we hypothesize that this interaction can be delayed until later encoding stages. To this end, we introduce Layer-Adjustable Interactions in Transformers (LAIT). Within LAIT, segmented inputs are first encoded independently, and then jointly. This partial two-tower architecture bridges the gap between a Dual Encoder’s ability to pre-compute representations for segments and a fully self-attentive Transformer’s capacity to model cross-segment attention. The LAIT framework effectively leverages existing pretrained Transformers and converts them into the hybrid of the two aforementioned architectures, allowing for easy and intuitive control over the performance-efficiency tradeoff. Experimenting on a wide range of NLP tasks, we find LAIT able to reduce 30-50% of the attention FLOPs on many tasks, while preserving high accuracy; in some practical settings, LAIT could reduce actual latency by orders of magnitude.

pdf bib
Resolving Indirect Referring Expressions for Entity Selection
Mohammad Javad Hosseini | Filip Radlinski | Silvia Pareti | Annie Louis
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recent advances in language modeling have enabled new conversational systems. In particular, it is often desirable for people to make choices among specified options when using such systems. We address the problem of reference resolution, when people use natural expressions to choose between real world entities. For example, given the choice ‘Should we make a Simnel cake or a Pandan cake¿ a natural response from a non-expert may be indirect: ‘let’s make the green one‘. Reference resolution has been little studied with natural expressions, thus robustly understanding such language has large potential for improving naturalness in dialog, recommendation, and search systems. We create AltEntities (Alternative Entities), a new public dataset of entity pairs and utterances, and develop models for the disambiguation problem. Consisting of 42K indirect referring expressions across three domains, it enables for the first time the study of how large language models can be adapted to this task. We find they achieve 82%-87% accuracy in realistic settings, which while reasonable also invites further advances.

2022

pdf bib
Source-summary Entity Aggregation in Abstractive Summarization
José Ángel González | Annie Louis | Jackie Chi Kit Cheung
Proceedings of the 29th International Conference on Computational Linguistics

In a text, entities mentioned earlier can be referred to in later discourse by a more general description. For example, Celine Dion and Justin Bieber can be referred to by Canadian singers or celebrities. In this work, we study this phenomenon in the context of summarization, where entities from a source text are generalized in the summary. We call such instances source-summary entity aggregations. We categorize these aggregations into two types and analyze them in the Cnn/Dailymail corpus, showing that they are reasonably frequent. We then examine how well three state-of-the-art summarization systems can generate such aggregations within summaries. We also develop techniques to encourage them to generate more aggregations. Our results show that there is significant room for improvement in producing semantically correct aggregations.

2021

pdf bib
Proceedings of the 2nd Workshop on Computational Approaches to Discourse
Chloé Braud | Christian Hardmeier | Junyi Jessy Li | Annie Louis | Michael Strube | Amir Zeldes
Proceedings of the 2nd Workshop on Computational Approaches to Discourse

2020

pdf bib
Proceedings of the First Workshop on Computational Approaches to Discourse
Chloé Braud | Christian Hardmeier | Junyi Jessy Li | Annie Louis | Michael Strube
Proceedings of the First Workshop on Computational Approaches to Discourse

pdf bib
I’d rather just go to bed”: Understanding Indirect Answers
Annie Louis | Dan Roth | Filip Radlinski
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We revisit a pragmatic inference problem in dialog: Understanding indirect responses to questions. Humans can interpret ‘I’m starving.’ in response to ‘Hungry?’, even without direct cue words such as ‘yes’ and ‘no’. In dialog systems, allowing natural responses rather than closed vocabularies would be similarly beneficial. However, today’s systems are only as sensitive to these pragmatic moves as their language model allows. We create and release the first large-scale English language corpus ‘Circa’ with 34,268 (polar question, indirect answer) pairs to enable progress on this task. The data was collected via elaborate crowdsourcing, and contains utterances with yes/no meaning, as well as uncertain, middle-ground, and conditional responses. We also present BERT-based neural models to predict such categories for a question-answer pair. We find that while transfer learning from entailment works reasonably, performance is not yet sufficient for robust dialog. Our models reach 82-88% accuracy for a 4-class distinction, and 74-85% for 6 classes.

pdf bib
TESA: A Task in Entity Semantic Aggregation for Abstractive Summarization
Clément Jumel | Annie Louis | Jackie Chi Kit Cheung
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Human-written texts contain frequent generalizations and semantic aggregation of content. In a document, they may refer to a pair of named entities such as ‘London’ and ‘Paris’ with different expressions: “the major cities”, “the capital cities” and “two European cities”. Yet generation, especially, abstractive summarization systems have so far focused heavily on paraphrasing and simplifying the source content, to the exclusion of such semantic abstraction capabilities. In this paper, we present a new dataset and task aimed at the semantic aggregation of entities. TESA contains a dataset of 5.3K crowd-sourced entity aggregations of Person, Organization, and Location named entities. The aggregations are document-appropriate, meaning that they are produced by annotators to match the situational context of a given news article from the New York Times. We then build baseline models for generating aggregations given a tuple of entities and document context. We finetune on TESA an encoder-decoder language model and compare it with simpler classification methods based on linguistically informed features. Our quantitative and qualitative evaluations show reasonable performance in making a choice from a given list of expressions, but free-form expressions are understandably harder to generate and evaluate.

2019

pdf bib
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations)
Waleed Ammar | Annie Louis | Nasrin Mostafazadeh
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations)

pdf bib
Countering the Effects of Lead Bias in News Summarization via Multi-Stage Training and Auxiliary Losses
Matt Grenander | Yue Dong | Jackie Chi Kit Cheung | Annie Louis
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Sentence position is a strong feature for news summarization, since the lead often (but not always) summarizes the key points of the article. In this paper, we show that recent neural systems excessively exploit this trend, which although powerful for many inputs, is also detrimental when summarizing documents where important content should be extracted from later parts of the article. We propose two techniques to make systems sensitive to the importance of content in different parts of the article. The first technique employs ‘unbiased’ data; i.e., randomly shuffled sentences of the source document, to pretrain the model. The second technique uses an auxiliary ROUGE-based loss that encourages the model to distribute importance scores throughout a document by mimicking sentence-level ROUGE scores on the training data. We show that these techniques significantly improve the performance of a competitive reinforcement learning based extractive system, with the auxiliary loss being more powerful than pretraining.

2018

pdf bib
Deep Dungeons and Dragons: Learning Character-Action Interactions from Role-Playing Game Transcripts
Annie Louis | Charles Sutton
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

An essential aspect to understanding narratives is to grasp the interaction between characters in a story and the actions they take. We examine whether computational models can capture this interaction, when both character attributes and actions are expressed as complex natural language descriptions. We propose role-playing games as a testbed for this problem, and introduce a large corpus of game transcripts collected from online discussion forums. Using neural language models which combine character and action descriptions from these stories, we show that we can learn the latent ties. Action sequences are better predicted when the character performing the action is also taken into account, and vice versa for character attributes.

pdf bib
Getting to “Hearer-old”: Charting Referring Expressions Across Time
Ieva Staliūnaitė | Hannah Rohde | Bonnie Webber | Annie Louis
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

When a reader is first introduced to an entity, its referring expression must describe the entity. For entities that are widely known, a single word or phrase often suffices. This paper presents the first study of how expressions that refer to the same entity develop over time. We track thousands of person and organization entities over 20 years of New York Times (NYT). As entities move from hearer-new (first introduction to the NYT audience) to hearer-old (common knowledge) status, we show empirically that the referring expressions along this trajectory depend on the type of the entity, and exhibit linguistic properties related to becoming common knowledge (e.g., shorter length, less use of appositives, more definiteness). These properties can also be used to build a model to predict how long it will take for an entity to reach hearer-old status. Our results reach 10-30% absolute improvement over a majority-class baseline.

2017

pdf bib
Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics
Michael Roth | Nasrin Mostafazadeh | Nathanael Chambers | Annie Louis
Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics

pdf bib
LSDSem 2017 Shared Task: The Story Cloze Test
Nasrin Mostafazadeh | Michael Roth | Annie Louis | Nathanael Chambers | James Allen
Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics

The LSDSem’17 shared task is the Story Cloze Test, a new evaluation for story understanding and script learning. This test provides a system with a four-sentence story and two possible endings, and the system must choose the correct ending to the story. Successful narrative understanding (getting closer to human performance of 100%) requires systems to link various levels of semantics to commonsense knowledge. A total of eight systems participated in the shared task, with a variety of approaches including.

pdf bib
Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue
Kristiina Jokinen | Manfred Stede | David DeVault | Annie Louis
Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue

pdf bib
Exploring Substitutability through Discourse Adverbials and Multiple Judgments
Hannah Rohde | Anna Dickinson | Nathan Schneider | Annie Louis | Bonnie Webber
Proceedings of the 12th International Conference on Computational Semantics (IWCS) — Long papers

2016

pdf bib
Filling in the Blanks in Understanding Discourse Adverbials: Consistency, Conflict, and Context-Dependence in a Crowdsourced Elicitation Task
Hannah Rohde | Anna Dickinson | Nathan Schneider | Christopher N. L. Clark | Annie Louis | Bonnie Webber
Proceedings of the 10th Linguistic Annotation Workshop held in conjunction with ACL 2016 (LAW-X 2016)

pdf bib
Proceedings of the Workshop on Uphill Battles in Language Processing: Scaling Early Achievements to Robust Methods
Annie Louis | Michael Roth | Bonnie Webber | Michael White | Luke Zettlemoyer
Proceedings of the Workshop on Uphill Battles in Language Processing: Scaling Early Achievements to Robust Methods

pdf bib
Book Reviews: Natural Language Processing for Social Media by Atefeh Farzindar and Diana Inkpen
Annie Louis
Computational Linguistics, Volume 42, Issue 4 - December 2016

2015

pdf bib
Proceedings of the First Workshop on Linking Computational Models of Lexical, Sentential and Discourse-level Semantics
Michael Roth | Annie Louis | Bonnie Webber | Tim Baldwin
Proceedings of the First Workshop on Linking Computational Models of Lexical, Sentential and Discourse-level Semantics

pdf bib
Recovering discourse relations: Varying influence of discourse adverbials
Hannah Rohde | Anna Dickinson | Chris Clark | Annie Louis | Bonnie Webber
Proceedings of the First Workshop on Linking Computational Models of Lexical, Sentential and Discourse-level Semantics

pdf bib
Which Step Do I Take First? Troubleshooting with Bayesian Models
Annie Louis | Mirella Lapata
Transactions of the Association for Computational Linguistics, Volume 3

Online discussion forums and community question-answering websites provide one of the primary avenues for online users to share information. In this paper, we propose text mining techniques which aid users navigate troubleshooting-oriented data such as questions asked on forums and their suggested solutions. We introduce Bayesian generative models of the troubleshooting data and apply them to two interrelated tasks: (a) predicting the complexity of the solutions (e.g., plugging a keyboard in the computer is easier compared to installing a special driver) and (b) presenting them in a ranked order from least to most complex. Experimental results show that our models are on par with human performance on these tasks, while outperforming baselines based on solution length or readability.

pdf bib
Conversation Trees: A Grammar Model for Topic Structure in Forums
Annie Louis | Shay B. Cohen
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

2014

pdf bib
A Bayesian Method to Incorporate Background Knowledge during Automatic Text Summarization
Annie Louis
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Proceedings of the ACL 2014 Student Research Workshop
Ekaterina Kochmar | Annie Louis | Svitlana Volkova | Jordan Boyd-Graber | Bill Byrne
Proceedings of the ACL 2014 Student Research Workshop

pdf bib
Structured and Unstructured Cache Models for SMT Domain Adaptation
Annie Louis | Bonnie Webber
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics

pdf bib
Verbose, Laconic or Just Right: A Simple Computational Model of Content Appropriateness under Length Constraints
Annie Louis | Ani Nenkova
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics

2013

pdf bib
Proceedings of the 2013 NAACL HLT Student Research Workshop
Annie Louis | Richard Socher | Julia Hockenmaier | Eric K. Ringger
Proceedings of the 2013 NAACL HLT Student Research Workshop

pdf bib
What Makes Writing Great? First Experiments on Article Quality Prediction in the Science Journalism Domain
Annie Louis | Ani Nenkova
Transactions of the Association for Computational Linguistics, Volume 1

Great writing is rare and highly admired. Readers seek out articles that are beautifully written, informative and entertaining. Yet information-access technologies lack capabilities for predicting article quality at this level. In this paper we present first experiments on article quality prediction in the science journalism domain. We introduce a corpus of great pieces of science journalism, along with typical articles from the genre. We implement features to capture aspects of great writing, including surprising, visual and emotional content, as well as general features related to discourse organization and sentence structure. We show that the distinction between great and typical articles can be detected fairly accurately, and that the entire spectrum of our features contribute to the distinction.

pdf bib
Automatically Assessing Machine Summary Content Without a Gold Standard
Annie Louis | Ani Nenkova
Computational Linguistics, Volume 39, Issue 2 - June 2013

2012

pdf bib
Automatic Metrics for Genre-specific Text Quality
Annie Louis
Proceedings of the NAACL HLT 2012 Student Research Workshop

pdf bib
A Coherence Model Based on Syntactic Patterns
Annie Louis | Ani Nenkova
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

pdf bib
Summarization of Business-Related Tweets: A Concept-Based Approach
Annie Louis | Todd Newman
Proceedings of COLING 2012: Posters

pdf bib
A corpus of general and specific sentences from news
Annie Louis | Ani Nenkova
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present a corpus of sentences from news articles that are annotated as general or specific. We employed annotators on Amazon Mechanical Turk to mark sentences from three kinds of news articles―reports on events, finance news and science journalism. We introduce the resulting corpus, with focus on annotator agreement, proportion of general/specific sentences in the articles and results for automatic classification of the two sentence types.

2011

pdf bib
Automatic identification of general and specific sentences by leveraging discourse annotations
Annie Louis | Ani Nenkova
Proceedings of 5th International Joint Conference on Natural Language Processing

pdf bib
Text Specificity and Impact on Quality of News Summaries
Annie Louis | Ani Nenkova
Proceedings of the Workshop on Monolingual Text-To-Text Generation

2010

pdf bib
Off-topic essay detection using short prompt texts
Annie Louis | Derrick Higgins
Proceedings of the NAACL HLT 2010 Fifth Workshop on Innovative Use of NLP for Building Educational Applications

pdf bib
Using entity features to classify implicit discourse relations
Annie Louis | Aravind Joshi | Rashmi Prasad | Ani Nenkova
Proceedings of the SIGDIAL 2010 Conference

pdf bib
Discourse indicators for content selection in summarization
Annie Louis | Aravind Joshi | Ani Nenkova
Proceedings of the SIGDIAL 2010 Conference

pdf bib
Automatic Evaluation of Linguistic Quality in Multi-Document Summarization
Emily Pitler | Annie Louis | Ani Nenkova
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

pdf bib
Creating Local Coherence: An Empirical Assessment
Annie Louis | Ani Nenkova
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

2009

pdf bib
Performance Confidence Estimation for Automatic Summarization
Annie Louis | Ani Nenkova
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

pdf bib
Automatic sense prediction for implicit discourse relations in text
Emily Pitler | Annie Louis | Ani Nenkova
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

pdf bib
Automatically Evaluating Content Selection in Summarization without Human Models
Annie Louis | Ani Nenkova
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

2008

pdf bib
Can You Summarize This? Identifying Correlates of Input Difficulty for Multi-Document Summarization
Ani Nenkova | Annie Louis
Proceedings of ACL-08: HLT