2023
Generating EDU Extracts for Plan-Guided Summary Re-Ranking
Griffin Adams | Alex Fabbri | Faisal Ladhak | Noémie Elhadad | Kathleen McKeown
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Two-step approaches, in which summary candidates are generated-then-reranked to return a single summary, can improve ROUGE scores over the standard single-step approach. Yet, standard decoding methods (i.e., beam search, nucleus sampling, and diverse beam search) produce candidates with redundant, and often low-quality, content. In this paper, we design a novel method to generate candidates for re-ranking that addresses these issues. We ground each candidate abstract on its own unique content plan and generate distinct plan-guided abstracts using a model’s top beam. More concretely, a standard language model (a BART LM) auto-regressively generates elemental discourse unit (EDU) content plans with an extractive copy mechanism. The top K beams from the content plan generator are then used to guide a separate LM, which produces a single abstractive candidate for each distinct plan. We apply an existing re-ranker (BRIO) to abstractive candidates generated from our method, as well as baseline decoding methods. We show large relevance improvements over previously published methods on widely used single document news article corpora, with ROUGE-2 F1 gains of 0.88, 2.01, and 0.38 on CNN / DailyMail, NYT, and XSum, respectively. A human evaluation on CNN / DM validates these results. Similarly, on 1k samples from CNN / DM, we show that prompting GPT-3 to follow EDU plans outperforms sampling-based methods by 1.05 ROUGE-2 F1 points. Code to generate and realize plans is available at https://github.com/griff4692/edu-sum.
What are the Desired Characteristics of Calibration Sets? Identifying Correlates on Long Form Scientific Summarization
Griffin Adams | Bichlien Nguyen | Jake Smith | Yingce Xia | Shufang Xie | Anna Ostropolets | Budhaditya Deb | Yuan-Jyue Chen | Tristan Naumann | Noémie Elhadad
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Summarization models often generate text that is poorly calibrated to quality metrics because they are trained to maximize the likelihood of a single reference (MLE). To address this, recent work has added a calibration step, which exposes a model to its own ranked outputs to improve relevance or, in a separate line of work, contrasts positive and negative sets to improve faithfulness. While effective, much of this work has focused on how to generate and optimize these sets. Less is known about why one setup is more effective than another. In this work, we uncover the underlying characteristics of effective sets. For each training instance, we form a large, diverse pool of candidates and systematically vary the subsets used for calibration fine-tuning. Each selection strategy targets distinct aspects of the sets, such as lexical diversity or the size of the gap between positives and negatives. On three diverse scientific long-form summarization datasets (spanning biomedical, clinical, and chemical domains), we find, among other results, that faithfulness calibration is optimal when the negative sets are extractive and more likely to be generated, whereas for relevance calibration, the metric margin between candidates should be maximized and surprise (the disagreement between model-defined and metric-defined candidate rankings) minimized.
From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting
Griffin Adams | Alex Fabbri | Faisal Ladhak | Eric Lehman | Noémie Elhadad
Proceedings of the 4th New Frontiers in Summarization Workshop
Selecting the “right” amount of information to include in a summary is a difficult task. A good summary should be detailed and entity-centric without being overly dense and hard to follow. To better understand this tradeoff, we solicit increasingly dense GPT-4 summaries with what we refer to as a “Chain of Density” (CoD) prompt. Specifically, GPT-4 generates an initial entity-sparse summary before iteratively incorporating missing salient entities without increasing the length. Summaries generated by CoD are more abstractive, exhibit more fusion, and have less of a lead bias than GPT-4 summaries generated by a vanilla prompt. We conduct a human preference study on 100 CNN DailyMail articles and find that humans prefer GPT-4 summaries that are more dense than those generated by a vanilla prompt and almost as dense as human written summaries. Qualitative analysis supports the notion that there exists a tradeoff between informativeness and readability. 500 annotated CoD summaries, as well as an extra 5,000 unannotated summaries, are freely available on HuggingFace (https://huggingface.co/datasets/griffin/chain_of_density).
2022
Learning to Revise References for Faithful Summarization
Griffin Adams | Han-Chin Shing | Qing Sun | Christopher Winestock | Kathleen McKeown | Noémie Elhadad
Findings of the Association for Computational Linguistics: EMNLP 2022
In real-world scenarios with naturally occurring datasets, reference summaries are noisy and may contain information that cannot be inferred from the source text. On large news corpora, removing low quality samples has been shown to reduce model hallucinations. Yet, for smaller, and/or noisier corpora, filtering is detrimental to performance. To improve reference quality while retaining all data, we propose a new approach: to selectively re-write unsupported reference sentences to better reflect source data. We automatically generate a synthetic dataset of positive and negative revisions by corrupting supported sentences and learn to revise reference sentences with contrastive learning. The intensity of revisions is treated as a controllable attribute so that, at inference, diverse candidates can be over-generated-then-rescored to balance faithfulness and abstraction. To test our methods, we extract noisy references from publicly available MIMIC-III discharge summaries for the task of hospital-course summarization, and vary the data on which models are trained. According to metrics and human evaluation, models trained on revised clinical references are much more faithful, informative, and fluent than models trained on original or filtered data.
2021
What’s in a Summary? Laying the Groundwork for Advances in Hospital-Course Summarization
Griffin Adams | Emily Alsentzer | Mert Ketenci | Jason Zucker | Noémie Elhadad
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Summarization of clinical narratives is a long-standing research problem. Here, we introduce the task of hospital-course summarization. Given the documentation authored throughout a patient’s hospitalization, generate a paragraph that tells the story of the patient admission. We construct an English, text-to-text dataset of 109,000 hospitalizations (2M source notes) and their corresponding summary proxy: the clinician-authored “Brief Hospital Course” paragraph written as part of a discharge note. Exploratory analyses reveal that the BHC paragraphs are highly abstractive with some long extracted fragments; are concise yet comprehensive; differ in style and content organization from the source notes; exhibit minimal lexical cohesion; and represent silver-standard references. Our analysis identifies multiple implications for modeling this complex, multi-document summarization task.
2015
SemEval-2015 Task 14: Analysis of Clinical Text
Noémie Elhadad | Sameer Pradhan | Sharon Gorman | Suresh Manandhar | Wendy Chapman | Guergana Savova
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)
A convex and feature-rich discriminative approach to dependency grammar induction
Édouard Grave | Noémie Elhadad
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
2014
Terminology Questions in Texts Authored by Patients
Noemie Elhadad
Proceedings of the 4th International Workshop on Computational Terminology (Computerm)
SemEval-2014 Task 7: Analysis of Clinical Text
Sameer Pradhan | Noémie Elhadad | Wendy Chapman | Suresh Manandhar | Guergana Savova
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)
Cross-narrative Temporal Ordering of Medical Events
Preethi Raghavan | Eric Fosler-Lussier | Noémie Elhadad | Albert M. Lai
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
2011
Putting it Simply: a Context-Aware Approach to Lexical Simplification
Or Biran | Samuel Brody | Noémie Elhadad
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies
2010
Cancer Stage Prediction Based on Patient Online Discourse
Mukund Jha | Noémie Elhadad
Proceedings of the 2010 Workshop on Biomedical Natural Language Processing
A Comparison of Features for Automatic Readability Assessment
Lijun Feng | Martin Jansche | Matt Huenerfauth | Noémie Elhadad
Coling 2010: Posters
An Unsupervised Aspect-Sentiment Model for Online Reviews
Samuel Brody | Noemie Elhadad
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics
2009
Cognitively Motivated Features for Readability Assessment
Lijun Feng | Noémie Elhadad | Matt Huenerfauth
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)
2007
Mining a Lexicon of Technical Terms and Lay Equivalents
Noemie Elhadad | Komal Sutaria
Biological, translational, and clinical language processing
2003
Sentence Alignment for Monolingual Comparable Corpora
Regina Barzilay | Noemie Elhadad
Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing
2002
Collection and linguistic processing of a large-scale corpus of medical articles
Simone Teufel | Noemie Elhadad
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)
2001
Sentence Ordering in Multidocument Summarization
Regina Barzilay | Noemie Elhadad | Kathleen R. McKeown
Proceedings of the First International Conference on Human Language Technology Research