2023
Do “English” Named Entity Recognizers Work Well on Global Englishes?
Alexander Shan | John Bauer | Riley Carlson | Christopher Manning
Findings of the Association for Computational Linguistics: EMNLP 2023
The vast majority of the popular English named entity recognition (NER) datasets contain American or British English data, despite the existence of many global varieties of English. As such, it is unclear whether they generalize for analyzing use of English globally. To test this, we build a newswire dataset, the Worldwide English NER Dataset, to analyze NER model performance on low-resource English variants from around the world. We test widely used NER toolkits and transformer models, including models using the pre-trained contextual models RoBERTa and ELECTRA, on three datasets: CoNLL 2003, a commonly used British English newswire dataset; OntoNotes, a more American-focused dataset; and our global dataset. All models trained on the CoNLL or OntoNotes datasets experienced significant performance drops (over 10 F1 in some cases) when tested on the Worldwide English dataset. Upon examination of region-specific errors, we observe the greatest performance drops for Oceania and Africa, while Asia and the Middle East had comparatively strong performance. Lastly, we find that a combined model trained on the Worldwide dataset and either CoNLL or OntoNotes lost only 1-2 F1 on both test sets.
Semgrex and Ssurgeon, Searching and Manipulating Dependency Graphs
John Bauer | Chloé Kiddon | Eric Yeh | Alex Shan | Christopher D. Manning
Proceedings of the 21st International Workshop on Treebanks and Linguistic Theories (TLT, GURT/SyntaxFest 2023)
Searching dependency graphs and manipulating them can be a time-consuming and challenging task to get right. We document Semgrex, a system for searching dependency graphs, and introduce Ssurgeon, a system for manipulating the output of Semgrex. The compact language used by these systems allows for easy command-line or API processing of dependencies. Additionally, integration with publicly released toolkits in Java and Python allows for searching text relations and attributes over natural text.
2014
A Gold Standard Dependency Corpus for English
Natalia Silveira | Timothy Dozat | Marie-Catherine de Marneffe | Samuel Bowman | Miriam Connor | John Bauer | Chris Manning
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
We present a gold standard annotation of syntactic dependencies in the English Web Treebank corpus using the Stanford Dependencies formalism. This resource addresses the lack of a gold standard dependency treebank for English, as well as the limited availability of gold standard syntactic annotations for English informal text genres. We also present experiments on the use of this resource, both for training dependency parsers and for evaluating the quality of different versions of the Stanford Parser, which includes a converter tool to produce dependency annotation from constituency trees. We show that training a dependency parser on a mix of newswire and web data leads to better performance on that type of data without hurting performance on newswire text, and therefore gold standard annotations for non-canonical text can be a valuable resource for parsing. Furthermore, the systematic annotation effort has informed both the SD formalism and its implementation in the Stanford Parser’s dependency converter. In response to the challenges encountered by annotators in the EWT corpus, the formalism has been revised and extended, and the converter has been improved.
The Stanford CoreNLP Natural Language Processing Toolkit
Christopher Manning | Mihai Surdeanu | John Bauer | Jenny Finkel | Steven Bethard | David McClosky
Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations
2013
Parsing with Compositional Vector Grammars
Richard Socher | John Bauer | Christopher D. Manning | Andrew Y. Ng
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Feature-Rich Phrase-based Translation: Stanford University’s Submission to the WMT 2013 Translation Task
Spence Green | Daniel Cer | Kevin Reschke | Rob Voigt | John Bauer | Sida Wang | Natalia Silveira | Julia Neidert | Christopher D. Manning
Proceedings of the Eighth Workshop on Statistical Machine Translation
2011
Multiword Expression Identification with Tree Substitution Grammars: A Parsing tour de force with French
Spence Green | Marie-Catherine de Marneffe | John Bauer | Christopher D. Manning
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing