Penn-Helsinki Parsed Corpus of Early Modern English: First Parsing Results and Analysis

Seth Kulick, Neville Ryant, Beatrice Santorini


Abstract
The Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME), a 1.7-million-word treebank that is an important resource for research in syntactic change, has several properties that present potential challenges for NLP technologies. We describe the key features of PPCEME that make it challenging to parse, including a larger and more varied set of function tags than in the Penn Treebank, and present results for this corpus using a modified version of the Berkeley Neural Parser and the approach to function tag recovery of Gabbard et al. (2006). While this approach to function tag recovery gives reasonable results, it is in some ways inappropriate for span-based parsers. We also present further evidence of the importance of in-domain pretraining for contextualized word representations. The resulting parser will be used to parse Early English Books Online, a 1.5-billion-word corpus whose utility for the study of syntactic change will be greatly increased with the addition of accurate parse trees.
Anthology ID:
2022.findings-naacl.44
Volume:
Findings of the Association for Computational Linguistics: NAACL 2022
Month:
July
Year:
2022
Address:
Seattle, United States
Editors:
Marine Carpuat, Marie-Catherine de Marneffe, Ivan Vladimir Meza Ruiz
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
578–593
URL:
https://aclanthology.org/2022.findings-naacl.44
DOI:
10.18653/v1/2022.findings-naacl.44
Cite (ACL):
Seth Kulick, Neville Ryant, and Beatrice Santorini. 2022. Penn-Helsinki Parsed Corpus of Early Modern English: First Parsing Results and Analysis. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 578–593, Seattle, United States. Association for Computational Linguistics.
Cite (Informal):
Penn-Helsinki Parsed Corpus of Early Modern English: First Parsing Results and Analysis (Kulick et al., Findings 2022)
PDF:
https://aclanthology.org/2022.findings-naacl.44.pdf
Video:
https://aclanthology.org/2022.findings-naacl.44.mp4
Data
Penn Treebank