A Second Wave of UD Hebrew Treebanking and Cross-Domain Parsing

Amir Zeldes, Nick Howell, Noam Ordan, Yifat Ben Moshe


Abstract
Foundational Hebrew NLP tasks such as segmentation, tagging and parsing have relied to date on various versions of the Hebrew Treebank (HTB, Sima’an et al. 2001). However, the data in HTB, a single-source newswire corpus, is now over 30 years old and does not cover many aspects of contemporary Hebrew on the web. This paper presents a new, freely available UD treebank of Hebrew, stratified across a range of topics selected from Hebrew Wikipedia. In addition to introducing the corpus and evaluating the quality of its annotations, we deploy automatic validation tools based on grew (Guillaume, 2021) and conduct the first cross-domain parsing experiments in Hebrew. We obtain new state-of-the-art (SOTA) results on UD NLP tasks, using a combination of the latest language modelling and some incremental improvements to existing transformer-based approaches. We also release a new version of the UD HTB that matches the annotation scheme updates from our new corpus.
Anthology ID:
2022.emnlp-main.292
Volume:
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates
Editors:
Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
4331–4344
URL:
https://aclanthology.org/2022.emnlp-main.292
DOI:
10.18653/v1/2022.emnlp-main.292
Cite (ACL):
Amir Zeldes, Nick Howell, Noam Ordan, and Yifat Ben Moshe. 2022. A Second Wave of UD Hebrew Treebanking and Cross-Domain Parsing. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 4331–4344, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal):
A Second Wave of UD Hebrew Treebanking and Cross-Domain Parsing (Zeldes et al., EMNLP 2022)
PDF:
https://aclanthology.org/2022.emnlp-main.292.pdf
Dataset:
2022.emnlp-main.292.dataset.zip