Part-of-Speech Tagging for Northern Kurdish

Peshmerge Morad, Sina Ahmadi, Lorenzo Gatti


Abstract
In the growing domain of natural language processing, low-resourced languages like Northern Kurdish remain largely unexplored due to the lack of resources needed to be part of this growth. In particular, the tasks of part-of-speech tagging and tokenization for Northern Kurdish are still insufficiently addressed. In this study, we aim to bridge this gap by evaluating a range of statistical, neural, and fine-tuned models specifically tailored for Northern Kurdish. We leverage limited but valuable datasets, including the Universal Dependencies Kurmanji treebank and a novel manually annotated and tokenized gold-standard dataset consisting of 136 sentences (2,937 tokens). We evaluate several POS tagging models and report that the fine-tuned transformer-based model outperforms the others, achieving an accuracy of 0.87 and a macro-averaged F1 score of 0.77. Data and models are publicly available under an open license at https://github.com/peshmerge/northern-kurdish-pos-tagging
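The abstract reports both accuracy and a macro-averaged F1 score, and the two can diverge noticeably when some tags are rare. The following is a minimal sketch of how these two metrics are commonly computed over token-level tags. It is not taken from the paper's repository: the function names `accuracy` and `macro_f1` and the toy tag sequences are illustrative assumptions, and the authors' actual evaluation code may differ.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-tag F1 over every tag seen in gold or pred."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted tag p was wrong here
            fn[g] += 1  # gold tag g was missed here
    tags = set(gold) | set(pred)
    # F1 = 2*TP / (2*TP + FP + FN); each tag counts equally in the mean.
    scores = [2 * tp[t] / (2 * tp[t] + fp[t] + fn[t]) for t in tags]
    return sum(scores) / len(scores)


# Toy sequences invented for illustration (not data from the paper).
gold = ["NOUN", "NOUN", "NOUN", "VERB", "ADP", "NOUN"]
pred = ["NOUN", "NOUN", "NOUN", "VERB", "NOUN", "NOUN"]

print(f"accuracy = {accuracy(gold, pred):.3f}")
print(f"macro F1 = {macro_f1(gold, pred):.3f}")
```

In this toy case a single error on the rare tag ADP drives its F1 to zero, so the macro average falls well below accuracy. A gap in the same direction appears in the reported scores (0.87 accuracy against 0.77 macro F1), although the sketch alone cannot say which tags account for it in the paper.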
Anthology ID:
2024.mwe-1.11
Volume:
Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Archna Bhatia, Gosse Bouma, A. Seza Doğruöz, Kilian Evang, Marcos Garcia, Voula Giouli, Lifeng Han, Joakim Nivre, Alexandre Rademaker
Venues:
MWE | UDW | WS
SIGs:
SIGLEX | SIGPARSE
Publisher:
ELRA and ICCL
Pages:
70–80
URL:
https://aclanthology.org/2024.mwe-1.11
Cite (ACL):
Peshmerge Morad, Sina Ahmadi, and Lorenzo Gatti. 2024. Part-of-Speech Tagging for Northern Kurdish. In Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024, pages 70–80, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Part-of-Speech Tagging for Northern Kurdish (Morad et al., MWE-UDW-WS 2024)
PDF:
https://aclanthology.org/2024.mwe-1.11.pdf