OcWikiDisc: a Corpus of Wikipedia Talk Pages in Occitan

Aleksandra Miletic, Yves Scherrer


Abstract
This paper presents OcWikiDisc, a new freely available corpus in Occitan, as well as language identification experiments on Occitan done as part of the corpus building process. Occitan is a regional language spoken mainly in the south of France and in parts of Spain and Italy. It exhibits rich diatopic variation, it is not standardized, and it is still low-resourced, especially when it comes to large downloadable corpora. We introduce OcWikiDisc, a corpus extracted from the talk pages associated with the Occitan Wikipedia. The version of the corpus with the most restrictive language filtering contains 8K user messages for a total of 618K tokens. The language filtering is based on language identification experiments with five off-the-shelf tools, including the new fastText language identification model from Meta AI’s No Language Left Behind initiative, released in July 2022.
Anthology ID:
2022.vardial-1.8
Volume:
Proceedings of the Ninth Workshop on NLP for Similar Languages, Varieties and Dialects
Month:
October
Year:
2022
Address:
Gyeongju, Republic of Korea
Editors:
Yves Scherrer, Tommi Jauhiainen, Nikola Ljubešić, Preslav Nakov, Jörg Tiedemann, Marcos Zampieri
Venue:
VarDial
Publisher:
Association for Computational Linguistics
Pages:
70–79
URL:
https://aclanthology.org/2022.vardial-1.8
Cite (ACL):
Aleksandra Miletic and Yves Scherrer. 2022. OcWikiDisc: a Corpus of Wikipedia Talk Pages in Occitan. In Proceedings of the Ninth Workshop on NLP for Similar Languages, Varieties and Dialects, pages 70–79, Gyeongju, Republic of Korea. Association for Computational Linguistics.
Cite (Informal):
OcWikiDisc: a Corpus of Wikipedia Talk Pages in Occitan (Miletic & Scherrer, VarDial 2022)
PDF:
https://aclanthology.org/2022.vardial-1.8.pdf