Challenges in Creating a Representative Corpus of Romanian Micro-Blogging Text

Vasile Pais, Maria Mitrofan, Verginica Barbu Mititelu, Elena Irimia, Roxana Micu, Carol Luca Gasan


Abstract
Following the successful creation of a national representative corpus of contemporary Romanian language, we turned our attention to social media text, as found on micro-blogging platforms. In this paper, we present the current activities as well as the challenges faced when trying to apply existing tools (for both annotation and indexing) to a Romanian language micro-blogging corpus. These challenges are encountered at all annotation levels, including tokenization, and at the indexing stage. We consider that existing tools for Romanian language processing must be adapted to recognize features such as emoticons, emojis, hashtags, unusual abbreviations, elongated words (commonly used for emphasis in micro-blogging), multiple words joined together (within or outside hashtags), and code-mixed text.
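As a rough illustration of the kind of features the abstract lists, the following Python sketch flags hashtags, elongated words, and a few ASCII emoticons, and splits a CamelCase hashtag into its component words. The regular expressions and the function names (`microblog_features`, `split_hashtag`) are hypothetical and are not taken from the paper; the authors' actual tools are not described on this page.

```python
import re

# Hypothetical patterns for illustration only; not the authors' method.
HASHTAG = re.compile(r"#\w+")
# A word character repeated three or more times in a row (e.g. "Superrr").
ELONGATED = re.compile(r"(\w)\1{2,}")
# A small set of ASCII emoticons such as :) ;-) =D :P
EMOTICON = re.compile(r"[:;=][-']?[()DPp]")


def microblog_features(text):
    """Return the hashtags, elongated words, and emoticons found in text."""
    return {
        "hashtags": HASHTAG.findall(text),
        "elongated": [w for w in text.split() if ELONGATED.search(w)],
        "emoticons": EMOTICON.findall(text),
    }


def split_hashtag(tag):
    """Split a CamelCase hashtag into words (assumes capitalised word starts)."""
    body = tag.lstrip("#")
    return re.findall(r"[A-ZĂÂÎȘȚ][a-zăâîșț]*|[a-zăâîșț]+|\d+", body)


features = microblog_features("Superrr #BunaDimineata :)")
words = split_hashtag("#BunaDimineata")
```

A heuristic like this would miss hashtags written entirely in lowercase and would not cover emojis or code-mixed text, which is consistent with the abstract's point that dedicated adaptation of existing tools is needed.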
Anthology ID:
2022.cmlc-1.1
Volume:
Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-10)
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Piotr Banski, Adrien Barbaresi, Simon Clematide, Marc Kupietz, Harald Lüngen
Venue:
CMLC
Publisher:
European Language Resources Association
Pages:
1–7
URL:
https://aclanthology.org/2022.cmlc-1.1
Cite (ACL):
Vasile Pais, Maria Mitrofan, Verginica Barbu Mititelu, Elena Irimia, Roxana Micu, and Carol Luca Gasan. 2022. Challenges in Creating a Representative Corpus of Romanian Micro-Blogging Text. In Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-10), pages 1–7, Marseille, France. European Language Resources Association.
Cite (Informal):
Challenges in Creating a Representative Corpus of Romanian Micro-Blogging Text (Pais et al., CMLC 2022)
PDF:
https://aclanthology.org/2022.cmlc-1.1.pdf
Data
LegalNERo