Enhanced Entity Annotations for Multilingual Corpora

Michael Strobl, Amine Trabelsi, Osmar Zaïane


Abstract
Modern approaches in Natural Language Processing (NLP) ideally require large amounts of labelled data for model training. However, new language resources for tasks such as Named Entity Recognition (NER), Co-reference Resolution (CR), Entity Linking (EL) and Relation Extraction (RE), to name a few of the most popular tasks in NLP, have always been challenging to create, since manual text annotations can be very time-consuming to acquire. While an acceptable amount of labelled data may be available for some of these tasks in one language, datasets may be lacking in another. WEXEA is a tool to exhaustively annotate entities in the English Wikipedia. Guidelines for editors of Wikipedia articles result, on the one hand, in only a few annotations through hyperlinks, but on the other hand, make it easier to exhaustively annotate the rest of these articles with entities than starting from scratch. We propose the following main improvements to WEXEA: creating multilingual corpora, improving entity annotations using a proven NER system, and annotating dates and times. A brief evaluation of the annotation quality of WEXEA is also included.
Anthology ID:
2022.lrec-1.398
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
3732–3740
URL:
https://aclanthology.org/2022.lrec-1.398
Cite (ACL):
Michael Strobl, Amine Trabelsi, and Osmar Zaïane. 2022. Enhanced Entity Annotations for Multilingual Corpora. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 3732–3740, Marseille, France. European Language Resources Association.
Cite (Informal):
Enhanced Entity Annotations for Multilingual Corpora (Strobl et al., LREC 2022)
PDF:
https://aclanthology.org/2022.lrec-1.398.pdf
Code:
mjstrobl/wexea