BaSCo: An Annotated Basque-Spanish Code-Switching Corpus for Natural Language Understanding

Maia Aguirre, Laura García-Sardiña, Manex Serras, Ariane Méndez, Jacobo López


Abstract
The main objective of this work is the creation and public release of BaSCo, the first corpus with annotated linguistic resources covering Basque-Spanish code-switching. The mixing of Basque and Spanish within the same utterance, popularly referred to as Euskañol, is a widespread phenomenon among bilingual speakers in the Basque Country. This corpus has been created to meet the demand for annotated linguistic resources in Euskañol in research areas such as multilingual dialogue systems. The resource is the result of translating into Euskañol a compilation of Basque and Spanish texts that were used to train the Natural Language Understanding (NLU) models of several task-oriented bilingual chatbots. Those chatbots were meant to answer specific questions in the administration, fiscal, and transport domains. In addition, across all domains, they could respond to greetings, requests for help, and chit-chat questions. BaSCo is a compendium of 1377 tagged utterances, with every sample annotated at three levels: (i) NLU semantic labels, covering intents and entities, (ii) code-switching proportion, and (iii) domain of origin.
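The three annotation levels listed in the abstract can be pictured as one record per utterance. The sketch below is only an illustration under stated assumptions: the class and field names, the intent label, the sample utterance, and the token-share measure are all hypothetical and are not taken from the corpus or its release format. In particular, the abstract does not say how the code-switching proportion is computed; the share of Spanish-tagged tokens is used here purely as a stand-in.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entity:
    """One entity annotation (hypothetical layout, not the corpus schema)."""
    text: str
    label: str


@dataclass
class AnnotatedUtterance:
    """One utterance with the three annotation levels named in the abstract."""
    text: str
    intent: str                       # level (i): NLU intent
    entities: List[Entity] = field(default_factory=list)  # level (i): entities
    cs_proportion: float = 0.0        # level (ii): code-switching proportion
    domain: str = ""                  # level (iii): domain of origin


def spanish_token_share(token_langs: List[str]) -> float:
    """Assumed stand-in measure: fraction of tokens tagged 'es' among
    tokens tagged either 'eu' (Basque) or 'es' (Spanish)."""
    tagged = [lang for lang in token_langs if lang in ("eu", "es")]
    if not tagged:
        return 0.0
    return sum(1 for lang in tagged if lang == "es") / len(tagged)


# Invented placeholder utterance, NOT a sample from BaSCo.
# "Kaixo" is Basque; "necesito ayuda" is Spanish.
example = AnnotatedUtterance(
    text="Kaixo, necesito ayuda",
    intent="request_help",            # hypothetical intent label
    entities=[],
    cs_proportion=spanish_token_share(["eu", "es", "es"]),
    domain="transport",               # one of the domains named in the abstract
)

if __name__ == "__main__":
    print(example)
```

Keeping the proportion as a plain float next to the intent and domain would make it straightforward to filter or bucket utterances by how heavily they mix the two languages, whatever measure the corpus actually uses.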
Anthology ID:
2022.lrec-1.338
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
3158–3163
URL:
https://aclanthology.org/2022.lrec-1.338
Cite (ACL):
Maia Aguirre, Laura García-Sardiña, Manex Serras, Ariane Méndez, and Jacobo López. 2022. BaSCo: An Annotated Basque-Spanish Code-Switching Corpus for Natural Language Understanding. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 3158–3163, Marseille, France. European Language Resources Association.
Cite (Informal):
BaSCo: An Annotated Basque-Spanish Code-Switching Corpus for Natural Language Understanding (Aguirre et al., LREC 2022)
PDF:
https://aclanthology.org/2022.lrec-1.338.pdf