2024
ZAEBUC-Spoken: A Multilingual Multidialectal Arabic-English Speech Corpus
Injy Hamed | Fadhl Eryani | David Palfreyman | Nizar Habash
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
We present ZAEBUC-Spoken, a multilingual multidialectal Arabic-English speech corpus. The corpus comprises twelve hours of Zoom meetings involving multiple speakers role-playing a work situation where Students brainstorm ideas for a certain topic and then discuss it with an Interlocutor. The meetings cover different topics and are divided into phases with different language setups. The corpus presents a challenging set for automatic speech recognition (ASR), including two languages (Arabic and English) with Arabic spoken in multiple variants (Modern Standard Arabic, Gulf Arabic, and Egyptian Arabic) and English used with various accents. Adding to the complexity of the corpus, there is also code-switching between these languages and dialects. As part of our work, we take inspiration from established sets of transcription guidelines to present a set of guidelines handling issues of conversational speech, code-switching and orthography of both languages. We further enrich the corpus with two layers of annotations: (1) dialectness level annotation for the portion of the corpus where mixing occurs between different variants of Arabic, and (2) automatic morphological annotations, including tokenization, lemmatization, and part-of-speech tagging.
Proceedings of The Second Arabic Natural Language Processing Conference
Nizar Habash | Houda Bouamor | Ramy Eskander | Nadi Tomeh | Ibrahim Abu Farha | Ahmed Abdelali | Samia Touileb | Injy Hamed | Yaser Onaizan | Bashar Alhafni | Wissam Antoun | Salam Khalifa | Hatem Haddad | Imed Zitouni | Badr AlKhamissi | Rawan Almatham | Khalil Mrini
Proceedings of The Second Arabic Natural Language Processing Conference
NADI 2024: The Fifth Nuanced Arabic Dialect Identification Shared Task
Muhammad Abdul-Mageed | Amr Keleg | AbdelRahim Elmadany | Chiyu Zhang | Injy Hamed | Walid Magdy | Houda Bouamor | Nizar Habash
Proceedings of The Second Arabic Natural Language Processing Conference
We describe the findings of the fifth Nuanced Arabic Dialect Identification Shared Task (NADI 2024). NADI’s objective is to help advance SoTA Arabic NLP by providing guidance, datasets, modeling opportunities, and standardized evaluation conditions that allow researchers to collaboratively compete on prespecified tasks. NADI 2024 targeted dialect identification cast as a multi-label task (Subtask 1), identification of the Arabic level of dialectness (Subtask 2), and dialect-to-MSA machine translation (Subtask 3). A total of 51 unique teams registered for the shared task, of which 12 participated (with 76 valid submissions during the test phase). Among these, three teams participated in Subtask 1, three in Subtask 2, and eight in Subtask 3. The winning teams achieved 50.57 F1 on Subtask 1, 0.1403 RMSE on Subtask 2, and 20.44 BLEU on Subtask 3, respectively. Results show that Arabic dialect processing tasks such as dialect identification and machine translation remain challenging. We describe the methods employed by the participating teams and briefly offer an outlook for NADI.
2023
Data Augmentation Techniques for Machine Translation of Code-Switched Texts: A Comparative Study
Injy Hamed | Nizar Habash | Thang Vu
Findings of the Association for Computational Linguistics: EMNLP 2023
Code-switching (CSW) text generation has been receiving increasing attention as a solution to address data scarcity. In light of this growing interest, we need more comprehensive studies comparing different augmentation approaches. In this work, we compare three popular approaches: lexical replacements, linguistic theories, and back-translation (BT), in the context of Egyptian Arabic-English CSW. We assess the effectiveness of the approaches on machine translation and the quality of augmentations through human evaluation. We show that BT and CSW predictive-based lexical replacement, being trained on CSW parallel data, perform best on both tasks. Linguistic theories and random lexical replacement prove to be effective in the absence of CSW parallel data, where both approaches achieve similar results.
Investigating Lexical Replacements for Arabic-English Code-Switched Data Augmentation
Injy Hamed | Nizar Habash | Slim Abdennadher | Ngoc Thang Vu
Proceedings of the Sixth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2023)
Data sparsity is a main problem hindering the development of code-switching (CS) NLP systems. In this paper, we investigate data augmentation techniques for synthesizing dialectal Arabic-English CS text. We perform lexical replacements using word-aligned parallel corpora where CS points are either randomly chosen or learnt using a sequence-to-sequence model. We compare these approaches against dictionary-based replacements. We assess the quality of generated sentences through human evaluation and evaluate the effectiveness of data augmentation on machine translation (MT), automatic speech recognition (ASR), and speech translation (ST) tasks. Results show that using a predictive model results in more natural CS sentences compared to the random approach, as reported in human judgements. In the downstream tasks, despite the random approach generating more data, both approaches perform equally (outperforming dictionary-based replacements). Overall, data augmentation achieves a 34% improvement in perplexity, a 5.2% relative improvement in WER on the ASR task, +4.0-5.1 BLEU points on the MT task, and +2.1-2.2 BLEU points on the ST task over a baseline trained on available data without augmentation.
Exploring Segmentation Approaches for Neural Machine Translation of Code-Switched Egyptian Arabic-English Text
Marwa Gaser | Manuel Mager | Injy Hamed | Nizar Habash | Slim Abdennadher | Ngoc Thang Vu
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics
Data sparsity is one of the main challenges posed by code-switching (CS), which is further exacerbated in the case of morphologically rich languages. For the task of machine translation (MT), morphological segmentation has proven successful in alleviating data sparsity in monolingual contexts; however, it has not been investigated for CS settings. In this paper, we study the effectiveness of different segmentation approaches on MT performance, covering morphology-based and frequency-based segmentation techniques. We experiment on MT from code-switched Arabic-English to English. We provide detailed analysis, examining a variety of conditions, such as data size and sentences with different degrees of CS. Empirical results show that morphology-aware segmenters perform the best in segmentation tasks but under-perform in MT. Nevertheless, we find that the choice of the segmentation setup to use for MT is highly dependent on the data size. For extreme low-resource scenarios, a combination of frequency and morphology-based segmentations is shown to perform the best. For more resourced settings, such a combination does not bring significant improvements over the use of frequency-based segmentation.
2022
ArzEn-ST: A Three-way Speech Translation Corpus for Code-Switched Egyptian Arabic-English
Injy Hamed | Nizar Habash | Slim Abdennadher | Ngoc Thang Vu
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)
We present our work on collecting ArzEn-ST, a code-switched Egyptian Arabic-English Speech Translation Corpus. This corpus is an extension of the ArzEn speech corpus, which was collected through informal interviews with bilingual speakers. In this work, we collect translations in both directions, monolingual Egyptian Arabic and monolingual English, forming a three-way speech translation corpus. We make the translation guidelines and corpus publicly available. We also report results for baseline systems for machine translation and speech translation tasks. We believe this is a valuable resource that can motivate and facilitate further research studying the code-switching phenomenon from a linguistic perspective and can be used to train and evaluate NLP systems.
2020
Cairo Student Code-Switch (CSCS) Corpus: An Annotated Egyptian Arabic-English Corpus
Mohamed Balabel | Injy Hamed | Slim Abdennadher | Ngoc Thang Vu | Özlem Çetinoğlu
Proceedings of the Twelfth Language Resources and Evaluation Conference
Code-switching has become a prevalent phenomenon across many communities. It poses a challenge to NLP researchers, mainly due to the lack of available data needed for training and testing applications. In this paper, we introduce a new resource: a corpus of Egyptian Arabic code-switch speech data that is fully tokenized, lemmatized and annotated for part-of-speech tags. Besides the corpus itself, we provide annotation guidelines to address the unique challenges of annotating code-switch data. Another challenge that we address is the fact that Egyptian Arabic orthography and grammar are not standardized.
ArzEn: A Speech Corpus for Code-switched Egyptian Arabic-English
Injy Hamed | Ngoc Thang Vu | Slim Abdennadher
Proceedings of the Twelfth Language Resources and Evaluation Conference
In this paper, we present our ArzEn corpus, an Egyptian Arabic-English code-switching (CS) spontaneous speech corpus. The corpus is collected through informal interviews with 38 Egyptian bilingual university students and employees held in a soundproof room. A total of 12 hours are recorded, transcribed, validated and sentence segmented. The corpus is mainly designed to be used in Automatic Speech Recognition (ASR) systems; however, it also provides a useful resource for analyzing the CS phenomenon from linguistic, sociological, and psychological perspectives. In this paper, we first discuss the CS phenomenon in Egypt and the factors that gave rise to the current language. We then provide a detailed description of how the corpus was collected, giving an overview of the participants involved. We also present statistics on the CS involved in the corpus, as well as a summary of the effort exerted in the corpus development, in terms of the number of hours required for transcription, validation, segmentation and speaker annotation. Finally, we discuss some factors contributing to the complexity of the corpus, as well as Arabic-English CS behaviour that could pose potential challenges to ASR systems.
2018
Collection and Analysis of Code-switch Egyptian Arabic-English Speech Corpus
Injy Hamed | Mohamed Elmahdy | Slim Abdennadher
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2017
A Game with a Purpose for Automatic Detection of Children’s Speech Disabilities using Limited Speech Resources
Reem Salem | Mohamed Elmahdy | Slim Abdennadher | Injy Hamed
Proceedings of the 1st Workshop on Natural Language Processing and Information Retrieval associated with RANLP 2017
Speech therapists and researchers are becoming more concerned with the use of computer-based systems in the therapy of speech disorders. In this paper, we propose a computer-based game with a purpose (GWAP) for speech therapy of Egyptian-speaking children suffering from Dyslalia. Our aim is to detect whether a certain phoneme is pronounced correctly. An Egyptian Arabic speech corpus has been collected. A baseline acoustic model was trained using the Egyptian corpus. In order to benefit from existing large amounts of Modern Standard Arabic (MSA) resources, MSA acoustic models were adapted with the collected Egyptian corpus. An independent testing set that covers common speech disorders has been collected for Egyptian speakers. Results show that adapted acoustic models give better recognition accuracy, which could be relied on in the game, and that children show more interest in playing the game than in visiting the therapist. Noticeable progress in the children’s Dyslalia was observed with the proposed system.