2024
The Multilingual Corpus of World’s Constitutions (MCWC)
Mo El-Haj
|
Saad Ezzini
Proceedings of the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT) with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation @ LREC-COLING 2024
The “Multilingual Corpus of World’s Constitutions” (MCWC) serves as a valuable resource for the NLP community, offering a comprehensive collection of constitutions from around the world. Its focus on data quality and breadth of coverage enables advanced research in constitutional analysis, machine translation, and cross-lingual legal studies. The MCWC’s data are prepared to ensure high quality and minimal noise, and the corpus also provides valuable mappings of constitutions to their respective countries and continents, facilitating comparative analysis. Notably, the corpus offers pairwise sentence alignments across languages, supporting machine translation experiments. We utilise a leading Machine Translation model, fine-tuned on the MCWC, to achieve accurate and context-aware translations. Additionally, we introduce an independent Machine Translation model as a comparative baseline. Fine-tuning the model on the MCWC improves accuracy, highlighting the significance of such a legal corpus for NLP and Machine Translation. The MCWC’s rich multilingual content and rigorous data quality standards raise the bar for legal text analysis and inspire innovation in the NLP community, opening new avenues for studying constitutional texts and multilingual data analysis.
AraFinNLP 2024: The First Arabic Financial NLP Shared Task
Sanad Malaysha
|
Mo El-Haj
|
Saad Ezzini
|
Mohammed Khalilia
|
Mustafa Jarrar
|
Sultan Almujaiwel
|
Ismail Berrada
|
Houda Bouamor
Proceedings of The Second Arabic Natural Language Processing Conference
The expanding financial markets of the Arab world require sophisticated Arabic NLP tools. To address this need within the banking domain, the Arabic Financial NLP (AraFinNLP) shared task proposes two subtasks: (i) Multi-dialect Intent Detection and (ii) Cross-dialect Translation and Intent Preservation. This shared task uses the updated ArBanking77 dataset, which includes about 39k parallel queries in MSA and four dialects. Each query is labeled with one or more of 77 common intents in the banking domain. These resources aim to foster the development of robust financial Arabic NLP, particularly in the areas of machine translation and banking chatbots. A total of 45 unique teams registered for this shared task, with 11 of them actively participating in the test phase. Specifically, 11 teams participated in Subtask 1, while only one team participated in Subtask 2. The winning team of Subtask 1 achieved an F1 score of 0.8773, and the only team that submitted to Subtask 2 achieved a BLEU score of 1.667.
2023
Comparing Pre-Training Schemes for Luxembourgish BERT Models
Cedric Lothritz
|
Saad Ezzini
|
Christoph Purschke
|
Tegawendé Bissyandé
|
Jacques Klein
|
Isabella Olariu
|
Andrey Boytsov
|
Clément LeFebvre
|
Anne Goujon
Proceedings of the 19th Conference on Natural Language Processing (KONVENS 2023)
Evaluating the Impact of Text De-Identification on Downstream NLP Tasks
Cedric Lothritz
|
Bertrand Lebichot
|
Kevin Allix
|
Saad Ezzini
|
Tegawendé Bissyandé
|
Jacques Klein
|
Andrey Boytsov
|
Clément Lefebvre
|
Anne Goujon
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
Data anonymisation is often required to comply with regulations when transferring information across departments or entities. However, the risk is that this procedure can distort the data and jeopardise the models built on it. Intuitively, training an NLP model on anonymised data may lower the performance of the resulting model compared to a model trained on non-anonymised data. In this paper, we investigate the impact of de-identification on the performance of nine downstream NLP tasks. We focus on the anonymisation and pseudonymisation of personal names and compare six different anonymisation strategies for two state-of-the-art pre-trained models. Based on these experiments, we formulate recommendations on how de-identification should be performed to guarantee accurate NLP models. Our results reveal that de-identification does have a negative impact on the performance of NLP models, but this impact is relatively low. We also find that using pseudonymisation techniques involving random names leads to better performance across most tasks.