The Multilingual Corpus of World’s Constitutions (MCWC)

Mo El-Haj, Saad Ezzini


Abstract
The “Multilingual Corpus of World’s Constitutions” (MCWC) serves as a valuable resource for the NLP community, offering a comprehensive collection of constitutions from around the world. Its focus on data quality and breadth of coverage enables advanced research in constitutional analysis, machine translation, and cross-lingual legal studies. The MCWC prepares its data to ensure high quality and minimal noise, while also providing valuable mappings of constitutions to their respective countries and continents, facilitating comparative analysis. Notably, the corpus offers pairwise sentence alignments across languages, supporting machine translation experiments. We utilise a leading Machine Translation model, fine-tuned on the MCWC to achieve accurate and context-aware translations. Additionally, we introduce an independent Machine Translation model as a comparative baseline. Fine-tuning the model on the MCWC improves accuracy, highlighting the significance of such a legal corpus for NLP and Machine Translation. The MCWC’s rich multilingual content and rigorous data quality standards raise the bar for legal text analysis and inspire innovation in the NLP community, opening new avenues for studying constitutional texts and multilingual data analysis.
Anthology ID:
2024.osact-1.7
Volume:
Proceedings of the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT) with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation @ LREC-COLING 2024
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Hend Al-Khalifa, Kareem Darwish, Hamdy Mubarak, Mona Ali, Tamer Elsayed
Venues:
OSACT | WS
SIG:
Publisher:
ELRA and ICCL
Note:
Pages:
57–66
Language:
URL:
https://aclanthology.org/2024.osact-1.7
DOI:
Bibkey:
Cite (ACL):
Mo El-Haj and Saad Ezzini. 2024. The Multilingual Corpus of World’s Constitutions (MCWC). In Proceedings of the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT) with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation @ LREC-COLING 2024, pages 57–66, Torino, Italia. ELRA and ICCL.
Cite (Informal):
The Multilingual Corpus of World’s Constitutions (MCWC) (El-Haj & Ezzini, OSACT-WS 2024)
Copy Citation:
PDF:
https://aclanthology.org/2024.osact-1.7.pdf