2023
pdf
bib
abs
AlGhafa Evaluation Benchmark for Arabic Language Models
Ebtesam Almazrouei
|
Ruxandra Cojocaru
|
Michele Baldo
|
Quentin Malartic
|
Hamza Alobeidli
|
Daniele Mazzotta
|
Guilherme Penedo
|
Giulia Campesan
|
Mugariya Farooq
|
Maitha Alhammadi
|
Julien Launay
|
Badreddine Noune
Proceedings of ArabicNLP 2023
Recent advances in the space of Arabic large language models have opened up a wealth of potential practical applications. From optimal training strategies, large scale data acquisition and continuously increasing NLP resources, the Arabic LLM landscape has improved in a very short span of time, despite being plagued by training data scarcity and limited evaluation resources compared to English. In line with contributing towards this ever-growing field, we introduce AlGhafa, a new multiple-choice evaluation benchmark for Arabic LLMs. For showcasing purposes, we train a new suite of models, including a 14 billion parameter model, the largest monolingual Arabic decoder-only model to date. We use a collection of publicly available datasets, as well as a newly introduced HandMade dataset consisting of 8 billion tokens. Finally, we explore the quantitative and qualitative toxicity of several Arabic models, comparing our models to existing public Arabic LLMs.
2022
pdf
bib
abs
PAGnol: An Extra-Large French Generative Model
Julien Launay
|
E.l. Tommasone
|
Baptiste Pannier
|
François Boniface
|
Amélie Chatelain
|
Alessandro Cappelli
|
Iacopo Poli
|
Djamé Seddah
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Access to large pre-trained models of varied architectures, in many different languages, is central to the democratization of NLP. We introduce PAGnol, a collection of French GPT models. Using scaling laws, we efficiently train PAGnol-XL (1.5B parameters) with the same computational budget as CamemBERT, a model 13 times smaller. PAGnol-XL is the largest model trained from scratch for the French language. We plan to train increasingly large and performing versions of PAGnol, exploring the capabilities of French extreme-scale models. For this first release, we focus on the pre-training and scaling calculations underlining PAGnol. We fit a scaling law for compute for the French language, and compare it with its English counterpart. We find the pre-training dataset significantly conditions the quality of the outputs, with common datasets such as OSCAR leading to low-quality offensive text. We evaluate our models on discriminative and generative tasks in French, comparing to other state-of-the-art French and multilingual models, and reaching the state of the art in the abstract summarization task. Our research was conducted on the public GENCI Jean Zay supercomputer, and our models up to the Large are made publicly available.
pdf
bib
abs
What Language Model to Train if You Have One Million GPU Hours?
Teven Le Scao
|
Thomas Wang
|
Daniel Hesslow
|
Stas Bekman
|
M Saiful Bari
|
Stella Biderman
|
Hady Elsahar
|
Niklas Muennighoff
|
Jason Phang
|
Ofir Press
|
Colin Raffel
|
Victor Sanh
|
Sheng Shen
|
Lintang Sutawika
|
Jaesung Tae
|
Zheng Xin Yong
|
Julien Launay
|
Iz Beltagy
Findings of the Association for Computational Linguistics: EMNLP 2022
The crystallization of modeling methods around the Transformer architecture has been a boon for practitioners. Simple, well-motivated architectural variations can transfer across tasks and scale, increasing the impact of modeling research. However, with the emergence of state-of-the-art 100B+ parameters models, large language models are increasingly expensive to accurately design and train. Notably, it can be difficult to evaluate how modeling decisions may impact emergent capabilities, given that these capabilities arise mainly from sheer scale alone.In the process of building BLOOM–the Big Science Large Open-science Open-access Multilingual language model–our goal is to identify an architecture and training setup that makes the best use of our 1,000,000 A100-GPU-hours budget.Specifically, we perform an ablation study at the billion-parameter scale comparing different modeling practices and their impact on zero-shot generalization.In addition, we study the impact of various popular pre-training corpora on zero-shot generalization. We also study the performance of a multilingual model and how it compares to the English-only one. Finally, we consider the scaling behaviour of Transformers to choose the target model size, shape, and training setup. All our models and code are open-sourced at https://huggingface.co/bigscience.
pdf
bib
abs
A Holistic Assessment of the Carbon Footprint of Noor, a Very Large Arabic Language Model
Imad Lakim
|
Ebtesam Almazrouei
|
Ibrahim Abualhaol
|
Merouane Debbah
|
Julien Launay
Proceedings of BigScience Episode #5 -- Workshop on Challenges & Perspectives in Creating Large Language Models
As ever larger language models grow more ubiquitous, it is crucial to consider their environmental impact. Characterised by extreme size and resource use, recent generations of models have been criticised for their voracious appetite for compute, and thus significant carbon footprint. Although reporting of carbon impact has grown more common in machine learning papers, this reporting is usually limited to compute resources used strictly for training. In this work, we propose a holistic assessment of the footprint of an extreme-scale language model, Noor. Noor is an ongoing project aiming to develop the largest multi-task Arabic language models–with up to 13B parameters–leveraging zero-shot generalisation to enable a wide range of downstream tasks via natural language instructions. We assess the total carbon bill of the entire project: starting with data collection and storage costs, including research and development budgets, pretraining costs, future serving estimates, and other exogenous costs necessary for this international cooperation. Notably, we find that inference costs and exogenous factors can have a significant impact on total budget. Finally, we discuss pathways to reduce the carbon footprint of extreme-scale models.