The High Performance Language Technologies (HPLT) project is a 3-year EU-funded project that started in September 2022. It aims to deliver free, sustainable, and reusable datasets, models, and workflows at scale using high-performance computing. We describe the first results of the project. The data release includes monolingual data in 75 languages at 5.6T tokens and parallel data in 18 language pairs at 96M pairs, derived from 1.8 petabytes of web crawls. Building upon automated and transparent pipelines, the first machine translation (MT) models as well as large language models (LLMs) have been trained and released. Multiple data processing tools and pipelines have also been made public.
This paper describes the University of Edinburgh’s submission to the AmericasNLP 2024 shared task on the translation of Spanish into 11 indigenous American languages. We explore the ability of multilingual Large Language Models (LLMs) to model low-resource languages by continued pre-training with LoRA, and conduct instruction fine-tuning using a variety of datasets, demonstrating that this improves LLM performance. Furthermore, we demonstrate the efficacy of checkpoint averaging alongside decoding techniques like beam search and sampling, resulting in further improvements. We participate in all 11 translation directions.
Attendees will learn about how we use machine translation to provide targeted, high MT quality for content with inline tags. We offer a new and innovative approach to inserting tags into the translated text in a way that reliably preserves their quality. This process can achieve better MT quality and lower costs, as it is MT-independent, and can be used for all languages, MT engines, and use cases.
This paper describes a method of creating synthetic treebanks for cross-lingual dependency parsing using a combination of machine translation (including pivot translation), annotation projection and the spanning tree algorithm. Sentences are first automatically translated from a lesser-resourced language to a number of related highly-resourced languages, parsed and then the annotations are projected back to the lesser-resourced language, leading to multiple trees for each sentence from the lesser-resourced language. The final treebank is created by merging the possible trees into a graph and running the spanning tree algorithm to vote for the best tree for each sentence. We present experiments aimed at parsing Faroese using a combination of Danish, Swedish and Norwegian. In a similar experimental setup to the CoNLL 2018 shared task on dependency parsing we report state-of-the-art results on dependency parsing for Faroese using an off-the-shelf parser.