2024
Data Contamination Report from the 2024 CONDA Shared Task
Oscar Sainz | Iker García-Ferrero | Alon Jacovi | Jon Ander Campos | Yanai Elazar | Eneko Agirre | Yoav Goldberg | Wei-Lin Chen | Jenny Chim | Leshem Choshen | Luca D’Amico-Wong | Melissa Dell | Run-Ze Fan | Shahriar Golchin | Yucheng Li | Pengfei Liu | Bhavish Pahwa | Ameya Prabhu | Suryansh Sharma | Emily Silcock | Kateryna Solonko | David Stap | Mihai Surdeanu | Yu-Min Tseng | Vishaal Udandarao | Zengzhi Wang | Ruijie Xu | Jinglin Yang
Proceedings of the 1st Workshop on Data Contamination (CONDA)
The 1st Workshop on Data Contamination (CONDA 2024) focuses on all relevant aspects of data contamination in natural language processing, where data contamination is understood as situations where evaluation data is included in pre-training corpora used to train large-scale models, compromising evaluation results. The workshop fostered a shared task to collect evidence on data contamination in currently available datasets and models. The goal of the shared task and associated database is to assist the community in understanding the extent of the problem and to help researchers avoid reporting evaluation results on known contaminated resources. The shared task provides a structured, centralized public database for the collection of contamination evidence, open to contributions from the community via GitHub pull requests. This first compilation paper is based on 566 reported entries over 91 contaminated sources from a total of 23 contributors. The details of the individual contamination events are available on the platform, which remains online and open to contributions from the community.
How Far can 100 Samples Go? Unlocking Zero-Shot Translation with Tiny Multi-Parallel Data
Di Wu | Shaomu Tan | Yan Meng | David Stap | Christof Monz
Findings of the Association for Computational Linguistics ACL 2024
Zero-shot translation aims to translate between language pairs not seen during training in Multilingual Machine Translation (MMT) and is widely considered an open problem. A common, albeit resource-consuming, solution is to add as many related translation directions as possible to the training corpus. In this paper, we show that for an English-centric model, surprisingly large zero-shot improvements can be achieved by simply fine-tuning with a very small amount of multi-parallel data. For example, on the EC30 dataset, we obtain an overall improvement of up to +21.7 ChrF++ across the 870 non-English directions by using only 100 multi-parallel samples, while preserving English-centric translation quality. This performance exceeds M2M100 by an average of 5.9 ChrF++ in the involved non-English directions. When investigating the effect of fine-tuning data size on translation quality, we find that even a small, randomly sampled set of fine-tuning directions is sufficient to achieve comparable improvements. The resulting non-English performance is close to the complete translation upper bound. Even in a minimal setting, fine-tuning with a single sample, the well-known off-target issue is almost completely resolved, which explains part, but not all, of the observed improvements in translation quality.
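As a concrete illustration of the core idea, the sketch below expands a handful of multi-parallel samples into fine-tuning pairs for every non-English direction. The data layout, language codes, and sentences are illustrative placeholders, not the paper's actual EC30 pipeline.

```python
from itertools import permutations

# Toy multi-parallel sample: each entry aligns the same sentence across languages.
# Language codes and sentences are illustrative placeholders.
multi_parallel = [
    {"en": "The weather is nice today.",
     "de": "Das Wetter ist heute schön.",
     "fr": "Il fait beau aujourd'hui.",
     "nl": "Het weer is vandaag mooi."},
    # ... in the paper's setting, roughly 100 such samples suffice.
]

def expand_to_direction_pairs(samples, exclude_center="en"):
    """Turn multi-parallel samples into (src_lang, tgt_lang, src, tgt) pairs
    for every ordered non-English direction."""
    pairs = []
    for sample in samples:
        langs = [lang for lang in sample if lang != exclude_center]
        for src_lang, tgt_lang in permutations(langs, 2):
            pairs.append((src_lang, tgt_lang, sample[src_lang], sample[tgt_lang]))
    return pairs

pairs = expand_to_direction_pairs(multi_parallel)
print(f"{len(pairs)} fine-tuning pairs from {len(multi_parallel)} multi-parallel samples")
```

With many languages aligned on the same sentences, even a tiny multi-parallel set covers a large number of non-English directions, which is why so few samples go so far.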
The Fine-Tuning Paradox: Boosting Translation Quality Without Sacrificing LLM Abilities
David Stap | Eva Hasler | Bill Byrne | Christof Monz | Ke Tran
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Fine-tuning large language models (LLMs) for machine translation has shown improvements in overall translation quality. However, it is unclear what impact fine-tuning has on desirable LLM behaviors that are not present in neural machine translation models, such as steerability, inherent document-level translation abilities, and the ability to produce less literal translations. We perform an extensive translation evaluation on the LLaMA and Falcon families of models, with model sizes ranging from 7 billion up to 65 billion parameters. Our results show that while fine-tuning improves the general translation quality of LLMs, several abilities degrade. In particular, we observe a decline in the ability to perform formality steering, to produce technical translations through few-shot examples, and to perform document-level translation. On the other hand, we observe that the model produces less literal translations after fine-tuning on parallel data. We show that by including monolingual data as part of the fine-tuning data we can maintain these abilities while simultaneously enhancing overall translation quality. Our findings emphasize the need for fine-tuning strategies that preserve the benefits of LLMs for machine translation.
2023
ChatGPT is not a good indigenous translator
David Stap | Ali Araabi
Proceedings of the Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP)
This report investigates the persistent challenges of Machine Translation (MT) systems on indigenous and extremely low-resource language pairs. Despite the notable achievements of Large Language Models (LLMs) that excel in various tasks, their applicability to low-resource languages remains questionable. In this study, we leveraged the AmericasNLP competition to evaluate the translation performance of different systems for Spanish to 11 indigenous languages from South America. Our team, LTLAmsterdam, submitted a total of four systems: GPT-4, a bilingual model, fine-tuned M2M100, and a combination of fine-tuned M2M100 with kNN-MT. We found that even large language models like GPT-4 are not well suited for extremely low-resource languages. Our results suggest that fine-tuning M2M100 models can offer significantly better performance for extremely low-resource translation.
Viewing Knowledge Transfer in Multilingual Machine Translation Through a Representational Lens
David Stap | Vlad Niculae | Christof Monz
Findings of the Association for Computational Linguistics: EMNLP 2023
We argue that translation quality alone is not a sufficient metric for measuring knowledge transfer in multilingual neural machine translation. To support this claim, we introduce Representational Transfer Potential (RTP), which measures representational similarities between languages. We show that RTP can measure both positive and negative transfer (interference), and find that RTP is strongly correlated with changes in translation quality, indicating that transfer does occur. Furthermore, we investigate data and language characteristics that are relevant for transfer, and find that multi-parallel overlap is an important yet under-explored feature. Based on this, we develop a novel training scheme, which uses an auxiliary similarity loss that encourages representations to be more invariant across languages by taking advantage of multi-parallel data. We show that our method yields increased translation quality for low- and mid-resource languages across multiple data and model setups.
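As a rough illustration of comparing representations across languages, the sketch below mean-pools encoder states for two source-language versions of the same sentence and measures their cosine similarity. This is a simplified stand-in, not the paper's exact RTP formulation, and the random arrays stand in for real encoder outputs.

```python
import numpy as np

def mean_pool(hidden_states, mask):
    """Average token-level encoder states over non-padding positions.
    hidden_states: (seq_len, dim); mask: (seq_len,) of 0/1."""
    mask = mask[:, None]
    return (hidden_states * mask).sum(axis=0) / mask.sum()

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Dummy encoder outputs for the same (multi-parallel) sentence in two source languages.
rng = np.random.default_rng(0)
h_lang_a, mask_a = rng.normal(size=(12, 512)), np.ones(12)
h_lang_b, mask_b = rng.normal(size=(15, 512)), np.ones(15)

similarity = cosine(mean_pool(h_lang_a, mask_a), mean_pool(h_lang_b, mask_b))
print(f"cross-lingual representational similarity: {similarity:.3f}")
```

Averaging such similarities over a multi-parallel evaluation set gives a language-pair-level score that can then be correlated with changes in translation quality.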
Improving Domain Robustness in Neural Machine Translation with Fused Topic Knowledge Embeddings
Danai Xezonaki | Talaat Khalil | David Stap | Brandon Denis
Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track
Domain robustness is a key challenge for Neural Machine Translation (NMT): translating text from a distribution different from the training set requires NMT models to generalize well to unseen domains. In this work we propose a novel way to address domain robustness by fusing external topic knowledge into the NMT architecture. We employ a pretrained denoising autoencoder and fuse topic information into the system during continued pretraining and during fine-tuning of the model on the downstream NMT task. Our results show that incorporating external topic knowledge, as well as additional pretraining, can improve the out-of-domain performance of NMT models, and the proposed methodology matches state-of-the-art out-of-domain performance. Our analysis shows that a low overlap between the pretraining and fine-tuning corpora, as well as the quality of the topic representations, helps the NMT systems become more robust under domain shift.
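One simple way to realize such fusion is to project a document-level topic vector and add it to every token embedding. The sketch below shows this additive variant as an assumption for illustration; the paper's exact fusion mechanism and dimensions may differ.

```python
import torch
import torch.nn as nn

class TopicFusedEmbedding(nn.Module):
    """Add a projected document-level topic vector to every token embedding.
    A simple additive-fusion sketch; vocabulary and dimensions are illustrative."""
    def __init__(self, vocab_size=32000, d_model=512, topic_dim=100):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.topic_proj = nn.Linear(topic_dim, d_model)

    def forward(self, token_ids, topic_vec):
        # token_ids: (batch, seq_len); topic_vec: (batch, topic_dim)
        tok_emb = self.tok(token_ids)                         # (batch, seq_len, d_model)
        topic_emb = self.topic_proj(topic_vec).unsqueeze(1)   # (batch, 1, d_model)
        return tok_emb + topic_emb                            # broadcast over seq_len

emb = TopicFusedEmbedding()
out = emb(torch.randint(0, 32000, (2, 10)), torch.rand(2, 100))
print(out.shape)  # torch.Size([2, 10, 512])
```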
UvA-MT’s Participation in the WMT 2023 General Translation Shared Task
Di Wu | Shaomu Tan | David Stap | Ali Araabi | Christof Monz
Proceedings of the Eighth Conference on Machine Translation
This paper describes the UvA-MT submission to the WMT 2023 shared task on general machine translation. We participate in the constrained track in two directions: English ↔ Hebrew. In this competition, we show that by using one model to handle both translation directions, as a minimal setting of Multilingual Machine Translation (MMT), it is possible to achieve results comparable to those of traditional bilingual translation in both directions. By including effective strategies such as back-translation, a re-parameterized embedding table, and task-oriented fine-tuning, we obtained competitive final results in the automatic evaluation for both the English → Hebrew and Hebrew → English directions.
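A minimal way to train one model on both directions is to merge the two parallel corpora and prepend a target-language tag to the source side, a common MMT recipe. The sketch below illustrates this; the tag format and example sentences are assumptions, not the submission's actual preprocessing.

```python
def tag_bidirectional(pairs_en_he, pairs_he_en):
    """Merge both translation directions into one training set for a single model
    by prepending a target-language tag to the source side (illustrative tag format)."""
    examples = []
    for src, tgt in pairs_en_he:
        examples.append((f"<2he> {src}", tgt))
    for src, tgt in pairs_he_en:
        examples.append((f"<2en> {src}", tgt))
    return examples

data = tag_bidirectional(
    [("Good morning.", "בוקר טוב.")],
    [("בוקר טוב.", "Good morning.")],
)
for src, tgt in data:
    print(src, "->", tgt)
```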
Multilingual k-Nearest-Neighbor Machine Translation
David Stap | Christof Monz
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
k-nearest-neighbor machine translation has demonstrated remarkable improvements in machine translation quality by creating a datastore of cached examples. However, these improvements have been limited to high-resource language pairs, with large datastores, and remain a challenge for low-resource languages. In this paper, we address this issue by combining representations from multiple languages into a single datastore. Our results consistently demonstrate substantial improvements not only in low-resource translation quality (up to +3.6 BLEU), but also for high-resource translation quality (up to +0.5 BLEU). Our experiments show that it is possible to create multilingual datastores that are a quarter of the size, achieving a 5.3x speed improvement, by using linguistic similarities for datastore creation.
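For readers unfamiliar with kNN-MT, the sketch below shows the basic retrieve-and-interpolate step over a toy datastore of cached (decoder state, target token) pairs; in the multilingual setting, such a datastore pools entries from several languages. The temperature, interpolation weight, and random data are illustrative, and real systems typically use an approximate index such as FAISS rather than brute-force search.

```python
import numpy as np

def knn_distribution(query, keys, values, vocab_size, k=4, temperature=10.0):
    """Retrieve the k nearest cached decoder states and turn their target tokens
    into a vocabulary distribution (softmax over negative squared distances)."""
    dists = ((keys - query) ** 2).sum(axis=1)
    nearest = np.argsort(dists)[:k]
    weights = np.exp(-dists[nearest] / temperature)
    weights /= weights.sum()
    p_knn = np.zeros(vocab_size)
    for idx, w in zip(nearest, weights):
        p_knn[values[idx]] += w
    return p_knn

# Toy multilingual datastore: keys are decoder states cached from several
# languages, values are the target tokens that followed them.
rng = np.random.default_rng(0)
vocab_size, dim = 100, 16
keys = rng.normal(size=(500, dim))
values = rng.integers(0, vocab_size, size=500)

query = rng.normal(size=dim)
p_model = np.full(vocab_size, 1.0 / vocab_size)  # placeholder model distribution
lam = 0.5                                        # interpolation weight
p_final = lam * knn_distribution(query, keys, values, vocab_size) + (1 - lam) * p_model
print(p_final.argmax(), round(p_final.sum(), 3))
```

Sharing one datastore across related languages increases coverage for low-resource pairs while keeping the index smaller than per-language datastores, which is where the reported size and speed gains come from.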
2022
Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks
Yizhong Wang | Swaroop Mishra | Pegah Alipoormolabashi | Yeganeh Kordi | Amirreza Mirzaei | Atharva Naik | Arjun Ashok | Arut Selvan Dhanasekaran | Anjana Arunkumar | David Stap | Eshaan Pathak | Giannis Karamanolakis | Haizhi Lai | Ishan Purohit | Ishani Mondal | Jacob Anderson | Kirby Kuznia | Krima Doshi | Kuntal Kumar Pal | Maitreya Patel | Mehrad Moradshahi | Mihir Parmar | Mirali Purohit | Neeraj Varshney | Phani Rohitha Kaza | Pulkit Verma | Ravsehaj Singh Puri | Rushang Karia | Savan Doshi | Shailaja Keyur Sampat | Siddhartha Mishra | Sujan Reddy A | Sumanta Patro | Tanay Dixit | Xudong Shen
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
How well can NLP models generalize to a variety of unseen tasks when provided with task instructions? To address this question, we first introduce Super-NaturalInstructions, a benchmark of 1,616 diverse NLP tasks and their expert-written instructions. Our collection covers 76 distinct task types, including but not limited to classification, extraction, infilling, sequence tagging, text rewriting, and text composition. This large and diverse collection of tasks enables rigorous benchmarking of cross-task generalization under instructions: training models to follow instructions on a subset of tasks and evaluating them on the remaining unseen ones. Furthermore, we build Tk-Instruct, a transformer model trained to follow a variety of in-context instructions (plain language task definitions or k-shot examples). Our experiments show that Tk-Instruct outperforms existing instruction-following models such as InstructGPT by over 9% on our benchmark despite being an order of magnitude smaller. We further analyze generalization as a function of various scaling parameters, such as the number of observed tasks, the number of instances per task, and model sizes. We hope our dataset and model facilitate future progress towards more general-purpose NLP models.
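As an illustration of the in-context format described above (a plain-language definition plus k-shot demonstrations), the sketch below assembles a Tk-Instruct-style prompt. The template and field names are assumptions for illustration and may not match the benchmark's exact task files.

```python
def build_prompt(definition, examples, instance_input, k=2):
    """Assemble an instruction-following prompt: a plain-language task definition,
    k in-context demonstrations, and the instance to solve. Illustrative template."""
    parts = [f"Definition: {definition}"]
    for ex in examples[:k]:
        parts.append(f"Input: {ex['input']}\nOutput: {ex['output']}")
    parts.append(f"Input: {instance_input}\nOutput:")
    return "\n\n".join(parts)

prompt = build_prompt(
    definition="Classify the sentiment of the given review as positive or negative.",
    examples=[{"input": "Great value for the price.", "output": "positive"},
              {"input": "Stopped working after a week.", "output": "negative"}],
    instance_input="The screen is bright and the battery lasts all day.",
)
print(prompt)
```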