2024
pdf
bib
abs
MuTox: Universal MUltilingual Audio-based TOXicity Dataset and Zero-shot Detector
Marta Costa-jussà
|
Mariano Meglioli
|
Pierre Andrews
|
David Dale
|
Prangthip Hansanti
|
Elahe Kalbassi
|
Alexandre Mourachko
|
Christophe Ropers
|
Carleigh Wood
Findings of the Association for Computational Linguistics ACL 2024
Research in toxicity detection in natural language processing for the speech modality (audio-based) is quite limited, particularly for languages other than English. To address these limitations and lay the groundwork for truly multilingual audio-based toxicity detection, we introduce MuTox, the first highly multilingual audio-based dataset with toxicity labels which covers 14 different linguistic families. The dataset comprises 20,000 audio utterances for English and Spanish, and 4,000 for the other 28 languages. To demonstrate the quality of this dataset, we trained the MuTox audio-based toxicity classifier, which enables zero-shot toxicity detection across a wide range of languages. This classifier performs on par with existing text-based trainable classifiers, while expanding the language coverage more than tenfold. When compared to a wordlist-based classifier that covers a similar number of languages, MuTox improves F1-Score by an average of 100%. This significant improvement underscores the potential of MuTox in advancing the field of audio-based toxicity detection.
pdf
bib
abs
Pushing the Limits of Zero-shot End-to-End Speech Translation
Ioannis Tsiamas
|
Gerard I. Gállego
|
José Fonollosa
|
Marta Costa-jussà
Findings of the Association for Computational Linguistics ACL 2024
Data scarcity and the modality gap between the speech and text modalities are two major obstacles for end-to-end Speech Translation (ST) systems, hindering their performance. Prior work has attempted to mitigate these challenges by leveraging external MT data and optimizing distance metrics that bring the speech and text representations closer together. However, achieving competitive results typically requires some ST data. For this reason, we introduce ZeroSwot, a method for zero-shot ST that bridges the modality gap without any paired ST data. Leveraging a novel CTC compression and Optimal Transport, we train a speech encoder using only ASR data to align with the representation space of a massively multilingual MT model. The speech encoder seamlessly integrates with the MT model at inference, enabling direct translation from speech to text across all languages supported by the MT model. Our experiments show that we can effectively close the modality gap without ST data, while our results on MuST-C and CoVoST demonstrate our method’s superiority over not only previous zero-shot models, but also supervised ones, achieving state-of-the-art results.
pdf
bib
Proceedings of the 5th Workshop on Gender Bias in Natural Language Processing (GeBNLP)
Agnieszka Faleńska
|
Christine Basta
|
Marta Costa-jussà
|
Seraphina Goldfarb-Tarrant
|
Debora Nozza
Proceedings of the 5th Workshop on Gender Bias in Natural Language Processing (GeBNLP)
pdf
bib
abs
Overview of the Shared Task on Machine Translation Gender Bias Evaluation with Multilingual Holistic Bias
Marta Costa-jussà
|
Pierre Andrews
|
Christine Basta
|
Juan Ciro
|
Agnieszka Faleńska
|
Seraphina Goldfarb-Tarrant
|
Rafael Mosquera
|
Debora Nozza
|
Eduardo Sánchez
Proceedings of the 5th Workshop on Gender Bias in Natural Language Processing (GeBNLP)
We describe the details of the Shared Task of the 5th ACL Workshop on Gender Bias in Natural Language Processing (GeBNLP 2024). The task uses the Multilingual Holistic Bias dataset to investigate the quality of Machine Translation systems on a particular case of gender robustness. We report baseline results as well as the results of the first participants. The shared task will be permanently available on the Dynabench platform.
pdf
bib
abs
ReSeTOX: Re-learning attention weights for toxicity mitigation in machine translation
Javier García Gilabert
|
Carlos Escolano
|
Marta Costa-jussà
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1)
Our proposed method, RESETOX (REdo SEarch if TOXic), addresses the issue of Neural Machine Translation (NMT) generating translation outputs that contain toxic words not present in the input. The objective is to mitigate the introduction of toxic language without the need for re-training. In the case of identified added toxicity during the inference process, RESETOX dynamically adjusts the key-value self-attention weights and re-evaluates the beam search hypotheses. Experimental results demonstrate that RESETOX achieves a remarkable 57% reduction in added toxicity while maintaining an average translation quality of 99.5% across 164 languages. Our code is available at: https://github.com
pdf
bib
abs
Added Toxicity Mitigation at Inference Time for Multimodal and Massively Multilingual Translation
Marta Costa-jussà
|
David Dale
|
Maha Elbayad
|
Bokai Yu
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1)
Machine translation models sometimes lead to added toxicity: translated outputs may contain more toxic content than the original input. In this paper, we introduce MinTox, a novel pipeline to automatically identify and mitigate added toxicity at inference time, without further model training. MinTox leverages a multimodal (speech and text) toxicity classifier that can scale across languages. We demonstrate the capabilities of MinTox when applied to SEAMLESSM4T, a multimodal and massively multilingual machine translation system. MinTox significantly reduces added toxicity: across all domains, modalities and language directions, 25% to 95% of added toxicity is successfully filtered out, while preserving translation quality.
2023
pdf
bib
abs
SegAugment: Maximizing the Utility of Speech Translation Data with Segmentation-based Augmentations
Ioannis Tsiamas
|
José Fonollosa
|
Marta Costa-jussà
Findings of the Association for Computational Linguistics: EMNLP 2023
End-to-end Speech Translation is hindered by a lack of available data resources. While most of these resources are document-based, each provides only a single, static sentence-level version, which potentially limits the usefulness of the data. We propose a new data augmentation strategy, SegAugment, to address this issue by generating multiple alternative sentence-level versions of a dataset. Our method utilizes an Audio Segmentation system, which re-segments the speech of each document with different length constraints, after which we obtain the target text via alignment methods. Experiments demonstrate consistent gains across eight language pairs in MuST-C, with an average increase of 2.5 BLEU points, and up to 5 BLEU for low-resource scenarios in mTEDx. Furthermore, when combined with a strong system, SegAugment obtains state-of-the-art results in MuST-C. Finally, we show that the proposed method can also successfully augment sentence-level datasets, and that it enables Speech Translation models to close the gap between manual and automatic segmentation at inference time.
pdf
bib
abs
Toxicity in Multilingual Machine Translation at Scale
Marta Costa-jussà
|
Eric Smith
|
Christophe Ropers
|
Daniel Licht
|
Jean Maillard
|
Javier Ferrando
|
Carlos Escolano
Findings of the Association for Computational Linguistics: EMNLP 2023
Machine Translation systems can produce different types of errors, some of which are characterized as critical or catastrophic due to the specific negative impact that they can have on users. In this paper we focus on one type of critical error: added toxicity. We evaluate and analyze added toxicity when translating a large evaluation dataset (HOLISTICBIAS, over 472k sentences, covering 13 demographic axes) from English into 164 languages. An automatic toxicity evaluation shows that added toxicity across languages varies from 0% to 5%. The output languages with the most added toxicity tend to be low-resource ones, and the demographic axes with the most added toxicity include sexual orientation, gender and sex, and ability. We also perform human evaluation on a subset of 8 translation directions, confirming the prevalence of true added toxicity. We use a measurement of the amount of source contribution to the translation, where a low source contribution implies hallucination, to interpret what causes toxicity. Making use of the input attributions allows us to explain toxicity, because the source contributions significantly correlate with toxicity for 84% of languages studied. Given our findings, our recommendations to reduce added toxicity are to curate training data to avoid mistranslations, mitigate hallucination and check unstable translations.
pdf
bib
abs
HalOmi: A Manually Annotated Benchmark for Multilingual Hallucination and Omission Detection in Machine Translation
David Dale
|
Elena Voita
|
Janice Lam
|
Prangthip Hansanti
|
Christophe Ropers
|
Elahe Kalbassi
|
Cynthia Gao
|
Loic Barrault
|
Marta Costa-jussà
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Hallucinations in machine translation are translations that contain information completely unrelated to the input. Omissions are translations that do not include some of the input information. While both cases tend to be catastrophic errors undermining user trust, annotated data with these types of pathologies is extremely scarce and is limited to a few high-resource languages. In this work, we release an annotated dataset for the hallucination and omission phenomena covering 18 translation directions with varying resource levels and scripts. Our annotation covers different levels of partial and full hallucinations as well as omissions both at the sentence and at the word level. Additionally, we revisit previous methods for hallucination and omission detection, show that conclusions made based on a single language pair largely do not hold for a large-scale evaluation, and establish new solid baselines.
pdf
bib
abs
Multilingual Holistic Bias: Extending Descriptors and Patterns to Unveil Demographic Biases in Languages at Scale
Marta Costa-jussà
|
Pierre Andrews
|
Eric Smith
|
Prangthip Hansanti
|
Christophe Ropers
|
Elahe Kalbassi
|
Cynthia Gao
|
Daniel Licht
|
Carleigh Wood
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
We introduce a multilingual extension of the HolisticBias dataset, the largest English template-based taxonomy of textual people references: Multilingual HolisticBias. This extension consists of 20,459 sentences in 50 languages distributed across 13 demographic axes. Source sentences are built from combinations of 118 demographic descriptors and three patterns, excluding nonsensical combinations. Multilingual translations include alternatives for gendered languages that cover gendered translations when there is ambiguity in English. Our dataset is intended to uncover demographic imbalances and to serve as a tool for quantifying mitigations of them. Our initial findings show that translation quality for EN-to-XX translations is an average of almost 8 spBLEU better when evaluating with the masculine human reference compared to the feminine one. In the opposite direction, XX-to-EN, we compare the robustness of the model when the source input only differs in gender (masculine or feminine), and masculine translations are an average of almost 4 spBLEU better than feminine ones. When embedding sentences in a joint multilingual sentence representation space, we find that for most languages masculine translations are significantly closer to the English neutral sentences.
2022
pdf
bib
Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022)
Elizabeth Salesky
|
Marcello Federico
|
Marta Costa-jussà
Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022)
2021
pdf
bib
Proceedings of the 3rd Workshop on Gender Bias in Natural Language Processing
Marta Costa-jussà
|
Hila Gonen
|
Christian Hardmeier
|
Kellie Webster
Proceedings of the 3rd Workshop on Gender Bias in Natural Language Processing