Nora Aranberri


2024

COMET for Low-Resource Machine Translation Evaluation: A Case Study of English-Maltese and Spanish-Basque
Júlia Falcão | Claudia Borg | Nora Aranberri | Kurt Abela
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Trainable metrics for machine translation evaluation have been scoring the highest correlations with human judgements in the latest meta-evaluations, outperforming traditional lexical overlap metrics such as BLEU, which is still widely used despite its well-known shortcomings. In this work we look at COMET, a prominent neural evaluation system proposed in 2020, to analyze the extent of its language support restrictions, and to investigate strategies to extend this support to new, under-resourced languages. Our case study focuses on English-Maltese and Spanish-Basque. We run a crowd-based evaluation campaign to collect direct assessments and use the annotated dataset to evaluate COMET-22, to fine-tune it further, and to train COMET models from scratch for the two language pairs. Our analysis suggests that COMET’s performance can be improved with fine-tuning, and that COMET can be highly susceptible to the distribution of scores in the training data, which especially impacts low-resource scenarios.

2023

Proceedings of the 24th Annual Conference of the European Association for Machine Translation
Mary Nurminen | Judith Brenner | Maarit Koponen | Sirkku Latomaa | Mikhail Mikhailov | Frederike Schierl | Tharindu Ranasinghe | Eva Vanmassenhove | Sergi Alvarez Vidal | Nora Aranberri | Mara Nunziatini | Carla Parra Escartín | Mikel Forcada | Maja Popovic | Carolina Scarton | Helena Moniz
Proceedings of the 24th Annual Conference of the European Association for Machine Translation

2022

TANDO: A Corpus for Document-level Machine Translation
Harritxu Gete | Thierry Etchegoyhen | David Ponce | Gorka Labaka | Nora Aranberri | Ander Corral | Xabier Saralegi | Igor Ellakuria | Maite Martin
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Document-level Neural Machine Translation aims to increase the quality of neural translation models by taking into account contextual information. Properly modelling information beyond the sentence level can result in improved machine translation output in terms of coherence, cohesion and consistency. Suitable corpora for context-level modelling are necessary to both train and evaluate context-aware systems, but are still relatively scarce. In this work we describe TANDO, a document-level corpus for the under-resourced Basque-Spanish language pair, which we share with the scientific community. The corpus is composed of parallel data from three different domains and has been prepared with context-level information. Additionally, the corpus includes contrastive test sets for fine-grained evaluations of gender and register contextual phenomena on both source and target language sides. To establish the usefulness of the corpus, we trained and evaluated baseline Transformer models and context-aware variants based on context concatenation. Our results indicate that the corpus is suitable for fine-grained evaluation of document-level machine translation systems.

2020

With or without you? Effects of using machine translation to write flash fiction in the foreign language
Nora Aranberri
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

The improvement in the quality of machine translation (MT) for both majority and minority languages in recent years is resulting in its steady adoption. This is not only happening among professional translators but also among users who occasionally find themselves in situations where translation is required or MT presents itself as an easier means of producing a text. This work sets out to explore the effect that using MT has on flash fiction produced in the foreign language. Specifically, we study the impact on surface closeness, syntactic and lexical complexity, and edits. Results show that texts produced with MT seem to fit closer to certain traits of the foreign language and that differences in the use of part-of-speech categories and structures emerge. Moreover, the analysis of the post-edited texts reveals that participants approach the editing of the MT output differently, displaying a wide range in the number of edits.

2019

Comparison of temporal, technical and cognitive dimension measurements for post-editing effort
Cristina Cumbreno | Nora Aranberri
Proceedings of the Second MEMENTO workshop on Modelling Parameters of Cognitive Effort in Translation Production

2018

Building Named Entity Recognition Taggers via Parallel Corpora
Rodrigo Agerri | Yiling Chung | Itziar Aldabe | Nora Aranberri | Gorka Labaka | German Rigau
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Towards a post-editing recommendation system for Spanish-Basque machine translation
Nora Aranberri | Jose A. Pascual
Proceedings of the 21st Annual Conference of the European Association for Machine Translation

The overall machine translation quality available for professional translators working with the Spanish–Basque pair is rather poor, which is a deterrent for its adoption. This work investigates the plausibility of building a comprehensive recommendation system to speed up the decision between post-editing and translating from scratch using the very limited training data available. First, we build a set of regression models that predict the post-editing effort in terms of overall quality, time and edits. Second, we build classification models that recommend the most efficient editing approach using post-editing effort features on top of linguistic features. Results show high correlations between the predictions of the regression models and the expected HTER, time and edit number values. Similarly, the results for the classifiers show that they are able to predict with high accuracy whether it is more efficient to translate or to post-edit a new segment.

2016

Proceedings of the 2nd Workshop on Semantics-Driven Machine Translation (SedMT 2016)
Deyi Xiong | Kevin Duh | Eneko Agirre | Nora Aranberri | Houfeng Wang
Proceedings of the 2nd Workshop on Semantics-Driven Machine Translation (SedMT 2016)

Tools and Guidelines for Principled Machine Translation Development
Nora Aranberri | Eleftherios Avramidis | Aljoscha Burchardt | Ondřej Klejch | Martin Popel | Maja Popović
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This work addresses the need to aid Machine Translation (MT) development cycles with a complete workflow of MT evaluation methods. Our aim is to assess, compare and improve MT system variants. We report on novel tools and practices that support various measures, developed to enable a principled and informed approach to MT development. Our toolkit for automatic evaluation showcases quick and detailed comparison of MT system variants through automatic metrics and n-gram feedback, along with manual evaluation via edit-distance, error annotation and task-based feedback.

TweetMT: A Parallel Microblog Corpus
Iñaki San Vicente | Iñaki Alegría | Cristina España-Bonet | Pablo Gamallo | Hugo Gonçalo Oliveira | Eva Martínez Garcia | Antonio Toral | Arkaitz Zubiaga | Nora Aranberri
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We introduce TweetMT, a parallel corpus of tweets in four language pairs that combine five languages (Spanish from/to Basque, Catalan, Galician and Portuguese), all of which have an official status in the Iberian Peninsula. The corpus has been created by combining automatic collection and crowdsourcing approaches, and it is publicly available. It is intended for the development and testing of microtext machine translation systems. In this paper we describe the methodology followed to build the corpus, and present the results of the shared task in which it was tested.

QTLeap WSD/NED Corpora: Semantic Annotation of Parallel Corpora in Six Languages
Arantxa Otegi | Nora Aranberri | Antonio Branco | Jan Hajič | Martin Popel | Kiril Simov | Eneko Agirre | Petya Osenova | Rita Pereira | João Silva | Steven Neale
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This work presents parallel corpora automatically annotated with several NLP tools, including lemma and part-of-speech tagging, named-entity recognition and classification, named-entity disambiguation, word-sense disambiguation, and coreference. The corpora comprise both the well-known Europarl corpus and a domain-specific question-answer troubleshooting corpus in the IT domain. English is common to all parallel corpora, with translations in five languages, namely Basque, Bulgarian, Czech, Portuguese and Spanish. We describe the annotated corpora and the tools used for annotation, as well as annotation statistics for each language. These new resources are freely available and will help research on semantic processing for machine translation and cross-lingual transfer.

2015

Proceedings of the Ninth Workshop on Syntax, Semantics and Structure in Statistical Translation
Dekai Wu | Marine Carpuat | Eneko Agirre | Nora Aranberri
Proceedings of the Ninth Workshop on Syntax, Semantics and Structure in Statistical Translation

SMT error analysis and mapping to syntactic, semantic and structural fixes
Nora Aranberri
Proceedings of the Ninth Workshop on Syntax, Semantics and Structure in Statistical Translation

Exploiting portability to build an RBMT prototype for a new source language
Nora Aranberri | Gorka Labaka | Arantza Díaz de Ilarraza | Kepa Sarasola
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

Deep-syntax TectoMT for English-Spanish MT
Gorka Labaka | Oneka Jauregi | Arantza Díaz de Ilarraza | Michael Ustaszewski | Nora Aranberri | Eneko Agirre
Proceedings of the 1st Deep Machine Translation Workshop

2014

Comparison of post-editing productivity between professional translators and lay users
Nora Aranberri | Gorka Labaka | Arantza Diaz de Ilarraza | Kepa Sarasola
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas

This work compares the post-editing productivity of professional translators and lay users. We integrate an English to Basque MT system within Bologna Translation Service, an end-to-end translation management platform, and perform a productivity experiment in a real working environment. Six translators and six lay users translate or post-edit two texts from English into Basque. Results suggest that overall, post-editing increases translation throughput for both translators and users, although the latter seem to benefit more from the MT output. We observe that translators and users perceive MT differently. Additionally, a preliminary analysis seems to suggest that familiarity with the domain, source text complexity and MT quality might affect potential productivity gain.

TweetNorm_es: an annotated corpus for Spanish microtext normalization
Iñaki Alegria | Nora Aranberri | Pere Comas | Víctor Fresno | Pablo Gamallo | Lluis Padró | Iñaki San Vicente | Jordi Turmo | Arkaitz Zubiaga
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper we introduce TweetNorm_es, an annotated corpus of tweets in Spanish, which we make publicly available under the terms of the CC-BY license. This corpus is intended for the development and testing of microtext normalization systems. It was created for Tweet-Norm, a tweet normalization workshop and shared task, and is the result of a joint annotation effort by different research groups. We describe the methodology defined to build the corpus as well as the guidelines followed in the annotation process. We also present a brief overview of the Tweet-Norm shared task, the first evaluation environment where the corpus was used.