Thierry Etchegoyhen


2023

What Works When in Context-aware Neural Machine Translation?
Harritxu Gete | Thierry Etchegoyhen | Gorka Labaka
Proceedings of the 24th Annual Conference of the European Association for Machine Translation

Document-level Machine Translation has emerged as a promising means to enhance automated translation quality, but it is currently unclear how effectively context-aware models use the available context during translation. This paper aims to provide insight into the current state of models based on input concatenation, with an in-depth evaluation on English–German and English–French standard datasets. We notably evaluate the impact of data bias, antecedent part-of-speech, context complexity, and the syntactic function of the elements involved in discursive phenomena. Our experimental results indicate that the selected models do improve the overall translation in context, with varying sensitivity to the different factors we examined. We notably show that the selected context-aware models operate markedly better on regular syntactic configurations involving subject antecedents and pronouns, with degraded performance as the configurations become more dissimilar.

Learning from Past Mistakes: Quality Estimation from Monolingual Corpora and Machine Translation Learning Stages
Thierry Etchegoyhen | David Ponce
Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track

Quality Estimation (QE) of Machine Translation output suffers from the lack of annotated data to train supervised models across domains and language pairs. In this work, we describe a method to generate synthetic QE data based on Neural Machine Translation (NMT) models at different learning stages. Our approach consists in training QE models on the errors produced by different NMT model checkpoints, obtained during the course of model training, under the assumption that gradual learning will induce errors that more closely resemble those produced by NMT models in adverse conditions. We test this approach on English-German and Romanian-English WMT QE test sets, demonstrating that pairing translations from earlier checkpoints with translations of converged models outperforms the use of reference human translations and can achieve competitive results against human-labelled data. We also show that combining post-edited data with our synthetic data yields significant improvements across the board. Our approach thus opens new possibilities for an efficient use of monolingual corpora to generate quality synthetic QE data, thereby mitigating the data bottleneck.

Targeted Data Augmentation Improves Context-aware Neural Machine Translation
Harritxu Gete | Thierry Etchegoyhen | Gorka Labaka
Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track

Progress in document-level Machine Translation is hindered by the lack of parallel training data that include context information. In this work, we evaluate the potential of data augmentation techniques to circumvent these limitations, showing that significant gains can be achieved via upsampling, similar context sampling and back-translations, targeted on context-relevant data. We apply these methods on standard document-level datasets in English-German and English-French and demonstrate their relevance to improve the translation of contextual phenomena. In particular, we show that relatively small volumes of targeted data augmentation lead to significant improvements over a strong context-concatenation baseline and standard back-translation of document-level data. We also compare the accuracy of the selected methods depending on data volumes or distance to relevant context information, and explore their use in combination.

An Evaluation of Source Factors in Concatenation-Based Context-Aware Neural Machine Translation
Harritxu Gete | Thierry Etchegoyhen
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

We explore the use of source factors in context-aware neural machine translation, specifically concatenation-based models, to improve the translation quality of inter-sentential phenomena. Context sentences are typically concatenated to the sentence to be translated, with string-based markers to separate the latter from the former. Although previous studies have measured the impact of prefixes to identify and mark context information, the use of learnable factors has only been marginally explored. In this study, we evaluate the impact of single and multiple source context factors in English-German and Basque-Spanish contextual translation. We show that this type of factor can significantly enhance translation accuracy for phenomena such as gender and register coherence in Basque-Spanish, while also improving BLEU results in some scenarios. These results demonstrate the potential of factor-based context identification to improve context-aware machine translation in future research.

Unsupervised Subtitle Segmentation with Masked Language Models
David Ponce | Thierry Etchegoyhen | Victor Ruiz
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

We describe a novel unsupervised approach to subtitle segmentation, based on pretrained masked language models, where line endings and subtitle breaks are predicted according to the likelihood of punctuation to occur at candidate segmentation points. Our approach obtained competitive results in terms of segmentation accuracy across metrics, while also fully preserving the original text and complying with length constraints. Although supervised models trained on in-domain data and with access to source audio information can provide better segmentation accuracy, our approach is highly portable across languages and domains and may constitute a robust off-the-shelf solution for subtitle segmentation.

2022

TANDO: A Corpus for Document-level Machine Translation
Harritxu Gete | Thierry Etchegoyhen | David Ponce | Gorka Labaka | Nora Aranberri | Ander Corral | Xabier Saralegi | Igor Ellakuria | Maite Martin
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Document-level Neural Machine Translation aims to increase the quality of neural translation models by taking into account contextual information. Properly modelling information beyond the sentence level can result in improved machine translation output in terms of coherence, cohesion and consistency. Suitable corpora for context-level modelling are necessary to both train and evaluate context-aware systems, but are still relatively scarce. In this work we describe TANDO, a document-level corpus for the under-resourced Basque-Spanish language pair, which we share with the scientific community. The corpus is composed of parallel data from three different domains and has been prepared with context-level information. Additionally, the corpus includes contrastive test sets for fine-grained evaluations of gender and register contextual phenomena on both source and target language sides. To establish the usefulness of the corpus, we trained and evaluated baseline Transformer models and context-aware variants based on context concatenation. Our results indicate that the corpus is suitable for fine-grained evaluation of document-level machine translation systems.

2021

Online Learning over Time in Adaptive Neural Machine Translation
Thierry Etchegoyhen | David Ponce | Harritxu Gete | Victor Ruiz
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Adaptive Machine Translation purports to dynamically include user feedback to improve translation quality. In a post-editing scenario, user corrections of machine translation output are thus continuously incorporated into translation models, reducing or eliminating repetitive error editing and increasing the usefulness of automated translation. In neural machine translation, this goal may be achieved via online learning approaches, where network parameters are updated based on each new sample. This type of adaptation typically requires higher learning rates, which can affect the quality of the models over time. Alternatively, less aggressive online learning setups may preserve model stability, at the cost of reduced adaptation to user-generated corrections. In this work, we evaluate different online learning configurations over time, measuring their impact on user-generated samples, as well as separate in-domain and out-of-domain datasets. Results in two different domains indicate that mixed approaches combining online learning with periodic batch fine-tuning might be needed to balance the benefits of online learning with model stability.

2020

To Case or not to case: Evaluating Casing Methods for Neural Machine Translation
Thierry Etchegoyhen | Harritxu Gete
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present a comparative evaluation of casing methods for Neural Machine Translation, to help establish an optimal pre- and post-processing methodology. We trained and compared system variants on data prepared with the main casing methods available, namely translation of raw data without case normalisation, lowercasing with recasing, truecasing, case factors and inline casing. Machine translation models were prepared on WMT 2017 English-German and English-Turkish datasets, for all translation directions, and the evaluation includes reference metric results as well as a targeted analysis of case preservation accuracy. Inline casing, where case information is marked along lowercased words in the training data, proved to be the optimal approach overall in these experiments.

Handle with Care: A Case Study in Comparable Corpora Exploitation for Neural Machine Translation
Thierry Etchegoyhen | Harritxu Gete
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present the results of a case study in the exploitation of comparable corpora for Neural Machine Translation. A large comparable corpus for Basque-Spanish was prepared, on the basis of independently-produced news by the Basque public broadcaster EiTB, and we discuss the impact of various techniques to exploit the original data in order to determine optimal variants of the corpus. In particular, we show that filtering in terms of alignment thresholds and length-difference outliers has a significant impact on translation quality. The impact of tags identifying comparable data in the training datasets is also evaluated, with results indicating that this technique might be useful to help the models discriminate noisy information, in the form of informational imbalance between aligned sentences. The final corpus was prepared according to the experimental results and is made available to the scientific community for research purposes.

ELRI: A Decentralised Network of National Relay Stations to Collect, Prepare and Share Language Resources
Thierry Etchegoyhen | Borja Anza Porras | Andoni Azpeitia | Eva Martínez Garcia | José Luis Fonseca | Patricia Fonseca | Paulo Vale | Jane Dunne | Federico Gaspari | Teresa Lynn | Helen McHugh | Andy Way | Victoria Arranz | Khalid Choukri | Hervé Pusset | Alexandre Sicard | Rui Neto | Maite Melero | David Perez | António Branco | Ruben Branco | Luís Gomes
Proceedings of the 1st International Workshop on Language Technology Platforms

We describe the European Language Resource Infrastructure (ELRI), a decentralised network to help collect, prepare and share language resources. The infrastructure was developed within a project co-funded by the Connecting Europe Facility Programme of the European Union, and has been deployed in the four Member States participating in the project, namely France, Ireland, Portugal and Spain. ELRI provides sustainable and flexible means to collect and share language resources via National Relay Stations, to which members of public institutions can freely subscribe. The infrastructure includes fully automated data processing engines to facilitate the preparation, sharing and wider reuse of useful language resources that can help optimise human and automated translation services in the European Union.

The Multilingual Anonymisation Toolkit for Public Administrations (MAPA) Project
Ēriks Ajausks | Victoria Arranz | Laurent Bié | Aleix Cerdà-i-Cucó | Khalid Choukri | Montse Cuadros | Hans Degroote | Amando Estela | Thierry Etchegoyhen | Mercedes García-Martínez | Aitor García-Pablos | Manuel Herranz | Alejandro Kohan | Maite Melero | Mike Rosner | Roberts Rozis | Patrick Paroubek | Artūrs Vasiļevskis | Pierre Zweigenbaum
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

We describe the MAPA project, funded under the Connecting Europe Facility programme, whose goal is the development of an open-source de-identification toolkit for all official European Union languages. It is to be developed from January 2020 until December 2021.

2018

Evaluating Domain Adaptation for Machine Translation Across Scenarios
Thierry Etchegoyhen | Anna Fernández Torné | Andoni Azpeitia | Eva Martínez Garcia | Anna Matamala
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Supervised and Unsupervised Minimalist Quality Estimators: Vicomtech’s Participation in the WMT 2018 Quality Estimation Task
Thierry Etchegoyhen | Eva Martínez Garcia | Andoni Azpeitia
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

We describe Vicomtech’s participation in the WMT 2018 shared task on quality estimation, for which we submitted minimalist quality estimators. The core of our approach is based on two simple features: lexical translation overlaps and language model cross-entropy scores. These features are exploited in two system variants: uMQE is an unsupervised system, where the final quality score is obtained by averaging individual feature scores; sMQE is a supervised variant, where the final score is estimated by a Support Vector Regressor trained on the available annotated datasets. The main goal of our minimalist approach to quality estimation is to provide reliable estimators that require minimal deployment effort, few resources, and, in the case of uMQE, do not depend on costly data annotation or post-editing. Our approach was applied to all language pairs in sentence quality estimation, obtaining competitive results across the board.

STACC, OOV Density and N-gram Saturation: Vicomtech’s Participation in the WMT 2018 Shared Task on Parallel Corpus Filtering
Andoni Azpeitia | Thierry Etchegoyhen | Eva Martínez Garcia
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

We describe Vicomtech’s participation in the WMT 2018 Shared Task on parallel corpus filtering. We aimed to evaluate a simple approach to the task, which can efficiently process large volumes of data and can be easily deployed for new datasets in different language pairs and domains. We based our approach on STACC, an efficient and portable method for parallel sentence identification in comparable corpora. To address the specifics of the corpus filtering task, which features significant volumes of noisy data, the core method was expanded with a penalty based on the number of unknown words in sentence pairs. Additionally, we experimented with a complementary data saturation method based on source sentence n-grams, with the goal of demoting parallel sentence pairs that do not contribute significant amounts of yet unobserved n-grams. Our approach requires no prior training and is highly efficient on the type of large datasets featured in the corpus filtering task. We achieved competitive results with this simple and portable method, ranking in the top half among competing systems overall.

Neural Machine Translation of Basque
Thierry Etchegoyhen | Eva Martínez Garcia | Andoni Azpeitia | Gorka Labaka | Iñaki Alegria | Itziar Cortes Etxabe | Amaia Jauregi Carrera | Igor Ellakuria Santos | Maite Martin | Eusebi Calonge
Proceedings of the 21st Annual Conference of the European Association for Machine Translation

We describe the first experimental results in neural machine translation for Basque. As a synthetic language featuring agglutinative morphology, an extended case system, complex verbal morphology and relatively free word order, Basque presents a large number of challenging characteristics for machine translation in general, and for data-driven approaches such as attention-based encoder-decoder models in particular. We present our results on a large range of experiments in Basque-Spanish translation, comparing several neural machine translation system variants with both rule-based and statistical machine translation systems. We demonstrate that significant gains can be obtained with a neural network approach for this challenging language pair, and describe optimal configurations in terms of word segmentation and decoding parameters, measured against test sets that feature multiple references to account for word order variability.

ELRI - European Language Resources Infrastructure
Thierry Etchegoyhen | Borja Anza Porras | Andoni Azpeitia | Eva Martínez Garcia | Paulo Vale | José Luis Fonseca | Teresa Lynn | Jane Dunne | Federico Gaspari | Andy Way | Victoria Arranz | Khalid Choukri | Vladimir Popescu | Pedro Neiva | Rui Neto | Maite Melero | David Perez Fernandez | Antonio Branco | Ruben Branco | Luis Gomes
Proceedings of the 21st Annual Conference of the European Association for Machine Translation

We describe the European Language Resources Infrastructure project, whose main aim is the provision of an infrastructure to help collect, prepare and share language resources that can in turn improve translation services in Europe.

2017

Weighted Set-Theoretic Alignment of Comparable Sentences
Andoni Azpeitia | Thierry Etchegoyhen | Eva Martínez Garcia
Proceedings of the 10th Workshop on Building and Using Comparable Corpora

This article presents the STACCw system for the BUCC 2017 shared task on parallel sentence extraction from comparable corpora. The original STACC approach, based on set-theoretic operations over bags of words, had been previously shown to be efficient and portable across domains and alignment scenarios. We describe an extension of this approach with a new weighting scheme and show that it provides significant improvements on the datasets provided for the shared task.

Exploiting Relative Frequencies for Data Selection
Thierry Etchegoyhen | Andoni Azpeitia | Eva Martinez García
Proceedings of Machine Translation Summit XVI: Research Track

2016

Exploiting a Large Strongly Comparable Corpus
Thierry Etchegoyhen | Andoni Azpeitia | Naiara Pérez
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This article describes a large comparable corpus for Basque and Spanish and the methods employed to build a parallel resource from the original data. The EITB corpus, a strongly comparable corpus in the news domain, is to be shared with the research community, as an aid for the development and testing of methods in comparable corpora exploitation, and as a basis for the improvement of data-driven machine translation systems for this language pair. Competing approaches were explored for the alignment of comparable segments in the corpus, resulting in the design of a simple method which outperformed a state-of-the-art method on the corpus test sets. The method we present is highly portable, computationally efficient, and significantly reduces deployment work, a welcome result for the exploitation of comparable corpora.

DOCAL - Vicomtech’s Participation in the WMT16 Shared Task on Bilingual Document Alignment
Andoni Azpeitia | Thierry Etchegoyhen
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

A Portable Method for Parallel and Comparable Document Alignment
Thierry Etchegoyhen | Andoni Azpeitia
Proceedings of the 19th Annual Conference of the European Association for Machine Translation

Set-Theoretic Alignment for Comparable Corpora
Thierry Etchegoyhen | Andoni Azpeitia
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2014

Machine Translation for Subtitling: A Large-Scale Evaluation
Thierry Etchegoyhen | Lindsay Bywood | Mark Fishel | Panayota Georgakopoulou | Jie Jiang | Gerard van Loenhout | Arantza del Pozo | Mirjam Sepesy Maučec | Anja Turner | Martin Volk
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This article describes a large-scale evaluation of the use of Statistical Machine Translation for professional subtitling. The work was carried out within the FP7 EU-funded project SUMAT and involved two rounds of evaluation: a quality evaluation and a measure of productivity gain/loss. We present the SMT systems built for the project and the corpora they were trained on, which combine professionally created and crowd-sourced data. Evaluation goals, methodology and results are presented for the eleven translation pairs that were evaluated by professional subtitlers. Overall, a majority of the machine translated subtitles received good quality ratings. The results were also positive in terms of productivity, with a global gain approaching 40%. We also evaluated the impact of applying quality estimation and filtering of poor MT output, which resulted in higher productivity gains for filtered files as opposed to fully machine-translated files. Finally, we present and discuss feedback from the subtitlers who participated in the evaluation, a key aspect for any eventual adoption of machine translation technology in professional subtitling.

2013

SMT Approaches for Commercial Translation of Subtitles
Thierry Etchegoyhen | Mark Fishel | Jie Jiang | Mirjam Sepesy Maucec
Proceedings of Machine Translation Summit XIV: User track

1998

System Demonstration Overview of GBGen
Thierry Etchegoyhen | Thomas Wehrle
Natural Language Generation