Post-editing is a very common use case for Machine Translation, and human evaluation of post-edits with MQM error annotation can reveal a treasure trove of insights that help inform engine training and other quality improvement strategies. However, a manual workflow for this becomes prohibitively costly at enterprise scale, so those insights often go undiscovered and unacted upon. How can MT teams scale this process efficiently across dozens of languages and the multiple translation tools where post-editing is done, while applying risk modeling to maximize the return on their investment in costly human evaluation? We’ll share strategies learned from our work on automating human evaluation workflows for some of the world’s best Machine Translation teams at corporations, governments, and LSPs.
Human evaluation remains a critical step in selecting the best MT model for a job. The common approach is to have a reviewer analyze a number of segments translated by the compared models, assigning them categories and post-editing some of them when needed. In other words, a reviewer is asked to make numerous decisions about very similar, out-of-context translations, which can easily result in arbitrary choices. We propose a new methodology centered around real-life post-editing of a set of cohesive, homogeneous texts. The homogeneity is established using a number of metrics on a set of preselected same-genre documents. The key assumption is that two or more homogeneous texts of identical length take approximately the same time and effort when edited by the same editor. Hence, if one text requires more work (edit distance, time spent), it is an indication of relatively lower quality of the machine translation used for that text. See details in the attached file.
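As a minimal sketch of the core comparison (not the authors’ implementation), the snippet below computes a normalized edit-distance proxy between raw MT and its post-edit for each document, alongside editing time; the documents, engine names, and timings are hypothetical placeholders.

```python
# A minimal sketch (not the authors' implementation) of the core comparison:
# per-document normalized edit distance between raw MT and its post-edit,
# used alongside editing time as a relative effort proxy.
from difflib import SequenceMatcher

def edit_effort(mt_text: str, post_edited: str) -> float:
    """Return a 0..1 effort proxy: 1 - similarity ratio of MT vs. post-edit."""
    return 1.0 - SequenceMatcher(None, mt_text, post_edited).ratio()

# Hypothetical homogeneous documents of comparable length, each translated by a
# different engine and post-edited by the same editor (time in minutes).
docs = {
    "engine_A": ("raw MT output for document 1 ...", "post-edited document 1 ...", 42.0),
    "engine_B": ("raw MT output for document 2 ...", "post-edited document 2 ...", 55.0),
}

for engine, (mt, pe, minutes) in docs.items():
    print(f"{engine}: edit effort={edit_effort(mt, pe):.3f}, time={minutes} min")
```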
Ever-improving MT quality makes it increasingly difficult for users to identify errors that are sometimes obvious and other times subtle but treacherous, as in patents and IP. Linguists, developers, and other “humans in the loop” should be ready to adapt their approaches to checking technical translations for accuracy. In this talk, real-world Chinese-to-English patent translations will be used in side-by-side comparisons of raw MT output and courtroom-ready products. The types of issues that can make or break a post-edited translation will be illustrated, with discussion of the principles underlying the error types. Certain nuances that challenge both humans and machines must be revealed in order to create a translation product that withstands the scrutiny of the attorneys, scientists, and inventors who might procure it. This talk will explore the nature of error detection and classification when reviewing patent texts translated by humans, computers, or a combination thereof.
In this talk, we cover the 2022 annual State of the Machine Translation report, prepared jointly by Intento and e2f. The report analyses the performance of 20+ commercial MT engines across 9 industries (General, Colloquial, Education, Entertainment, Financial, Healthcare, Hospitality, IT, and Legal) and 10+ key language pairs. For the first time, this report is run on a unique dataset covering all of the language/domain combinations above, prepared by e2f. The presentation will focus on the process of data selection and preparation, the report methodology, the principal scores to rely on when studying MT outcomes (COMET, BERTScore, PRISM, TER, and hLEPOR), and the main report outcomes (the best-performing MT engines for every language/domain combination). It includes a thorough comparison of the scores and also covers language support, prices, and other features of the MT engines.
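To make the scoring step concrete, here is a hedged sketch of computing one of the listed reference-based metrics, TER, with the sacrebleu package; COMET, BERTScore, PRISM, and hLEPOR come from their own packages and are omitted, and the hypothesis and reference strings are illustrative, not report data.

```python
# A hedged sketch of scoring one engine's output with a reference-based metric.
# TER is computed via sacrebleu; the sentences are illustrative placeholders.
from sacrebleu.metrics import TER

hypotheses = ["The contract is valid for two years ."]        # one MT engine's output
references = [["The agreement is valid for two years ."]]      # one reference stream

ter = TER()
result = ter.corpus_score(hypotheses, references)
print(f"TER: {result.score:.2f}")   # lower is better
```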
As global enterprises focus on improving CX (customer experience) across the world, we see the following impacts: a huge increase in dynamic, unstructured CX-related content; a substantial increase in content and translation volumes; increased use of “raw” MT; changes in the view of translation quality; and changes in the kinds of tools and processes used to enable effective massive-scale translation capabilities. This presentation will provide examples of these content changes and their impact on optimal tools and the translation production process. Several use cases and case studies will illustrate the growing need for better man-machine collaboration and will also highlight emerging best practices showing that MT has only begun its deep engagement with international business initiatives for any global enterprise.
We have all seen the successes of machine-assisted captioning, translation, and voiceovers, and we have also seen the embarrassing errors of the same engines. Real-life usage, of course, is somewhere between the two. This session will show a couple of real-life examples of Speech To Text (STT), Machine Translation (MT) and Text To Speech (TTS) using neural voices. We will look at what you would expect to be a perfect candidate for Automatic Speech Recognition (ASR) using multiple commercial engines, and then see how well the results transfer to multiple MT engines. We will also see how MT usage in AudioVisual Translation differs from standard text translation. I will also give a brief demo of how well modern neural voices perform in multiple languages based on input from AVT timed text (VTT) format files.
This presentation will show two experiments conducted to evaluate the adequacy of OpenAI’s GPT-3 (as a representative of Large Language Models) for the purposes of post-editing and translating texts from English into Spanish, using a glossary of terms to ensure term consistency. The experiments are motivated by a use case in ULG MT Production, where we need to improve the usage of terminology glossaries in our NMT system. The purpose of the experiments is to take advantage of GPT-3’s outstanding text generation capabilities for completion and editing. We used the edits end-point to post-edit the output of an NMT system using a glossary, and the completions end-point to translate the source text, including the glossary term list in the corresponding GPT-3 prompt. While the results are promising, they also show that there is room for improvement by fine-tuning the models, working on prompt engineering, and adjusting the request parameters.
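The sketch below illustrates the two end-points mentioned, using the openai Python SDK as it existed at the time of GPT-3; it is not the exact ULG setup, and the model names, glossary, and prompt wording are illustrative assumptions.

```python
# A minimal sketch, not the exact ULG setup, of the two GPT-3 end-points mentioned,
# using the openai Python SDK as available at the time. Model names, the glossary,
# and the prompt wording are illustrative assumptions.
import openai

openai.api_key = "YOUR_API_KEY"
glossary = {"invoice": "factura", "statement": "extracto"}
glossary_text = "\n".join(f"{en} -> {es}" for en, es in glossary.items())

# 1) Completions end-point: translate from scratch with the glossary in the prompt.
completion = openai.Completion.create(
    model="text-davinci-002",
    prompt=(f"Translate from English into Spanish using this glossary:\n"
            f"{glossary_text}\n\nEnglish: Please review the attached invoice.\nSpanish:"),
    max_tokens=100,
    temperature=0,
)
print(completion.choices[0].text.strip())

# 2) Edits end-point: post-edit existing NMT output so it conforms to the glossary.
edit = openai.Edit.create(
    model="text-davinci-edit-001",
    input="Por favor revise la cuenta adjunta.",   # NMT output to be post-edited
    instruction=f"Replace terminology so it follows this glossary: {glossary_text}",
)
print(edit.choices[0].text.strip())
```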
Consistency is one of the desired quality features in final translations. For human-only translations (without MT), we rely on the translator’s ability to achieve consistency. For MT, consistency is neither guaranteed nor expected. MT may actually generate inconsistencies, and it is left to the post-editor to introduce consistency manually. This work presents a method that facilitates improving consistency without the need for a glossary. It detects inconsistencies in the post-edited work and gives the post-editor the opportunity to fix the translation towards consistency. We describe the method, which is simple and involves only a short Python script, and also provide numbers that show its positive impact. This method is a contribution to a broader set of quality checks that can improve the language quality of human and MT translations.
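As a sketch of the kind of check described (not the authors’ actual script), the snippet below flags source segments that recur with more than one distinct target in the post-edited file; the sample segments are invented for illustration.

```python
# A sketch of the kind of check described (not the authors' script): flag source
# segments that recur with more than one distinct target in the post-edited file.
from collections import defaultdict

def find_inconsistencies(pairs):
    """pairs: iterable of (source, target) post-edited segments."""
    targets = defaultdict(set)
    for src, tgt in pairs:
        targets[src.strip()].add(tgt.strip())
    return {src: tgts for src, tgts in targets.items() if len(tgts) > 1}

segments = [
    ("Click Save.", "Haga clic en Guardar."),
    ("Click Save.", "Pulse Guardar."),          # inconsistent with the previous target
    ("Open the file.", "Abra el archivo."),
]

for src, variants in find_inconsistencies(segments).items():
    print(f"Inconsistent translations for '{src}': {variants}")
```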
Multilingual search is indispensable for a seamless e-commerce experience. E-commerce search engines typically support multilingual search by cascading a machine translation step before searching the index in its primary language. In practice, search query translation usually involves a translation memory matching step before machine translation. A translation memory (TM) can (i) effectively enforce terminologies for specific brands or products, (ii) reduce the computation footprint and latency of synchronous translation, and (iii) fix machine translation issues that cannot be resolved easily or quickly without retraining/tuning the machine translation engine in production. In this presentation, we will propose (1) a method for improving MT query translation using such TM entries when the TM entries are only sub-strings of a customer search query, and (2) an approach to selecting TM entries using search signals that can contribute to better search results.
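To make idea (1) concrete, here is a hedged illustration of applying TM entries that cover only sub-strings of a query and letting MT handle the uncovered spans; the TM entries are invented, and translate_mt() is a placeholder for the production MT call, not an actual API.

```python
# A hedged illustration: lock TM sub-strings inside a query and MT the remainder.
# The TM content is invented; translate_mt() stands in for the real MT engine.
tm = {"nike air max": "nike air max", "running shoes": "zapatillas de correr"}

def translate_mt(text: str) -> str:
    return f"<MT:{text}>"   # placeholder for the production MT call

def translate_query(query: str) -> str:
    tokens = query.lower().split()
    out, buffer, i = [], [], 0
    while i < len(tokens):
        # Try the longest TM match starting at position i.
        match = None
        for j in range(len(tokens), i, -1):
            candidate = " ".join(tokens[i:j])
            if candidate in tm:
                match = (candidate, j)
                break
        if match:
            if buffer:                      # flush any uncovered span to MT
                out.append(translate_mt(" ".join(buffer)))
                buffer = []
            out.append(tm[match[0]])
            i = match[1]
        else:
            buffer.append(tokens[i])
            i += 1
    if buffer:
        out.append(translate_mt(" ".join(buffer)))
    return " ".join(out)

print(translate_query("nike air max running shoes for women"))
```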
“Who said what” is essential for users to understand video streams that have more than one speaker, but conventional simultaneous interpretation systems merely present “what was said” in the form of subtitles. Because the translations unavoidably have delays and errors, users often find it difficult to trace the subtitles back to speakers. To address this problem, we propose a multimodal SI system that presents to users “who said what”. Our system takes an audio-visual approach to recognize the speaker of each sentence and then annotates its translation with the textual tag and face icon of the speaker, so that users can quickly understand the scenario. Furthermore, our system is capable of interpreting video streams in real time on a single desktop equipped with two Quadro RTX 4000 GPUs, owing to an efficient sentence-based architecture.
Machine translation has become a critical piece of the localization industry. With so many different kinds of data, how do you monitor machine translation quality in your localized content? How do you build a quality analytics framework? This paper describes a process that starts with collecting daily operations data, then cleaning the data and building the analytics framework to gain insight into it. We will share how to build the data collection matrix, the script to clean up the data, and how to run the analytics with an automation script. Finally, we will share different visualized reports, such as box plot, standard deviation, mean, MT touchpoint, and golden ratio reports.
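A minimal sketch of the kind of analytics described is shown below, assuming a daily export with one row per segment; the column names and values are illustrative, and pandas’ describe() supplies the mean, standard deviation, and the quartiles behind a box plot.

```python
# A minimal sketch of the analytics step, assuming a per-segment daily export.
# Column names and values are illustrative placeholders.
import pandas as pd

df = pd.DataFrame({
    "language": ["de", "de", "fr", "fr", "fr"],
    "edit_distance": [0.12, 0.40, 0.05, 0.22, 0.31],   # normalized PE distance
    "mt_touched": [True, True, False, True, True],      # segment edited after MT
})

# Mean, standard deviation and quartiles (the box-plot statistics) per language.
summary = df.groupby("language")["edit_distance"].describe()
print(summary[["mean", "std", "25%", "50%", "75%"]])

# Share of segments where the MT output was touched by the post-editor.
print(df.groupby("language")["mt_touched"].mean())
```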
A growing share of machine translations are approved untouched by human translators in post-editing workflows, but they still cost time and money. Now companies are getting human post-editing quality faster and cheaper by automatically approving the good machine translations at human accuracy. The approach has evolved from research papers on machine translation quality estimation, to adoption inside companies like Amazon, Facebook, Microsoft and VMware, to self-serve cloud APIs like ModelFront. We’ll walk through the motivations, use cases, prerequisites, adopters, providers, integration and ROI.
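The integration pattern boils down to thresholding a risk score per segment; the schematic sketch below shows that routing logic, where get_risk() is a placeholder rather than the actual ModelFront client API, and the threshold is something each team calibrates against human review data.

```python
# A schematic sketch of threshold-based auto-approval on top of a quality
# estimation / risk prediction service. get_risk() is a placeholder, not the
# actual ModelFront client; the threshold is calibrated per content type.
RISK_THRESHOLD = 0.15   # calibrated so auto-approved segments match human accuracy

def get_risk(source: str, translation: str) -> float:
    """Placeholder for a QE/risk-prediction API call returning a 0..1 risk score."""
    return 0.08   # dummy value for illustration only

def route(source: str, translation: str) -> str:
    """Auto-approve low-risk segments; send the rest to human post-editing."""
    return "auto-approve" if get_risk(source, translation) < RISK_THRESHOLD else "post-edit"

print(route("Das Gerät ist wasserdicht.", "The device is waterproof."))
```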
The authors of this study compared two types of translation quality scores assigned to the same sets of translation samples: 1) the ATA Grading Framework scores assigned by human experts, and 2) auto scores, including BLEU, TER, and COMET (with and without reference). They further explored the impact of different reference translations on the auto scores. Key findings from this study include: 1. auto scores that rely on reference translations depend heavily on which reference is used; 2. referenceless COMET seems promising when it is used to evaluate translations of short passages (250-300 English words); and 3. evidence suggests good agreement between the ATA-Framework score and some auto scores within a middle range, but the relationship becomes non-monotonic beyond the middle range. This study is subject to the limitation of a small sample size and is a retrospective exploratory study not specifically designed to test a pre-defined hypothesis.
Lingua is an application developed for the Church of Jesus Christ of Latter-day Saints that performs both real-time interpretation of live speeches and automatic video dubbing (AVD). Like other AVD systems, it can perform synchronized automatic dubbing given video files and, optionally, corresponding text files, using a traditional ASR–MT–TTS pipeline. Lingua’s unique contribution is that it can also operate in real time, with a slight delay of a few seconds, to interpret live speeches. If no source-language script is provided, the translations are exactly as recognized by ASR and translated by MT. If a script is provided, Lingua matches the recognized ASR segments with script segments and passes the latter to MT for translation and subsequent TTS. If a human translation is also provided, it is passed directly to TTS. Lingua switches between these modes dynamically, enabling translation of off-script comments and different levels of quality for multiple languages. (see extended abstract)
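A rough sketch of the ASR-to-script matching step is shown below; it is not Lingua’s actual matcher, and the script lines, similarity measure, and threshold are illustrative assumptions about how a recognized segment might be mapped to the provided script before MT.

```python
# A rough sketch (not Lingua's actual matcher) of aligning a recognized ASR
# segment to the closest segment of a provided script before sending it to MT.
from difflib import SequenceMatcher

script = [
    "Welcome to the conference, and thank you all for coming.",
    "Today we will talk about automatic video dubbing.",
]

def match_to_script(asr_segment: str, threshold: float = 0.75):
    """Return the best-matching script segment, or None to fall back to raw ASR."""
    best, best_ratio = None, 0.0
    for line in script:
        ratio = SequenceMatcher(None, asr_segment.lower(), line.lower()).ratio()
        if ratio > best_ratio:
            best, best_ratio = line, ratio
    return best if best_ratio >= threshold else None

asr = "welcome to the conference and thank you all for coming"
segment_for_mt = match_to_script(asr) or asr   # off-script comments pass through as ASR
print(segment_for_mt)
```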
Segment-level Quality Estimation (QE) is an increasingly sought-after task in the Machine Translation (MT) industry. In recent years, it has experienced an impressive evolution, not only thanks to the implementation of supervised models using source and hypothesis information, but also through the use of MT probabilities. This work presents a different approach to QE in which only the source segment and the Neural MT (NMT) training data are needed, making it possible to approximate translation quality before inference. Our work is based on the idea that NMT quality at the segment level depends on the degree of similarity between the source segment to be translated and the engine’s training data. The features proposed to measure this aspect of the data achieve competitive correlations with MT metrics and human judgment, and prove advantageous for the post-editing (PE) prioritization task with domain-adapted engines.
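As a simplified illustration of the underlying idea (not the paper’s exact features), the sketch below scores how similar a new source segment is to the engine’s training data using TF-IDF cosine similarity via scikit-learn; the sentences are invented examples.

```python
# A simplified illustration (not the paper's exact features): measure how similar
# a new source segment is to the engine's training data with TF-IDF cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

training_sources = [
    "The patient should take the tablet twice a day.",
    "Store the medication at room temperature.",
    "Consult your doctor before use.",
]
new_segment = ["Take one tablet twice daily with water."]

vectorizer = TfidfVectorizer().fit(training_sources)
train_vecs = vectorizer.transform(training_sources)
seg_vec = vectorizer.transform(new_segment)

# The maximum similarity to any training sentence is one possible pre-inference feature.
print(cosine_similarity(seg_vec, train_vecs).max())
```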
Knowledge distillation (KD) can be used to reduce model size and training time, without significant loss in performance. However, the process of distilling knowledge requires translation of sizeable data sets, and the translation is usually performed using large cumbersome models (teacher models). Producing such translations for KD is expensive in terms of both time and cost, which is a significant concern for translation service providers. On top of that, this process can be the cause of higher carbon footprints. In this work, we tested different variants of a teacher model for KD, tracked the power consumption of the GPUs used during translation, recorded overall translation time, estimated translation cost, and measured the accuracy of the student models. The findings of our investigation demonstrate to the translation industry a cost-effective, high-quality alternative to the standard KD training methods.
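As a hedged sketch of how GPU power draw can be sampled during teacher-model decoding, the snippet below polls nvidia-smi at a fixed interval and integrates the readings into an approximate energy figure; it assumes an NVIDIA GPU with nvidia-smi available and is an illustration of the kind of measurement described, not the authors’ tooling.

```python
# A hedged sketch of sampling GPU power draw during teacher-model decoding by
# polling nvidia-smi; an illustration of the measurement, not the authors' tooling.
import subprocess
import time

def gpu_power_watts() -> float:
    """Total instantaneous power draw across GPUs, in watts, via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=power.draw", "--format=csv,noheader,nounits"]
    )
    return sum(float(x) for x in out.decode().split())

start, interval, samples = time.time(), 5.0, []
duration = 30.0   # in practice, poll until the decoding job finishes

while time.time() - start < duration:
    samples.append(gpu_power_watts())
    time.sleep(interval)

elapsed_h = (time.time() - start) / 3600
energy_kwh = (sum(samples) / max(len(samples), 1)) * elapsed_h / 1000
print(f"Approximate energy: {energy_kwh:.3f} kWh over {elapsed_h:.3f} h")
```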
Frameworks such as Multidimensional Quality Metrics (MQM) provide detailed feedback on translation quality and can pinpoint concrete linguistic errors. The quality of a translation is, however, also closely tied to its utility in a particular use case. Many customers have highly subjective expectations of translation quality. Features such as register, discourse style and brand consistency can be difficult to accommodate given a broadly applied translation solution. In this presentation we will introduce the concept of Business Critical Errors (BCE). Adapted from MQM, the BCE framework provides a perspective on translation quality that allows us to be reactive and adaptive to expectation whilst also maintaining consistency in our translation evaluation. We will demonstrate tooling used at Unbabel that allows us to evaluate the performance of our MT models on BCE using specialized test suites as well as the ability of our AI evaluation models to successfully capture BCE information.
In this article, we trained what we believe to be the first MT system adapted to video game translation and show that very limited in-domain data is enough to largely surpass publicly available systems, while also revealing interesting findings in the resulting translations. After introducing some of the challenges of video game translation, the existing literature, and the systems and data sets used in this experiment, we present and discuss the resulting translations as well as the potential benefits of such a system. We find that the model is able to learn typical rules and patterns of video game translation from English into French, suggesting that machine translation could prove useful in this domain given the encouraging results and the specific working conditions of translators in this field. As with other use cases of MT in cultural sectors, however, we believe this depends heavily on the proper implementation of the tool, which we think could even help stimulate creativity.
At Comparis, we are rolling out our MT program for locales with limited out-of-the-box support and language pairs with limited support for customization. As a leading online company in Switzerland, our content goes from Swiss Standard German (de-CH) into fr-CH, it-CH and en-UK. Even the best generic MT engines perform poorly, and many don’t even offer customization for language pairs without English. This would have resulted in unusable raw MT and very high PE effort. So we needed custom machine translation, but at a reasonable cost and with a sustainable effort. We evaluated self-serve machine translation, machine translation quality estimation tools like ModelFront, and integration options in translation management systems (TMSes). Using our existing assets (TMs), custom MT and new AI tools, we launched a successful in-house MT program with productivity gains and iterative improvement. We also defined and launched service tiers, from light MTPE to transcreation.
This presentation demonstrates data augmentation methods for Neural Machine Translation that make use of similar translations, in a way comparable to how a human translator employs fuzzy matches. We show how we simply feed the neural model information from both the source and target sides of the fuzzy matches, and we also extend the notion of similarity to include semantically related translations retrieved using distributed sentence representations. We show that translations based on fuzzy matching provide the model with “copy” information, while translations based on embedding similarities tend to extend the translation “context”. Results indicate that the effects of both kinds of similar sentences add up to further boost accuracy, combine naturally with model fine-tuning, and provide dynamic adaptation for unseen translation pairs. Tests on multiple data sets and domains show consistent accuracy improvements.
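A hedged sketch of how a training source can be augmented with a retrieved fuzzy-match translation is shown below, in the spirit of the approach described; the separator token, the toy translation memory, and the retrieval step are simplified illustrations rather than the presenters’ exact setup.

```python
# A hedged sketch of augmenting the source with the target side of a fuzzy match,
# in the spirit of the approach described; the <FUZZY> token and toy TM are
# illustrative simplifications.
from difflib import SequenceMatcher

translation_memory = [
    ("Press the power button to start the device.",
     "Appuyez sur le bouton d'alimentation pour démarrer l'appareil."),
    ("The warranty covers two years.", "La garantie couvre deux ans."),
]

def augment(source: str, min_ratio: float = 0.6) -> str:
    """Append the target side of the best fuzzy match to the source segment."""
    best_tgt, best_ratio = None, 0.0
    for tm_src, tm_tgt in translation_memory:
        ratio = SequenceMatcher(None, source.lower(), tm_src.lower()).ratio()
        if ratio > best_ratio:
            best_tgt, best_ratio = tm_tgt, ratio
    if best_tgt and best_ratio >= min_ratio:
        return f"{source} <FUZZY> {best_tgt}"   # the model learns to copy/adapt this
    return source

print(augment("Press the power button to restart the device."))
```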
In the era of gigantic language models, and in our case Neural Machine Translation (NMT) models, where size alone seems to matter, we have been asking ourselves: is it healthy to just feed our NMT model more and more data? In this presentation, we want to show our findings on the impact on NMT performance of the different data “nutrients” we feed our models. We have explored the impact of the quantity, quality and type of data we feed to our English-Arabic NMT models. The presentation will show the impact of adding millions of parallel sentences to our training data as opposed to a much smaller data set of much higher quality, along with results from additional experiments with different data nutrients. We will highlight our learnings and challenges, share insights from our Linguistics Quality Assurance team on the advantages and disadvantages of each type of data source, and define the criteria for high-quality data with respect to a healthy NMT diet.
With the increasing availability of large-scale parallel corpora derived from web crawling and bilingual text mining, data filtering is becoming an increasingly important step in neural machine translation (NMT) pipelines. This paper applies several available tools to the task of data filtering and compares their performance in filtering out different types of noisy data. We also study the effect of filtering with each tool on model performance in the downstream task of NMT by creating a dataset containing a combination of clean and noisy data, filtering the data with each tool, and training NMT engines on the resulting filtered corpora. We evaluate the performance of each engine with a combination of direct assessment (DA) and automated metrics. Our best results are obtained by training for a short time on all available data, then filtering the corpus with cross-entropy filtering and training until convergence.
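To illustrate the cross-entropy filtering idea mentioned in the conclusion, the sketch below scores each sentence pair by the loss a publicly available Marian model (via Hugging Face transformers) assigns to the target given the source, then keeps the best-scoring fraction; the model, sample pairs, and retention ratio are illustrative assumptions, not the paper’s exact setup.

```python
# A hedged illustration of cross-entropy filtering: score each pair by the loss a
# pretrained MT model assigns to the target given the source, keep the best fraction.
# The Marian model, sample pairs and retention ratio are illustrative assumptions.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "Helsinki-NLP/opus-mt-en-de"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).eval()

def cross_entropy(src: str, tgt: str) -> float:
    """Mean per-token negative log-likelihood of tgt given src (lower = cleaner)."""
    inputs = tokenizer(src, return_tensors="pt")
    labels = tokenizer(text_target=tgt, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(**inputs, labels=labels).loss.item()

pairs = [
    ("The cat sits on the mat.", "Die Katze sitzt auf der Matte."),
    ("The cat sits on the mat.", "Kaufen Sie jetzt günstige Uhren!"),  # noisy/misaligned
]
scored = sorted(pairs, key=lambda p: cross_entropy(*p))
kept = scored[: max(1, int(len(scored) * 0.8))]   # keep the best-scoring 80%
print(kept)
```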
Machine Translate is a non-profit organization on a mission to make machine translation more accessible to more people. As the field of machine translation continues to grow, the project builds open resources and a community for developers, buyers and translators. The project is guided by three values: quality, openness and accessibility. Content is open source and welcomes contributions. It is kept up to date, and its information is presented in a clear and well-organized format. Machine Translate aims to be accessible to people from many backgrounds and, ultimately, also to non-English speakers. The project covers everything about machine translation, from products to research, from development to theory, and from history to news. The topics are very diverse, and the writing is focused on concepts rather than on mathematical details.
To unlock the value of high-quality bilingual translated documents, we need parallel data. With sentence-aligned translation pairs, we can fuel our neural machine translation, customize MT or create translation memories for our clients. To automate this process, automatic segmentation and alignment are required. Despite Arabic being the fifth biggest language in the world, language technology for Arabic often lags far behind that of other languages. We will show how we struggled to find proper sentence segmentation for Arabic and explored different frameworks, from statistical to deep learning, before fine-tuning our own Arabic DL segmentation model. We will highlight our learnings and challenges in segmenting and aligning Arabic and English bilingual data. Finally, we will show the impact on our proprietary NMT engine as we started to unlock this value and could leverage data that had been translated offline, outside CAT tools, as well as comparable corpora, to feed our NMT.
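For context, the snippet below is a baseline illustration of the problem: a naive rule-based splitter on Arabic sentence-final punctuation, the kind of simple approach that typically proves insufficient and motivates a dedicated segmentation model; the sample text and the punctuation set are illustrative.

```python
# A baseline illustration of the problem: a naive rule-based splitter on Arabic
# sentence-final punctuation. This kind of simple approach is typically what a
# dedicated DL segmentation model is meant to replace.
import re

def naive_arabic_split(text: str):
    # Split after '.', '!', '?' and the Arabic question mark '؟'.
    parts = re.split(r"(?<=[.!?؟])\s+", text.strip())
    return [p for p in parts if p]

sample = "ذهب الولد إلى المدرسة. هل رأيت الكتاب؟ نعم رأيته!"
for sentence in naive_arabic_split(sample):
    print(sentence)
```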
In this presentation, we will highlight the key technological innovations provided by Language I/O: dynamic MT engine selection based on the customer, content type, language pair, the content itself, and other metadata; our proprietary MT quality estimation mechanism, which allows customers to control their human review budget; and our Self-Improving Glossary technology, which continuously learns new keywords and key phrases from the actual content processed in the platform.
In this talk I will present a proposed user study to measure the impact of potentially misleading MT output on MT-enabled scanning of foreign language text by intelligence analysts (IAs) and the effectiveness of a practical intervention: providing output from more than one NMT system to the user. The focus of the talk will be on the approach to designing the user study to resemble scanning tasks in a measurable way with unclassified documents.
Humans use document formatting to discover document and section titles and important phrases. But when machines process a document, especially one OCRed from images, these cues are often invisible to downstream processes: words in footnotes or body text are treated as just as important as words in titles. It would be better for indexing and summarization tools to be guided by implicit document structure. In an ODNI-sponsored project, ARLIS looked at discovering formatting in OCRed text as a way to infer document structure. Most OCR engines output results as hOCR (an XML format), giving bounding boxes around characters. In theory, this also provides style information such as bolding and italicization, but in practice, this capability is limited. For example, the Tesseract OCR tool provides bounding boxes, but does not attempt to detect bold text (relevant to author emphasis and specialized fields in, e.g., print dictionaries), and its discrimination of italicization is poor. Our project inferred font size from hOCR bounding boxes and, using that and other cues (e.g., the fact that titles tend to be short), determined which text constituted section titles; from this, a document outline can be created. We also experimented with algorithms for detecting bold text. Our best algorithm has much improved recall and precision, although the exact numbers are font-dependent. The next step is to incorporate inferred structure into the output of machine translation. One way is to embed XML tags for inferred structure into the text extracted from the imaged document, and to either pass the strings enclosed by XML tags to the MT engine individually, or pass the tags through the MT engine without modification. This structural information can guide downstream bulk processing tasks such as summarization and search, and also enables building tables of contents for human users examining individual documents.
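A simplified sketch of the font-size heuristic is given below: it reads line-level bounding boxes from hOCR output and flags unusually tall, short lines as candidate section titles. The sample markup, regex-based parsing, and thresholds are illustrative assumptions, not the project’s actual implementation.

```python
# A simplified sketch of the described heuristic: infer relative font size from
# hOCR line bounding boxes and flag tall, short lines as candidate section titles.
# Sample markup, parsing and thresholds are illustrative, not the project's code.
import re

hocr = """
<span class='ocr_line' title='bbox 100 90 500 130'>1. INTRODUCTION</span>
<span class='ocr_line' title='bbox 100 150 900 172'>Machine translation of OCRed documents ...</span>
<span class='ocr_line' title='bbox 100 180 905 202'>is often complicated by the loss of layout cues.</span>
"""

lines = []
for match in re.finditer(r"title='bbox (\d+) (\d+) (\d+) (\d+)'>([^<]*)<", hocr):
    x0, y0, x1, y1 = map(int, match.groups()[:4])
    lines.append({"text": match.group(5).strip(), "height": y1 - y0})

body_height = sorted(l["height"] for l in lines)[len(lines) // 2]   # median line height
for line in lines:
    # Heuristic: taller than body text and relatively short -> likely a title.
    if line["height"] > 1.3 * body_height and len(line["text"]) < 60:
        print("TITLE:", line["text"])
```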
Singapore’s Ministry of Communications and Information (MCI) officially launched the SG Translate Together (SGTT) web portal on 27 June 2022, with the aim of partnering with citizens to improve translation standards in Singapore. This web portal houses the Singapore Government’s first neural machine translation (MT) engine, known as SG Translate, which was jointly developed by MCI and the Institute for Infocomm Research (I2R), Agency for Science, Technology and Research (A*STAR). Adapted using localised translation data, SG Translate generates translations attuned to Singapore’s context and supports Singapore’s four official languages: English (Singapore), Chinese (Singapore), Bahasa Melayu (Singapore) and Tamil (Singapore). Upon completion of development, MCI made SG Translate available to all Government agencies for their daily operations. This presentation will briefly cover the methodologies adopted and showcase SG Translate’s capability to translate content involving local culture, everyday life, and government policies and schemes. It will also showcase MCI’s sustainable approach to the continual training of the SG Translate MT engine through citizenry participation.
Cognitive effort is the core element of translation and interpreting process studies, but theoretical and practical issues such as the concept, characteristics and measurement of cognitive effort still need to be clarified. This paper first analyzes the concept and research characteristics of cognitive effort in translation and interpreting process studies. Then, based on the cost concepts (internal cost, opportunity cost) and the reward concepts (need for cognition, learned industriousness) associated with cognitive effort, it carries out a multi-dimensional analysis of the characteristics of cognitive effort. Finally, it discusses what this multi-dimensional view of cognitive effort implies for translation and interpreting process studies.
The history of machine translation (MT) covers intricate patterns of technical, policy, social, and artistic threads, many of which have been documented by researchers such as John Hutchins and Harold Somers. However, the history of MT, including the history of MT in the United States, has stories that have not yet been told or that have received only the briefest of nods despite the extraordinary work achieved. This presentation would address some of those stories, including: the U.S. government organizations, such as the Defense Advanced Research Projects Agency (DARPA) and the National Science Foundation (NSF), that created research programs, and how the values of those founding organizations shaped the development of MT. It would address the almost unknown or nearly forgotten work of the Xerox Palo Alto Research Center (PARC), the Xerox Rochester Translation Center, and Systran in the late 1980s and early 1990s to develop automated post-editing tools, confidence measures, and multi-engine solutions. It would discuss and illustrate the astounding impact of MT in movies and literature since the 1950s, which still shapes public perception of the technology as more than ready to conduct the complex, nuanced, and multilanguage business of individuals, empires, and alliances. In addition, this presentation would raise questions and promote discussion of how we as a community can continue to capture our colorful and fast-developing history. The stories and observations are drawn from research by the speaker for an article on “The History of Machine Translation in the United States,” to be published later this year in The Routledge Encyclopedia of Machine Translation.
Seeking a clear roadmap for the translation services of the future, the Government of Canada’s Translation Bureau has spent the last few years modernizing its technology infrastructure and drawing up a strategy that would let it make the best of the opportunities opened up by artificial intelligence and computer-assisted translation tools. Yet in a sector that has gone from budding to thriving in such a short time, with a myriad of options now available, it is no small feat to chart a course and move beyond the kid-in-the-candy-store stage. How can one distinguish between the flavour of the week and a sustainable way forward? Through a series of carefully planned proofs of concept (and, let’s say it, a fair share of trial and error), a clear pathway to the future is taking shape for the Translation Bureau. Answers to some of the key questions of our times are beginning to emerge... and so are the challenges that stand in the way of success. The Translation Bureau’s Innovation Director, Caroline-Soledad Mallette, recounts lessons learned, surveys the lay of the land, and outlines best practices in the search for an adaptive, best-fit solution for technology-augmented linguistic service provision. Join her as she suggests a new heading in our quest for progress: let the hype be focused not on technology, but on the people it empowers, with one ultimate goal in mind: quality.
Despite a narrowed performance gap with direct approaches, cascade solutions involving automatic speech recognition (ASR) and machine translation (MT) are still widely employed in speech translation (ST). Direct approaches, which employ a single model to translate the input speech signal, suffer from the critical bottleneck of data scarcity. In addition, many industry applications display speech transcripts alongside translations, making cascade approaches more realistic and practical. In the context of cascaded simultaneous ST, we propose several solutions to adapt a neural MT network to take as input the transcripts produced by an ASR system. Adaptation is achieved by enriching speech transcripts and MT data sets so that they more closely resemble each other, thereby improving the system’s robustness to error propagation and enhancing result legibility for humans. We address aspects such as sentence boundaries, capitalisation, punctuation, hesitations, repetitions, homophones, etc., while taking into account the low latency requirements of simultaneous ST systems.
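A hedged sketch of one adaptation direction mentioned is shown below: transforming clean MT training sources so they resemble ASR transcripts (no case, no punctuation, occasional hesitations), which makes the MT model more robust to ASR output. The hesitation list, probability, and normalization rules are illustrative assumptions, not the authors’ exact recipe.

```python
# A hedged sketch of making clean MT training data resemble ASR transcripts
# (lowercased, unpunctuated, with occasional hesitations); the hesitation list,
# probability and normalization rules are illustrative assumptions.
import random
import re

HESITATIONS = ["uh", "um", "euh"]

def asr_like(sentence: str, hesitation_prob: float = 0.1) -> str:
    s = sentence.lower()
    s = re.sub(r"[^\w\s']", "", s)             # drop punctuation, keep apostrophes
    words = []
    for w in s.split():
        if random.random() < hesitation_prob:  # inject a hesitation before some words
            words.append(random.choice(HESITATIONS))
        words.append(w)
    return " ".join(words)

random.seed(0)
print(asr_like("Good morning, everyone: let's begin the session."))
```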
The National Virtual Translation Center (NVTC) and the larger Federal Bureau of Investigation (FBI) seek to acquire tools that will facilitate their mission to provide English translations of non-English language audio and video files. In the text domain, NVTC has been using translation memory (TM) for some time and has reported on the incorporation of machine translation (MT) into that workflow. While we have explored the use of speech-to-text (STT) and speech translation (ST) in the past, we have now invested in the creation of a substantial human-created corpus to thoroughly evaluate alternatives in three languages: French, Russian, and Persian. We report on the results of multiple STT systems combined with four MT systems for these languages, and we evaluated, scored, and analyzed the different system combinations. This points the way to the most successful tool combination to deploy in this workflow.