Stephan Vogel

Also published as: S. Vogel, Stephen Vogel


2018

pdf bib
The WAW Corpus: The First Corpus of Interpreted Speeches and their Translations for English and Arabic
Ahmed Abdelali | Irina Temnikova | Samy Hedaya | Stephan Vogel
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Incremental Decoding and Training Methods for Simultaneous Translation in Neural Machine Translation
Fahim Dalvi | Nadir Durrani | Hassan Sajjad | Stephan Vogel
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

We address the problem of simultaneous translation by modifying the Neural MT decoder to operate with dynamically built encoder and attention. We propose a tunable agent which decides the best segmentation strategy for a user-defined BLEU loss and Average Proportion (AP) constraint. Our agent outperforms previously proposed Wait-if-diff and Wait-if-worse agents (Cho and Esipova, 2016) on BLEU with lower latency. Second, we propose data-driven changes to Neural MT training to better match the incremental decoding framework.

2017

pdf bib
Neural Machine Translation Training in a Multi-Domain Scenario
Hassan Sajjad | Nadir Durrani | Fahim Dalvi | Yonatan Belinkov | Stephan Vogel
Proceedings of the 14th International Conference on Spoken Language Translation

In this paper, we explore alternative ways to train a neural machine translation system in a multi-domain scenario. We investigate data concatenation (with fine-tuning), model stacking (multi-level fine-tuning), data selection and multi-model ensemble. Our findings show that the best translation quality can be achieved by building an initial system on a concatenation of available out-of-domain data and then fine-tuning it on in-domain data. Model stacking works best when training begins with the furthest out-of-domain data and the model is incrementally fine-tuned with the next furthest domain and so on. Data selection did not give the best results, but can be considered a decent compromise between training time and translation quality. A weighted ensemble of different individual models performed better than data selection. It is beneficial in a scenario where there is no time for fine-tuning an already trained model.

pdf bib
Understanding and Improving Morphological Learning in the Neural Machine Translation Decoder
Fahim Dalvi | Nadir Durrani | Hassan Sajjad | Yonatan Belinkov | Stephan Vogel
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

End-to-end training makes the neural machine translation (NMT) architecture simpler and more elegant than traditional statistical machine translation (SMT). However, little is known about the linguistic patterns of morphology, syntax and semantics learned during the training of NMT systems and, more importantly, which parts of the architecture are responsible for learning each of these phenomena. In this paper we i) analyze how much morphology an NMT decoder learns, and ii) investigate whether injecting target morphology in the decoder helps it to produce better translations. To this end we present three methods: i) simultaneous translation, ii) joint-data learning, and iii) multi-task learning. Our results show that explicit morphological information helps the decoder learn target language morphology and improves the translation quality by 0.2–0.6 BLEU points.

pdf bib
Challenging Language-Dependent Segmentation for Arabic: An Application to Machine Translation and Part-of-Speech Tagging
Hassan Sajjad | Fahim Dalvi | Nadir Durrani | Ahmed Abdelali | Yonatan Belinkov | Stephan Vogel
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Word segmentation plays a pivotal role in improving any Arabic NLP application. Therefore, much research effort has been spent on improving its accuracy. Off-the-shelf tools, however, are: i) complicated to use and ii) domain/dialect dependent. We explore three language-independent alternatives to morphological segmentation using: i) data-driven sub-word units, ii) characters as a unit of learning, and iii) word embeddings learned using a character CNN (Convolutional Neural Network). On the tasks of Machine Translation and POS tagging, we found that these methods come close to, and occasionally surpass, state-of-the-art performance. In our analysis, we show that a neural machine translation system is sensitive to the ratio of source and target tokens, and that a ratio close to 1 or greater gives optimal performance.

pdf bib
QCRI Live Speech Translation System
Fahim Dalvi | Yifan Zhang | Sameer Khurana | Nadir Durrani | Hassan Sajjad | Ahmed Abdelali | Hamdy Mubarak | Ahmed Ali | Stephan Vogel
Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics

This paper presents QCRI’s Arabic-to-English live speech translation system. It features modern web technologies to capture live audio, and broadcasts Arabic transcriptions and English translations simultaneously. Our Kaldi-based ASR system uses the Time Delay Neural Network (TDNN) architecture, while our Machine Translation (MT) system uses both phrase-based and neural frameworks. Although our neural MT system is slower than the phrase-based system, it produces significantly better translations and is memory efficient. The demo is available at https://st.qcri.org/demos/livetranslation.

pdf bib
The SUMMA Platform Prototype
Renars Liepins | Ulrich Germann | Guntis Barzdins | Alexandra Birch | Steve Renals | Susanne Weber | Peggy van der Kreeft | Hervé Bourlard | João Prieto | Ondřej Klejch | Peter Bell | Alexandros Lazaridis | Alfonso Mendes | Sebastian Riedel | Mariana S. C. Almeida | Pedro Balage | Shay B. Cohen | Tomasz Dwojak | Philip N. Garner | Andreas Giefer | Marcin Junczys-Dowmunt | Hina Imran | David Nogueira | Ahmed Ali | Sebastião Miranda | Andrei Popescu-Belis | Lesly Miculicich Werlen | Nikos Papasarantopoulos | Abiola Obamuyide | Clive Jones | Fahim Dalvi | Andreas Vlachos | Yang Wang | Sibo Tong | Rico Sennrich | Nikolaos Pappas | Shashi Narayan | Marco Damonte | Nadir Durrani | Sameer Khurana | Ahmed Abdelali | Hassan Sajjad | Stephan Vogel | David Sheppey | Chris Hernon | Jeff Mitchell
Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics

We present the first prototype of the SUMMA Platform: an integrated platform for multilingual media monitoring. The platform contains a rich suite of low-level and high-level natural language processing technologies: automatic speech recognition of broadcast media, machine translation, automated tagging and classification of named entities, semantic parsing to detect relationships between entities, and automatic construction / augmentation of factual knowledge bases. Implemented on the Docker platform, it can easily be deployed, customised, and scaled to large volumes of incoming media streams.

pdf bib
Robust Tuning Datasets for Statistical Machine Translation
Preslav Nakov | Stephan Vogel
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

We explore the idea of automatically crafting a tuning dataset for Statistical Machine Translation (SMT) that makes the hyper-parameters of the SMT system more robust with respect to some specific deficiencies of the parameter tuning algorithms. This is an under-explored research direction, which can allow better parameter tuning. In this paper, we achieve this goal by selecting a subset of the available sentence pairs that is more suitable for specific combinations of optimizers, objective functions, and evaluation measures. We demonstrate the potential of the idea with the pairwise ranking optimization (PRO) optimizer, which is known to yield translations that are too short. We show that the learning problem can be alleviated by tuning on a subset of the development set, selected based on sentence length. In particular, using the longest 50% of the tuning sentences, we achieve a two-fold tuning speedup and improvements in BLEU score that rival those of alternatives, which fix BLEU+1’s smoothing instead.
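
The length-based selection described in this abstract can be illustrated with a minimal sketch. It assumes that length is measured in whitespace-separated source tokens and that the tuning set is a list of (source, reference) pairs; the abstract does not specify either, and the function name `longest_half` is hypothetical.

```python
def longest_half(pairs):
    """Keep the longest 50% of (source, reference) pairs.

    Assumption: length is the number of whitespace-separated source tokens.
    """
    ranked = sorted(pairs, key=lambda p: len(p[0].split()), reverse=True)
    keep = (len(ranked) + 1) // 2  # round up for odd-sized sets
    return ranked[:keep]


# Toy development set (invented sentences, for illustration only).
dev_set = [
    ("a b c", "x y z"),
    ("a", "x"),
    ("a b", "x y"),
    ("a b c d", "w x y z"),
]
tuning_subset = longest_half(dev_set)
```

The subset would then be passed to the tuning algorithm in place of the full development set.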

bib
Proceedings of the Workshop Human-Informed Translation and Interpreting Technology
Irina Temnikova | Constantin Orasan | Gloria Corpas Pastor | Stephan Vogel
Proceedings of the Workshop Human-Informed Translation and Interpreting Technology

pdf bib
Interpreting Strategies Annotation in the WAW Corpus
Irina Temnikova | Ahmed Abdelali | Samy Hedaya | Stephan Vogel | Aishah Al Daher
Proceedings of the Workshop Human-Informed Translation and Interpreting Technology

With the aim of teaching our automatic speech-to-text translation system human interpreting strategies, our first step is to identify which interpreting strategies are most often used in the language pair of interest (English-Arabic). In this article we run an automatic analysis of a corpus of parallel speeches and their human interpretations, and provide the results of manually annotating the human interpreting strategies in a sample of the corpus. We give a glimpse of the corpus, whose value surpasses the fact that it contains a high number of scientific speeches with their interpretations from English into Arabic, as it also provides rich information about the interpreters. We also discuss the difficulties we encountered, as well as our solutions to them: our methodology for manual re-segmentation and alignment of parallel segments, the choice of annotation tool, and the annotation procedure. Our annotation findings explain the previously extracted specific statistical features of the interpreted corpus (compared with a translation one) as well as the quality of interpretation provided by different interpreters.

2016

bib
An Empirical Study: Post-editing Effort for English to Arabic Hybrid Machine Translation
Hassan Sajjad | Francisco Guzman | Stephan Vogel
Conferences of the Association for Machine Translation in the Americas: MT Users' Track

pdf bib
Applying the Cognitive Machine Translation Evaluation Approach to Arabic
Irina Temnikova | Wajdi Zaghouani | Stephan Vogel | Nizar Habash
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The goal of the cognitive machine translation (MT) evaluation approach is to build classifiers which assign post-editing effort scores to new texts. The approach helps estimate fair compensation for post-editors in the translation industry by evaluating the cognitive difficulty of post-editing MT output. The approach counts the number of errors classified in different categories on the basis of how much cognitive effort they require in order to be corrected. In this paper, we present the results of applying an existing cognitive evaluation approach to Modern Standard Arabic (MSA). We provide a comparison of the number of errors and categories of errors in three MSA texts of different MT quality (without any language-specific adaptation), as well as a comparison between MSA texts and texts from three Indo-European languages (Russian, Spanish, and Bulgarian), taken from a previous experiment. The results show how the error distributions change passing from the MSA texts of worse MT quality to MSA texts of better MT quality, as well as a similarity in distinguishing the texts of better MT quality for all four languages.

pdf bib
Eyes Don’t Lie: Predicting Machine Translation Quality Using Eye Movement
Hassan Sajjad | Francisco Guzmán | Nadir Durrani | Ahmed Abdelali | Houda Bouamor | Irina Temnikova | Stephan Vogel
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
QCRI’s Machine Translation Systems for IWSLT’16
Nadir Durrani | Fahim Dalvi | Hassan Sajjad | Stephan Vogel
Proceedings of the 13th International Conference on Spoken Language Translation

This paper describes QCRI’s machine translation systems for the IWSLT 2016 evaluation campaign. We participated in the Arabic→English and English→Arabic tracks. We built both phrase-based and neural machine translation models, in an effort to probe whether the newly emerged NMT framework surpasses the traditional phrase-based systems in Arabic-English language pairs. We trained a very strong phrase-based system including a big language model, the Operation Sequence Model, the Neural Network Joint Model (NNJM) and class-based models, along with different domain adaptation techniques such as MML filtering, mixture modeling and fine-tuning of the NNJM model. However, a Neural MT system, trained by stacking data from different genres through fine-tuning and applying an ensemble over 8 models, beat our very strong phrase-based system by a significant margin of 2 BLEU points in the Arabic→English direction. We did not obtain similar gains in the other direction but were still able to outperform the phrase-based system. We also applied system combination on phrase-based and NMT outputs.

2015

pdf bib
How do Humans Evaluate Machine Translation
Francisco Guzmán | Ahmed Abdelali | Irina Temnikova | Hassan Sajjad | Stephan Vogel
Proceedings of the Tenth Workshop on Statistical Machine Translation

pdf bib
Proceedings of the Second Workshop on Arabic Natural Language Processing
Nizar Habash | Stephan Vogel | Kareem Darwish
Proceedings of the Second Workshop on Arabic Natural Language Processing

pdf bib
Using joint models for domain adaptation in statistical machine translation
Nadir Durrani | Hassan Sajjad | Shafiq Joty | Ahmed Abdelali | Stephan Vogel
Proceedings of Machine Translation Summit XV: Papers

pdf bib
Analyzing Optimization for Statistical Machine Translation: MERT Learns Verbosity, PRO Learns Length
Francisco Guzmán | Preslav Nakov | Stephan Vogel
Proceedings of the Nineteenth Conference on Computational Natural Language Learning

pdf bib
How to Avoid Unwanted Pregnancies: Domain Adaptation using Neural Network Models
Shafiq Joty | Hassan Sajjad | Nadir Durrani | Kamla Al-Mannai | Ahmed Abdelali | Stephan Vogel
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

2014

pdf bib
Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP)
Nizar Habash | Stephan Vogel
Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP)

pdf bib
Unsupervised Word Segmentation Improves Dialectal Arabic to English Machine Translation
Kamla Al-Mannai | Hassan Sajjad | Alaa Khader | Fahad Al Obaidli | Preslav Nakov | Stephan Vogel
Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP)

pdf bib
Advances in dialectal Arabic speech recognition: a study using Twitter to improve Egyptian ASR
Ahmed Ali | Hamdy Mubarak | Stephan Vogel
Proceedings of the 11th International Workshop on Spoken Language Translation: Papers

This paper reports results in building an Egyptian Arabic speech recognition system as an example for under-resourced languages. We investigated different approaches to building the system using 10 hours for training the acoustic model, and report results for both a grapheme system and a phoneme system using MADA. The phoneme-based system shows better results than the grapheme-based system. In this paper, we explore the use of tweets written in dialectal Arabic. Using 880K Egyptian tweets reduced the Out Of Vocabulary (OOV) rate from 15.1% to 3.2% and the WER from 59.6% to 44.7%, a relative gain of 25% in WER.
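
The relative gain quoted in this abstract follows from the absolute figures. A short worked check, using only the numbers given above (the helper name `relative_reduction` is ours, not the paper's):

```python
def relative_reduction(before, after):
    """Relative reduction of an error rate, as a fraction of the starting value."""
    return (before - after) / before


wer_gain = relative_reduction(59.6, 44.7)  # WER: 59.6% -> 44.7%
oov_gain = relative_reduction(15.1, 3.2)   # OOV rate: 15.1% -> 3.2%

print(f"relative WER reduction: {wer_gain:.1%}")  # 25.0%
print(f"relative OOV reduction: {oov_gain:.1%}")  # 78.8%
```

The WER figure matches the 25% relative gain stated in the abstract; the OOV figure is derived here and is not stated in the source.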

pdf bib
The AMARA Corpus: Building Parallel Language Resources for the Educational Domain
Ahmed Abdelali | Francisco Guzman | Hassan Sajjad | Stephan Vogel
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents the AMARA corpus of on-line educational content: a new parallel corpus of educational video subtitles, multilingually aligned for 20 languages, i.e. 20 monolingual corpora and 190 parallel corpora. This corpus includes both resource-rich languages such as English and Arabic, and resource-poor languages such as Hindi and Thai. In this paper, we describe the gathering, validation, and preprocessing of a large collection of parallel, community-generated subtitles. Furthermore, we describe the methodology used to prepare the data for Machine Translation tasks. Additionally, we provide document-level, jointly aligned development and test sets for 14 language pairs, designed for tuning and testing Machine Translation systems. We provide baseline results for these tasks, and highlight some of the challenges we face when building machine translation systems for educational content.
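
The count of 190 parallel corpora is the number of unordered pairs that can be formed from 20 languages, 20 × 19 / 2. A quick check, with placeholder language labels standing in for the actual language list (which the abstract does not enumerate):

```python
from itertools import combinations

# Placeholder labels; the real corpus uses 20 actual languages.
languages = [f"lang{i:02d}" for i in range(20)]

# Each unordered pair of languages corresponds to one parallel corpus.
language_pairs = list(combinations(languages, 2))

print(len(languages))       # 20
print(len(language_pairs))  # 190
```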

2013

pdf bib
Integrating Phrase-based Reordering Features into a Chart-based Decoder for Machine Translation
ThuyLinh Nguyen | Stephan Vogel
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
A Tale about PRO and Monsters
Preslav Nakov | Francisco Guzmán | Stephan Vogel
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
QCRI at IWSLT 2013: experiments in Arabic-English and English-Arabic spoken language translation
Hassan Sajjad | Francisco Guzmán | Preslav Nakov | Ahmed Abdelali | Kenton Murray | Fahad Al Obaidli | Stephan Vogel
Proceedings of the 10th International Workshop on Spoken Language Translation: Evaluation Campaign

We describe the Arabic-English and English-Arabic statistical machine translation systems developed by the Qatar Computing Research Institute for the IWSLT’2013 evaluation campaign on spoken language translation. We used one phrase-based and two hierarchical decoders, exploring various settings thereof. We further experimented with three domain adaptation methods, and with various Arabic word segmentation schemes. Combining the output of several systems yielded a gain of up to 3.4 BLEU points over the baseline. Here we also describe a specialized normalization scheme for evaluating Arabic output, which was adopted for the IWSLT’2013 evaluation campaign.

pdf bib
The AMARA corpus: building resources for translating the web’s educational content
Francisco Guzman | Hassan Sajjad | Stephan Vogel | Ahmed Abdelali
Proceedings of the 10th International Workshop on Spoken Language Translation: Papers

In this paper, we introduce a new parallel corpus of subtitles of educational videos: the AMARA corpus for online educational content. We crawl a multilingual collection of community-generated subtitles, and present the results of processing the Arabic–English portion of the data, which yields a parallel corpus of about 2.6M Arabic and 3.9M English words. We explore different approaches to align the segments, and extrinsically evaluate the resulting parallel corpus on the standard TED-talks tst-2010. We observe that the data can be successfully used for this task, and also observe an absolute improvement of 1.6 BLEU when it is used in combination with TED data. Finally, we analyze some of the specific challenges when translating educational content.

pdf bib
Parameter Optimization for Statistical Machine Translation: It Pays to Learn from Hard Examples
Preslav Nakov | Fahad Al Obaidli | Francisco Guzmán | Stephan Vogel
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

pdf bib
MT Quality Estimation: The CMU System for WMT’13
Silja Hildebrand | Stephan Vogel
Proceedings of the Eighth Workshop on Statistical Machine Translation

2012

pdf bib
QCRI at WMT12: Experiments in Spanish-English and German-English Machine Translation of News Text
Francisco Guzmán | Preslav Nakov | Ahmed Thabet | Stephan Vogel
Proceedings of the Seventh Workshop on Statistical Machine Translation

pdf bib
Understanding the Performance of Statistical MT Systems: A Linear Regression Framework
Francisco Guzman | Stephan Vogel
Proceedings of COLING 2012

pdf bib
Optimizing for Sentence-Level BLEU+1 Yields Short Translations
Preslav Nakov | Francisco Guzman | Stephan Vogel
Proceedings of COLING 2012

pdf bib
Nonparametric Model for Inupiaq Word Segmentation
Thuy Linh Nguyen | Stephan Vogel
Proceedings of COLING 2012: Demonstration Papers

2011

pdf bib
A Word-Class Approach to Labeling PSCFG Rules for Machine Translation
Andreas Zollmann | Stephan Vogel
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Corpus Expansion for Statistical Machine Translation with Semantic Role Label Substitution Rules
Qin Gao | Stephan Vogel
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Dealing with Spurious Ambiguity in Learning ITG-based Word Alignment
Shujian Huang | Stephan Vogel | Jiajun Chen
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Utilizing Target-Side Semantic Role Labels to Assist Hierarchical Phrase-based Machine Translation
Qin Gao | Stephan Vogel
Proceedings of Fifth Workshop on Syntax, Semantics and Structure in Statistical Translation

pdf bib
Extracting Parallel Phrases from Comparable Data
Sanjika Hewavitharana | Stephan Vogel
Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web

pdf bib
Active Learning with Multiple Annotations for Comparable Data Classification Task
Vamshi Ambati | Sanjika Hewavitharana | Stephan Vogel | Jaime Carbonell
Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web

pdf bib
Wider Context by Using Bilingual Language Models in Machine Translation
Jan Niehues | Teresa Herrmann | Stephan Vogel | Alex Waibel
Proceedings of the Sixth Workshop on Statistical Machine Translation

pdf bib
CMU Haitian Creole-English Translation System for WMT 2011
Sanjika Hewavitharana | Nguyen Bach | Qin Gao | Vamshi Ambati | Stephan Vogel
Proceedings of the Sixth Workshop on Statistical Machine Translation

pdf bib
Crisis MT: Developing A Cookbook for MT in Crisis Situations
William Lewis | Robert Munro | Stephan Vogel
Proceedings of the Sixth Workshop on Statistical Machine Translation

pdf bib
Multi-Strategy Approaches to Active Learning for Statistical Machine Translation
Vamshi Ambati | Stephan Vogel | Jaime Carbonell
Proceedings of Machine Translation Summit XIII: Papers

pdf bib
TriS: A Statistical Sentence Simplifier with Log-linear Models and Margin-based Discriminative Training
Nguyen Bach | Qin Gao | Stephan Vogel | Alex Waibel
Proceedings of 5th International Joint Conference on Natural Language Processing

pdf bib
Extending a probabilistic phrase alignment approach for SMT
Mridul Gupta | Sanjika Hewavitharana | Stephan Vogel
Proceedings of the 8th International Workshop on Spoken Language Translation: Evaluation Campaign

Phrase alignment is a crucial step in phrase-based statistical machine translation. We explore a way of improving phrase alignment by adding syntactic information in the form of chunks as soft constraints, guided by an in-depth and detailed analysis of a hand-aligned data set. We extend a probabilistic phrase alignment model that extracts phrase pairs by optimizing phrase pair boundaries over the sentence pair [1]. The boundaries of the target phrase are chosen such that the overall sentence alignment probability is optimal. Viterbi alignment information is also added in the extended model with a view to improving phrase alignment. We extract phrase pairs using a relatively large number of features, which are discriminatively trained using a large-margin online learning algorithm, i.e., the Margin Infused Relaxed Algorithm (MIRA), and integrate it in our approach. Initial experiments show improvements in both phrase alignment and translation quality for Arabic-English on a moderate-size translation task.

2010

pdf bib
Active Semi-Supervised Learning for Improving Word Alignment
Vamshi Ambati | Stephan Vogel | Jaime Carbonell
Proceedings of the NAACL HLT 2010 Workshop on Active Learning for Natural Language Processing

pdf bib
Semi-supervised Word Alignment with Mechanical Turk
Qin Gao | Stephan Vogel
Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk

pdf bib
Can Crowds Build parallel corpora for Machine Translation Systems?
Vamshi Ambati | Stephan Vogel
Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk

pdf bib
A Semi-Supervised Word Alignment Algorithm with Partial Manual Alignments
Qin Gao | Nguyen Bach | Stephan Vogel
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

pdf bib
CMU System Combination via Hypothesis Selection for WMT’10
Almut Silja Hildebrand | Stephan Vogel
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

pdf bib
New Parameterizations and Features for PSCFG-Based Machine Translation
Andreas Zollmann | Stephan Vogel
Proceedings of the 4th Workshop on Syntax and Structure in Statistical Translation

pdf bib
A Multi-layer Chinese Word Segmentation System Optimized for Out-of-domain Tasks
Qin Gao | Stephan Vogel
CIPS-SIGHAN Joint Conference on Chinese Language Processing

pdf bib
Fixed Length Word Suffix for Factored Statistical Machine Translation
Narges Sharif Razavian | Stephan Vogel
Proceedings of the ACL 2010 Conference Short Papers

pdf bib
Active Learning-Based Elicitation for Semi-Supervised Word Alignment
Vamshi Ambati | Stephan Vogel | Jaime Carbonell
Proceedings of the ACL 2010 Conference Short Papers

pdf bib
Active Learning and Crowd-Sourcing for Machine Translation
Vamshi Ambati | Stephan Vogel | Jaime Carbonell
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Large-scale parallel data generation for new language pairs requires intensive human effort and the availability of experts. It becomes immensely difficult and costly to provide Statistical Machine Translation (SMT) systems for most languages due to the paucity of expert translators to provide parallel data. Even if experts are present, it appears infeasible due to the costs involved. In this paper we propose Active Crowd Translation (ACT), a new paradigm where active learning and crowd-sourcing come together to enable automatic translation for low-resource language pairs. Active learning aims at reducing the cost of label acquisition by prioritizing the most informative data for annotation, while crowd-sourcing reduces cost by using the power of the crowds to make up for the lack of expensive language experts. We experiment and compare our active learning strategies with strong baselines and see significant improvements in translation quality. Similarly, our experiments with crowd-sourcing on Mechanical Turk have shown that it is possible to create parallel corpora using non-experts and that, with sufficient quality assurance, a translation system trained using this corpus approaches expert quality.

pdf bib
EMDC: A Semi-supervised Approach for Word Alignment
Qin Gao | Francisco Guzman | Stephan Vogel
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

pdf bib
Nonparametric Word Segmentation for Machine Translation
ThuyLinh Nguyen | Stephan Vogel | Noah A. Smith
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

2009

pdf bib
Preference Grammars: Softening Syntactic Constraints to Improve Statistical Machine Translation
Ashish Venugopal | Andreas Zollmann | Noah A. Smith | Stephan Vogel
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics

pdf bib
Cohesive Constraints in A Beam Search Phrase-based Decoder
Nguyen Bach | Stephan Vogel | Colin Cherry
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers

pdf bib
Incremental Adaptation of Speech-to-Speech Translation
Nguyen Bach | Roger Hsiao | Matthias Eck | Paisarn Charoenpornsawat | Stephan Vogel | Tanja Schultz | Ian Lane | Alex Waibel | Alan Black
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers

pdf bib
CMU System Combination for WMT’09
Almut Silja Hildebrand | Stephan Vogel
Proceedings of the Fourth Workshop on Statistical Machine Translation

pdf bib
Source-side Dependency Tree Reordering Models with Subtree Movements and Constraints
Nguyen Bach | Qin Gao | Stephan Vogel
Proceedings of Machine Translation Summit XII: Papers

pdf bib
Reassessment of the Role of Phrase Extraction in PBSMT
Francisco Guzman | Qin Gao | Stephan Vogel
Proceedings of Machine Translation Summit XII: Papers

2008

pdf bib
Recent Improvements in the CMU Large Scale Chinese-English SMT System
Almut Silja Hildebrand | Kay Rottmann | Mohamed Noamany | Qin Gao | Sanjika Hewavitharana | Nguyen Bach | Stephan Vogel
Proceedings of ACL-08: HLT, Short Papers

pdf bib
Discriminative Word Alignment via Alignment Matrix Modeling
Jan Niehues | Stephan Vogel
Proceedings of the Third Workshop on Statistical Machine Translation

pdf bib
Improving Word Alignment with Language Model Based Confidence Scores
Nguyen Bach | Qin Gao | Stephan Vogel
Proceedings of the Third Workshop on Statistical Machine Translation

pdf bib
Parallel Implementations of Word Alignment Tool
Qin Gao | Stephan Vogel
Software Engineering, Testing, and Quality Assurance for Natural Language Processing

pdf bib
Context-based Arabic Morphological Analysis for Machine Translation
ThuyLinh Nguyen | Stephan Vogel
CoNLL 2008: Proceedings of the Twelfth Conference on Computational Natural Language Learning

pdf bib
Wider Pipelines: N-Best Alignments and Parses in MT Training
Ashish Venugopal | Andreas Zollmann | Noah A. Smith | Stephan Vogel
Proceedings of the 8th Conference of the Association for Machine Translation in the Americas: Research Papers

State-of-the-art statistical machine translation systems use hypotheses from several maximum a posteriori inference steps, including word alignments and parse trees, to identify translational structure and estimate the parameters of translation models. While this approach leads to a modular pipeline of independently developed components, errors made in these “single-best” hypotheses can propagate to downstream estimation steps that treat these inputs as clean, trustworthy training data. In this work we integrate N-best alignments and parses by using a probability distribution over these alternatives to generate posterior fractional counts for use in downstream estimation. Using these fractional counts in a DOP-inspired syntax-based translation system, we show significant improvements in translation quality over a single-best trained baseline.
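
The idea of posterior fractional counts can be sketched as follows. This is a generic illustration, not the paper's estimation procedure: it assumes each N-best entry carries a log score and a list of extracted events (for example, rule identifiers), normalizes the scores into a posterior over the N alternatives, and credits each event with the posterior mass of the alternatives that contain it. The function name and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def fractional_counts(nbest):
    """Turn an N-best list into posterior-weighted event counts.

    nbest: list of (log_score, events) tuples, one per alternative.
    Returns a dict mapping each event to its expected (fractional) count.
    """
    max_log = max(score for score, _ in nbest)  # subtract max for numerical stability
    weights = [math.exp(score - max_log) for score, _ in nbest]
    total = sum(weights)

    counts = defaultdict(float)
    for weight, (_, events) in zip(weights, nbest):
        posterior = weight / total
        for event in events:
            counts[event] += posterior
    return dict(counts)


# Toy 3-best list with invented scores and event names.
toy_nbest = [
    (math.log(0.6), ["r1", "r2"]),
    (math.log(0.3), ["r1", "r3"]),
    (math.log(0.1), ["r3"]),
]
toy_counts = fractional_counts(toy_nbest)
```

With a single-best pipeline, only `r1` and `r2` would be counted, each with weight 1; here `r3` also receives mass from the lower-ranked alternatives.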

pdf bib
Combination of Machine Translation Systems via Hypothesis Selection from Combined N-Best Lists
Almut Silja Hildebrand | Stephan Vogel
Proceedings of the 8th Conference of the Association for Machine Translation in the Americas: Student Research Workshop

Different approaches in machine translation achieve similar translation quality with a variety of translations in the output. Recently it has been shown that it is possible to leverage the individual strengths of various systems and improve the overall translation quality by combining translation outputs. In this paper we present a method of hypothesis selection which is relatively simple compared to system combination methods which construct a synthesis of the input hypotheses. Our method uses information from n-best lists from several MT systems and features on the sentence level which are independent of the MT systems involved to improve the translation quality.
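
Hypothesis selection over combined n-best lists can be sketched as pooling the candidates from all systems and picking the one with the highest score under system-independent sentence-level features. The linear scoring, the two toy features, and the weights below are assumptions made for illustration; the abstract does not list the features or the model actually used.

```python
def select_hypothesis(nbest_lists, weights, feature_fn):
    """Pool the n-best lists of several systems and return the best-scoring hypothesis.

    feature_fn maps a hypothesis string to a dict of sentence-level features
    that do not depend on which system produced it.
    """
    pooled = [hyp for nbest in nbest_lists for hyp in nbest]

    def score(hyp):
        feats = feature_fn(hyp)
        return sum(weights[name] * value for name, value in feats.items())

    return max(pooled, key=score)


def toy_features(hyp):
    """Hypothetical features: token count and ratio of distinct tokens."""
    tokens = hyp.split()
    return {"length": len(tokens), "distinct_ratio": len(set(tokens)) / len(tokens)}


# Invented outputs of two systems for the same source sentence.
system_a = ["the house is small", "the house small"]
system_b = ["the the house is small", "house is small"]
toy_weights = {"length": 0.1, "distinct_ratio": 1.0}

best = select_hypothesis([system_a, system_b], toy_weights, toy_features)
```

Unlike synthesis-based combination, the output here is always one of the original hypotheses, unchanged.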

pdf bib
Diacritization as a Machine Translation and as a Sequence Labeling Problem
Tim Schlippe | ThuyLinh Nguyen | Stephan Vogel
Proceedings of the 8th Conference of the Association for Machine Translation in the Americas: Student Research Workshop

In this paper we describe and compare two techniques for the automatic diacritization of Arabic text: First, we treat diacritization as a monotone machine translation problem, proposing and evaluating several translation and language models, including word and character-based models separately and combined as well as a model which uses statistical machine translation (SMT) to post-edit a rule-based diacritization system. Then we explore a more traditional view of diacritization as a sequence labeling problem, and propose a solution using conditional random fields (Lafferty et al., 2001). All these techniques are compared through word error rate and diacritization error rate both in terms of full diacritization and ignoring vowel endings. The empirical experiments showed that the machine translation approaches perform better than the sequence labeling approaches concerning the error rates.

pdf bib
Communicating Unknown Words in Machine Translation
Matthias Eck | Stephan Vogel | Alex Waibel
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

A new approach to handle unknown words in machine translation is presented. The basic idea is to find definitions for the unknown words on the source language side and translate those definitions instead. Only monolingual resources are required, which generally offer a broader coverage than bilingual resources and are available for a large number of languages. In order to use this in a machine translation system, definitions are extracted automatically from online dictionaries and encyclopedias. The translated definition is then inserted and clearly marked in the original hypothesis. This is shown to lead to significant improvements in (subjective) translation quality.

pdf bib
The CMU syntax-augmented machine translation system: SAMT on Hadoop with n-best alignments.
Andreas Zollmann | Ashish Venugopal | Stephan Vogel
Proceedings of the 5th International Workshop on Spoken Language Translation: Evaluation Campaign

We present the CMU Syntax Augmented Machine Translation System that was used in the IWSLT-08 evaluation campaign. We participated in the Full-BTEC data track for Chinese-English translation, focusing on transcript translation. For this year’s evaluation, we ported the Syntax Augmented MT toolkit [1] to the Hadoop MapReduce [2] parallel processing architecture, allowing us to efficiently run experiments evaluating a novel “wider pipelines” approach to integrate evidence from N-best alignments into our translation models. We describe each step of the MapReduce pipeline as it is implemented in the open-source SAMT toolkit, and show improvements in translation quality by using N-best alignments in both hierarchical and syntax augmented translation systems.

2007

pdf bib
Word reordering in statistical machine translation with a POS-based distortion model
Kay Rottmann | Stephan Vogel
Proceedings of the 11th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages: Papers

pdf bib
A Walk on the Other Side: Using SMT Components in a Transfer-Based Translation System
Ariadna Font Llitjós | Stephan Vogel
Proceedings of SSST, NAACL-HLT 2007 / AMTA Workshop on Syntax and Structure in Statistical Translation

pdf bib
The ISL Phrase-Based MT System for the 2007 ACL Workshop on Statistical Machine Translation
Matthias Paulik | Kay Rottmann | Jan Niehues | Silja Hildebrand | Stephan Vogel
Proceedings of the Second Workshop on Statistical Machine Translation

pdf bib
The Syntax Augmented MT (SAMT) System at the Shared Task for the 2007 ACL Workshop on Statistical Machine Translation
Andreas Zollmann | Ashish Venugopal | Matthias Paulik | Stephan Vogel
Proceedings of the Second Workshop on Statistical Machine Translation

pdf bib
The CMU TransTac 2007 eyes-free two-way speech-to-speech translation system
Nguyen Bach | Matthias Eck | Paisarn Charoenpornsawat | Thilo Köhler | Sebastian Stüker | ThuyLinh Nguyen | Roger Hsiao | Alex Waibel | Stephan Vogel | Tanja Schultz | Alan W. Black
Proceedings of the Fourth International Workshop on Spoken Language Translation

The paper describes our portable two-way speech-to-speech translation system using a completely eyes-free/hands-free user interface. This system translates between the language pair English and Iraqi Arabic as well as between English and Farsi, and was built within the framework of the DARPA TransTac program. The Farsi language support was developed within a 90-day period, testing our ability to rapidly support new languages. The paper gives an overview of the system’s components along with the individual component objective measures and a discussion of issues relevant for the overall usage of the system. We found that usability, flexibility, and robustness serve as severe constraints on system architecture and design.

pdf bib
The CMU-UKA statistical machine translation systems for IWSLT 2007
Ian Lane | Andreas Zollmann | Thuy Linh Nguyen | Nguyen Bach | Ashish Venugopal | Stephan Vogel | Kay Rottmann | Ying Zhang | Alex Waibel
Proceedings of the Fourth International Workshop on Spoken Language Translation

This paper describes the CMU-UKA statistical machine translation systems submitted to the IWSLT 2007 evaluation campaign. Systems were submitted for three language-pairs: Japanese→English, Chinese→English and Arabic→English. All systems were based on a common phrase-based SMT (statistical machine translation) framework but for each language-pair a specific research problem was tackled. For Japanese→English we focused on two problems: first, punctuation recovery, and second, how to incorporate topic-knowledge into the translation framework. Our Chinese→English submission focused on syntax-augmented SMT and for the Arabic→English task we focused on incorporating morphological-decomposition into the SMT framework. This research strategy enabled us to evaluate a wide variety of approaches which proved effective for the language pairs they were evaluated on.

pdf bib
A Log-Linear Block Transliteration Model based on Bi-Stream HMMs
Bing Zhao | Nguyen Bach | Ian Lane | Stephan Vogel
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference

pdf bib
An Efficient Two-Pass Approach to Synchronous-CFG Driven Statistical MT
Ashish Venugopal | Andreas Zollmann | Stephan Vogel
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference

pdf bib
Translation Model Pruning via Usage Statistics for Statistical Machine Translation
Matthias Eck | Stephan Vogel | Alex Waibel
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers

pdf bib
Enhancing image-based Arabic document translation using noisy channel correction model
Yi Chang | Ying Zhang | Stephan Vogel | Jie Yang
Proceedings of Machine Translation Summit XI: Papers

pdf bib
Estimating phrase pair relevance for translation model pruning
Matthias Eck | Stephan Vogel | Alex Waibel
Proceedings of Machine Translation Summit XI: Papers

pdf bib
Experiments with a noun-phrase driven statistical machine translation system
Sanjika Hewavitharana | Alon Lavie | Stephan Vogel
Proceedings of Machine Translation Summit XI: Papers

pdf bib
Iterative refinement of lexicon and phrasal alignment
Jae Dong Kim | Stephan Vogel
Proceedings of Machine Translation Summit XI: Papers

pdf bib
PanDoRA: a large-scale two-way statistical machine translation system for hand-held devices
Ying Zhang | Stephan Vogel
Proceedings of Machine Translation Summit XI: Papers

2006

pdf bib
The UKA/CMU statistical machine translation system for IWSLT 2006
Matthias Eck | Ian Lane | Nguyen Bach | Sanjika Hewavitharana | Muntsin Kolss | Bing Zhao | Almut Silja Hildebrand | Stephan Vogel | Alex Waibel
Proceedings of the Third International Workshop on Spoken Language Translation: Evaluation Campaign

pdf bib
The CMU-UKA syntax augmented machine translation system for IWSLT-06
Andreas Zollmann | Ashish Venugopal | Stephan Vogel | Alex Waibel
Proceedings of the Third International Workshop on Spoken Language Translation: Evaluation Campaign

pdf bib
Bridging the Inflection Morphology Gap for Arabic Statistical Machine Translation
Andreas Zollmann | Ashish Venugopal | Stephan Vogel
Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers

pdf bib
Distributed Language Modeling for N-best List Re-ranking
Ying Zhang | Almut Silja Hildebrand | Stephan Vogel
Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing

pdf bib
A Flexible Online Server for Machine Translation Evaluation
Matthias Eck | Stephan Vogel | Alex Waibel
Proceedings of the 11th Annual Conference of the European Association for Machine Translation

2005

pdf bib
Statistical Machine Translation Part I: Hands-On Introduction
Stephan Vogel
Companion Volume to the Proceedings of Conference including Posters/Demos and tutorial abstracts

pdf bib
The CMU Statistical Machine Translation System for IWSLT2005
Sanjika Hewavitharana | Bing Zhao | Almut Silja Hildebrand | Matthias Eck | Chiori Hori | Stephan Vogel | Alex Waibel
Proceedings of the Second International Workshop on Spoken Language Translation

pdf bib
Low Cost Portability for Statistical Machine Translation based on N-gram Frequency and TF-IDF
Matthias Eck | Stephan Vogel | Alex Waibel
Proceedings of the Second International Workshop on Spoken Language Translation

pdf bib
Low Cost Portability for Statistical Machine Translation based on N-gram Coverage
Matthias Eck | Stephan Vogel | Alex Waibel
Proceedings of Machine Translation Summit X: Papers

Statistical machine translation relies heavily on the available training data. However, in some cases, it is necessary to limit the amount of training data that can be created for or actually used by the systems. To solve that problem, we introduce a weighting scheme that tries to select more informative sentences first. This selection is based on the previously unseen n-grams the sentences contain, and it allows us to sort the sentences according to their estimated importance. After sorting, we can construct smaller training corpora, and we are able to demonstrate that systems trained on much less training data show a very competitive performance compared to baseline systems using all available training data.

pdf bib
PESA: Phrase Pair Extraction as Sentence Splitting
Stephan Vogel
Proceedings of Machine Translation Summit X: Papers

Most statistical machine translation systems use phrase-to-phrase translations to capture local context information, leading to better lexical choice and more reliable local reordering. The quality of the phrase alignment is crucial to the quality of the resulting translations. Here, we propose a new phrase alignment method, not based on the Viterbi path of word alignment models. Phrase alignment is viewed as a sentence splitting task. For a given splitting of the source sentence (source phrase, left segment, right segment) find a splitting for the target sentence, which optimizes the overall sentence alignment probability. Experiments on different translation tasks show that this phrase alignment method leads to highly competitive translation results.

pdf bib
A Generalized Alignment-Free Phrase Extraction
Bing Zhao | Stephan Vogel
Proceedings of the ACL Workshop on Building and Using Parallel Texts

pdf bib
Competitive Grouping in Integrated Phrase Segmentation and Alignment Model
Ying Zhang | Stephan Vogel
Proceedings of the ACL Workshop on Building and Using Parallel Texts

pdf bib
Augmenting a statistical translation system with a translation memory
Sanjika Hewavitharana | Stephan Vogel | Alex Waibel
Proceedings of the 10th EAMT Conference: Practical applications of machine translation

pdf bib
Adaptation of the translation model for statistical machine translation based on information retrieval
Almut Silja Hildebrand | Matthias Eck | Stephan Vogel | Alex Waibel
Proceedings of the 10th EAMT Conference: Practical applications of machine translation

pdf bib
Considerations in maximum mutual information and minimum classification error training for statistical machine translation
Ashish Venugopal | Stephan Vogel
Proceedings of the 10th EAMT Conference: Practical applications of machine translation

pdf bib
An efficient phrase-to-phrase alignment model for arbitrarily long phrase and large corpora
Ying Zhang | Stephan Vogel
Proceedings of the 10th EAMT Conference: Practical applications of machine translation

pdf bib
Mining Key Phrase Translations from Web Corpora
Fei Huang | Ying Zhang | Stephan Vogel
Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing

2004

pdf bib
A trainable transfer-based MT approach for languages with limited resources
Alon Lavie | Katharina Probst | Erik Peterson | Stephan Vogel | Lori Levin | Ariadna Font-Llitjos | Jaime Carbonell
Proceedings of the 9th EAMT Workshop: Broadening horizons of machine translation and its applications

pdf bib
Language Model Adaptation for Statistical Machine Translation via Structured Query Models
Bing Zhao | Matthias Eck | Stephan Vogel
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

pdf bib
Improving Statistical Machine Translation in the Medical Domain using the Unified Medical Language system
Matthias Eck | Stephan Vogel | Alex Waibel
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

pdf bib
Improving Named Entity Translation Combining Phonetic and Semantic Similarities
Fei Huang | Stephan Vogel | Alex Waibel
Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004

pdf bib
Phrase Pair Rescoring with Term Weighting for Statistical Machine Translation
Bing Zhao | Stephan Vogel | Matthias Eck | Alex Waibel
Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing

pdf bib
The ISL statistical translation system for spoken language translation
Stephan Vogel | Sanjika Hewavitharana | Muntsin Kolss | Alex Waibel
Proceedings of the First International Workshop on Spoken Language Translation: Evaluation Campaign

pdf bib
Toward named entity extraction and translation in spoken language translation
Fei Huang | Stephan Vogel | Alex Waibel
Proceedings of the First International Workshop on Spoken Language Translation: Papers

pdf bib
Language Model Adaptation for Statistical Machine Translation Based on Information Retrieval
Matthias Eck | Stephan Vogel | Alex Waibel
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

pdf bib
Augmenting Manual Dictionaries for Statistical Machine Translation Systems
Stephan Vogel | Christian Monson
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

pdf bib
Interpreting BLEU/NIST Scores: How Much Improvement do We Need to Have a Better System?
Ying Zhang | Stephan Vogel | Alex Waibel
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

pdf bib
Measuring confidence intervals for the machine translation evaluation metrics
Ying Zhang | Stephan Vogel
Proceedings of the 10th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages

2003

pdf bib
The CMU statistical machine translation system
Stephan Vogel | Ying Zhang | Fei Huang | Alicia Tribble | Ashish Venugopal | Bing Zhao | Alex Waibel
Proceedings of Machine Translation Summit IX: Papers

In this paper we describe the components of our statistical machine translation system. This system combines phrase-to-phrase translations extracted from a bilingual corpus using different alignment approaches. Special methods to extract and align named entities are used. We show how a manual lexicon can be incorporated into the statistical system in an optimized way. Experiments on Chinese-to-English and Arabic-to-English translation tasks are presented.

bib
SMTTIDES – and all that
Stephan Vogel
Workshop on Machine Translation for Semitic languages: issues and approaches

bib
The CMU Arabic-to-English statistical MT system
Alicia Tribble | Stephan Vogel
Workshop on Machine Translation for Semitic languages: issues and approaches

pdf bib
Effective Phrase Translation Extraction from Alignment Models
Ashish Venugopal | Stephan Vogel | Alex Waibel
Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics

pdf bib
Using Noisy Bilingual Data for Statistical Machine Translation
Stephan Vogel
10th Conference of the European Chapter of the Association for Computational Linguistics

pdf bib
Word Alignment Based on Bilingual Bracketing
Bing Zhao | Stephan Vogel
Proceedings of the HLT-NAACL 2003 Workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond

pdf bib
Efficient Optimization for Bilingual Sentence Alignment Based on Linear Regression
Bing Zhao | Klaus Zechner | Stephen Vogel | Alex Waibel
Proceedings of the HLT-NAACL 2003 Workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond

pdf bib
Automatic Extraction of Named Entity Translingual Equivalence Based on Multi-Feature Cost Minimization
Fei Huang | Stephan Vogel | Alex Waibel
Proceedings of the ACL 2003 Workshop on Multilingual and Mixed-language Named Entity Recognition

2001

pdf bib
The RWTH System for Statistical Translation of Spoken Dialogues
H. Ney | F. J. Och | S. Vogel
Proceedings of the First International Conference on Human Language Technology Research

2000

pdf bib
Translation with Cascaded Finite State Transducers
Stephan Vogel | Hermann Ney
Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics

pdf bib
Construction of a Hierarchical Translation Memory
S. Vogel | H. Ney
COLING 2000 Volume 2: The 18th International Conference on Computational Linguistics

1998

pdf bib
A DP based Search Algorithm for Statistical Machine Translation
S. Nießen | S. Vogel | H. Ney | C. Tillmann
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 2

pdf bib
A DP based Search Algorithm for Statistical Machine Translation
S. Nießen | S. Vogel | H. Ney | C. Tillmann
COLING 1998 Volume 2: The 17th International Conference on Computational Linguistics

1997

pdf bib
A DP-based Search Using Monotone Alignments in Statistical Translation
Christoph Tillmann | Stephan Vogel | Hermann Ney | Alex Zubiaga
35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics

1996

pdf bib
HMM-Based Word Alignment in Statistical Translation
Stephan Vogel | Hermann Ney | Christoph Tillmann
COLING 1996 Volume 2: The 16th International Conference on Computational Linguistics
