Improved speech-to-text translation with the Fisher and Callhome Spanish-English speech translation corpus

Matt Post, Gaurav Kumar, Adam Lopez, Damianos Karakos, Chris Callison-Burch, Sanjeev Khudanpur


Abstract
Research into the translation of the output of automatic speech recognition (ASR) systems is hindered by the dearth of datasets developed for that explicit purpose. For Spanish-English translation, in particular, most parallel data available exists only in vastly different domains and registers. In order to support research on cross-lingual speech applications, we introduce the Fisher and Callhome Spanish-English Speech Translation Corpus, supplementing existing LDC audio and transcripts with (a) ASR 1-best, lattice, and oracle output produced by the Kaldi recognition system and (b) English translations obtained on Amazon’s Mechanical Turk. The result is a four-way parallel dataset of Spanish audio, transcriptions, ASR lattices, and English translations of approximately 38 hours of speech, with defined training, development, and held-out test sets. We conduct baseline machine translation experiments using models trained on the provided training data, and validate the dataset by corroborating a number of known results in the field, including the utility of in-domain (informal, conversational) training data, increased performance translating lattices (instead of recognizer 1-best output), and the relationship between word error rate and BLEU score.
Anthology ID:
2013.iwslt-papers.14
Volume:
Proceedings of the 10th International Workshop on Spoken Language Translation: Papers
Month:
December 5-6
Year:
2013
Address:
Heidelberg, Germany
Editor:
Joy Ying Zhang
Venue:
IWSLT
SIG:
SIGSLT
URL:
https://aclanthology.org/2013.iwslt-papers.14
Cite (ACL):
Matt Post, Gaurav Kumar, Adam Lopez, Damianos Karakos, Chris Callison-Burch, and Sanjeev Khudanpur. 2013. Improved speech-to-text translation with the Fisher and Callhome Spanish-English speech translation corpus. In Proceedings of the 10th International Workshop on Spoken Language Translation: Papers, Heidelberg, Germany.
Cite (Informal):
Improved speech-to-text translation with the Fisher and Callhome Spanish-English speech translation corpus (Post et al., IWSLT 2013)
PDF:
https://aclanthology.org/2013.iwslt-papers.14.pdf