Document-aligned Japanese-English Conversation Parallel Corpus

Matīss Rikters, Ryokan Ri, Tong Li, Toshiaki Nakazawa


Abstract
Sentence-level (SL) machine translation (MT) has reached acceptable quality for many high-resource languages, but document-level (DL) MT has not: it is difficult to 1) train with small amounts of DL data; and 2) evaluate, as the main methods and data sets focus on SL evaluation. To address the first issue, we present a document-aligned Japanese-English conversation corpus, including balanced, high-quality business conversation data for tuning and testing. As for the second issue, we manually identify the main areas where SL MT fails to produce adequate translations for lack of context. We then create an evaluation set in which these phenomena are annotated to facilitate automatic evaluation of DL systems. We train MT models using our corpus to demonstrate how using context leads to improvements.
Anthology ID:
2020.wmt-1.74
Volume:
Proceedings of the Fifth Conference on Machine Translation
Month:
November
Year:
2020
Address:
Online
Editors:
Loïc Barrault, Ondřej Bojar, Fethi Bougares, Rajen Chatterjee, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Alexander Fraser, Yvette Graham, Paco Guzman, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, André Martins, Makoto Morishita, Christof Monz, Masaaki Nagata, Toshiaki Nakazawa, Matteo Negri
Venue:
WMT
SIG:
SIGMT
Publisher:
Association for Computational Linguistics
Pages:
639–645
URL:
https://aclanthology.org/2020.wmt-1.74
Cite (ACL):
Matīss Rikters, Ryokan Ri, Tong Li, and Toshiaki Nakazawa. 2020. Document-aligned Japanese-English Conversation Parallel Corpus. In Proceedings of the Fifth Conference on Machine Translation, pages 639–645, Online. Association for Computational Linguistics.
Cite (Informal):
Document-aligned Japanese-English Conversation Parallel Corpus (Rikters et al., WMT 2020)
PDF:
https://aclanthology.org/2020.wmt-1.74.pdf
Optional supplementary material:
 2020.wmt-1.74.OptionalSupplementaryMaterial.pdf
Video:
 https://slideslive.com/38939560
Code
 tsuruoka-lab/AMI-Meeting-Parallel-Corpus
Data
Business Scene Dialogue, JParaCrawl