Multilingual Seq2seq Training with Similarity Loss for Cross-Lingual Document Classification

Katherine Yu, Haoran Li, Barlas Oguz


Abstract
In this paper we continue experiments in which neural machine translation training is used to produce joint cross-lingual fixed-dimensional sentence embeddings. Within this framework we introduce a simple method of adding a loss to the learning objective that penalizes the distance between the representations of bilingually aligned sentences. We evaluate cross-lingual transfer in two ways: cross-lingual similarity search on an aligned corpus (Europarl) and cross-lingual document classification on a recently published benchmark Reuters corpus, and we find that the similarity loss significantly improves performance on both. Furthermore, while our Reuters results are very competitive, our English results are less so, indicating room for improvement in the current cross-lingual state of the art. Our results are based on a set of 6 European languages.
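The objective described in the abstract combines the standard seq2seq cross-entropy with a penalty on the distance between the fixed-dimensional encoder embeddings of an aligned sentence pair. Below is a minimal PyTorch sketch of that combination; the choice of L2 distance, the pooled sentence representations, and the weighting coefficient lambda_sim are illustrative assumptions, not necessarily the paper's exact formulation.

    # Sketch: NMT cross-entropy plus a similarity penalty on aligned
    # sentence embeddings. lambda_sim and the L2 (MSE) distance are
    # assumptions for illustration.
    import torch
    import torch.nn.functional as F

    def combined_loss(logits, targets, src_repr, tgt_repr, lambda_sim=1.0):
        # logits: (batch, seq_len, vocab); targets: (batch, seq_len)
        nmt_loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
        # src_repr / tgt_repr: fixed-dimensional sentence embeddings
        # (e.g., pooled encoder states) of a bilingually aligned pair.
        sim_loss = F.mse_loss(src_repr, tgt_repr)
        return nmt_loss + lambda_sim * sim_loss

Training then minimizes combined_loss over parallel data, so the encoder is pushed to map a sentence and its translation to nearby points in the shared embedding space while still supporting translation.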
Anthology ID:
W18-3023
Volume:
Proceedings of the Third Workshop on Representation Learning for NLP
Month:
July
Year:
2018
Address:
Melbourne, Australia
Editors:
Isabelle Augenstein, Kris Cao, He He, Felix Hill, Spandana Gella, Jamie Kiros, Hongyuan Mei, Dipendra Misra
Venue:
RepL4NLP
SIG:
SIGREP
Publisher:
Association for Computational Linguistics
Pages:
175–179
URL:
https://aclanthology.org/W18-3023
DOI:
10.18653/v1/W18-3023
Cite (ACL):
Katherine Yu, Haoran Li, and Barlas Oguz. 2018. Multilingual Seq2seq Training with Similarity Loss for Cross-Lingual Document Classification. In Proceedings of the Third Workshop on Representation Learning for NLP, pages 175–179, Melbourne, Australia. Association for Computational Linguistics.
Cite (Informal):
Multilingual Seq2seq Training with Similarity Loss for Cross-Lingual Document Classification (Yu et al., RepL4NLP 2018)
PDF:
https://aclanthology.org/W18-3023.pdf