Shanbo Cheng


2024

pdf bib
MT-PATCHER: Selective and Extendable Knowledge Distillation from Large Language Models for Machine Translation
Jiahuan Li | Shanbo Cheng | Shujian Huang | Jiajun Chen
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Large Language Models (LLM) have demonstrated their strong ability in the field of machine translation, yet they suffer from high computational cost and latency. Therefore, transferring translation knowledge from giant LLMs to medium-sized machine translation models is a promising research direction. However, traditional knowledge distillation methods ignore the capability of student and teacher models, therefore repeatedly teaching student models on the knowledge they have learned, and failing to extend to novel contexts and knowledge. In this paper, we propose a framework called MT-Patcher, which transfers knowledge from LLMs to existing MT models in a selective, comprehensive and proactive manner. Considering the current translation ability of student MT models, we only identify and correct their translation errors, instead of distilling the whole translation from the teacher. Leveraging the strong language abilities of LLMs, we instruct LLM teachers to synthesize diverse contexts and anticipate more potential errors for the student. Experiment results on translating both specific language phenomena and general MT benchmarks demonstrate that finetuning the MT model on about 10% examples can achieve comparable results to the traditional knowledge distillation method, and synthesized potential errors and diverse contexts further improve MT performances on unseen contexts and words.

pdf bib
Eliciting the Translation Ability of Large Language Models via Multilingual Finetuning with Translation Instructions
Jiahuan Li | Hao Zhou | Shujian Huang | Shanbo Cheng | Jiajun Chen
Transactions of the Association for Computational Linguistics, Volume 12

Large-scale pretrained language models (LLMs), such as ChatGPT and GPT4, have shown strong abilities in multilingual translation, without being explicitly trained on parallel corpora. It is intriguing how the LLMs obtain their ability to carry out translation instructions for different languages. In this paper, we present a detailed analysis by finetuning a multilingual pretrained language model, XGLM-7.5B, to perform multilingual translation following given instructions. Firstly, we show that multilingual LLMs have stronger translation abilities than previously demonstrated. For a certain language, the translation performance depends on its similarity to English and the amount of data used in the pretraining phase. Secondly, we find that LLMs’ ability to carry out translation instructions relies on the understanding of translation instructions and the alignment among different languages. With multilingual finetuning with translation instructions, LLMs could learn to perform the translation task well even for those language pairs unseen during the instruction tuning phase.

pdf bib
Retaining Key Information under High Compression Ratios: Query-Guided Compressor for LLMs
Zhiwei Cao | Qian Cao | Yu Lu | Ningxin Peng | Luyang Huang | Shanbo Cheng | Jinsong Su
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The growing popularity of Large Language Models has sparked interest in context compression for Large Language Models (LLMs). However, the performance of previous methods degrades dramatically as compression ratios increase, sometimes even falling to the closed-book level. This decline can be attributed to the loss of key information during the compression process. Our preliminary study supports this hypothesis, emphasizing the significance of retaining key information to maintain model performance under high compression ratios. As a result, we introduce Query-Guided Compressor (QGC), which leverages queries to guide the context compression process, effectively preserving key information within the compressed context. Additionally, we employ a dynamic compression strategy. We validate the effectiveness of our proposed QGC on the Question Answering task, including NaturalQuestions, TriviaQA, and HotpotQA datasets. Experimental results show that QGC can consistently perform well even at high compression ratios, which also offers significant benefits in terms of inference cost and throughput.

pdf bib
G-DIG: Towards Gradient-based DIverse and hiGh-quality Instruction Data Selection for Machine Translation
Xingyuan Pan | Luyang Huang | Liyan Kang | Zhicheng Liu | Yu Lu | Shanbo Cheng
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large Language Models (LLMs) have demonstrated remarkable abilities in general scenarios. Instruction finetuning empowers them to align with humans in various tasks. Nevertheless, the Diversity and Quality of the instruction data remain two main challenges for instruction finetuning. With regard to this, in this paper, we propose a novel gradient-based method to automatically select high-quality and diverse instruction finetuning data for machine translation. Our key innovation centers around analyzing how individual training examples influence the model during training. Specifically, we select training examples that exert beneficial influences on the model as high-quality ones by means of Influence Function plus a small high-quality seed dataset. Moreover, to enhance the diversity of the training data we maximize the variety of influences they have on the model by clustering on their gradients and resampling. Extensive experiments on WMT22 and FLORES translation tasks demonstrate the superiority of our methods, and in-depth analysis further validates their effectiveness and generalization.

2023

pdf bib
Controlling Styles in Neural Machine Translation with Activation Prompt
Yifan Wang | Zewei Sun | Shanbo Cheng | Weiguo Zheng | Mingxuan Wang
Findings of the Association for Computational Linguistics: ACL 2023

Controlling styles in neural machine translation (NMT) has attracted wide attention, as it is crucial for enhancing user experience. Earlier studies on this topic typically concentrate on regulating the level of formality and achieve some progress in this area. However, they still encounter two major challenges. The first is the difficulty in style evaluation. The style comprises various aspects such as lexis, syntax, and others that provide abundant information. Nevertheless, only formality has been thoroughly investigated. The second challenge involves excessive dependence on incremental adjustments, particularly when new styles are necessary. To address both challenges, this paper presents a new benchmark and approach. A multiway stylized machine translation (MSMT) benchmark is introduced, incorporating diverse categories of styles across four linguistic domains. Then, we propose a method named style activation prompt (StyleAP) by retrieving prompts from stylized monolingual corpus, which does not require extra fine-tuning. Experiments show that StyleAP could effectively control the style of translation and achieve remarkable performance.

pdf bib
Beyond Triplet: Leveraging the Most Data for Multimodal Machine Translation
Yaoming Zhu | Zewei Sun | Shanbo Cheng | Luyang Huang | Liwei Wu | Mingxuan Wang
Findings of the Association for Computational Linguistics: ACL 2023

Multimodal machine translation (MMT) aims to improve translation quality by incorporating information from other modalities, such as vision. Previous MMT systems focus on better access and use of visual information and tend to validate their methods on image-related datasets. However, these studies face two challenges. First, they can only utilize a limited amount of data that is composed of bilingual texts and images (referred to as “triple data”), which is scarce. Second, current benchmarks for MMT are restricted and do not correspond to realistic scenarios. Therefore, this paper correspondingly establishes new methods and a new dataset for MMT. We propose a novel framework for MMT that addresses these challenges by utilizing large-scale non-triple data, such as monolingual image-text and parallel text-only data. Additionally, we construct a new e-commercial multimodal translation dataset, named EMMT, of which the test set is specifically designed to include ambiguous words that require visual context for accurate translation. Experiments show that our method is well-suited for real-world scenarios and can significantly improve translation performance with more non-triple data. In addition, our model also rivals or surpasses various SOTA models in conventional multimodal translation benchmarks.

pdf bib
BigVideo: A Large-scale Video Subtitle Translation Dataset for Multimodal Machine Translation
Liyan Kang | Luyang Huang | Ningxin Peng | Peihao Zhu | Zewei Sun | Shanbo Cheng | Mingxuan Wang | Degen Huang | Jinsong Su
Findings of the Association for Computational Linguistics: ACL 2023

We present a large-scale video subtitle translation dataset, *BigVideo*, to facilitate the study of multi-modality machine translation. Compared with the widely used *How2* and *VaTeX* datasets, *BigVideo* is more than 10 times larger, consisting of 4.5 million sentence pairs and 9,981 hours of videos. We also introduce two deliberately designed test sets to verify the necessity of visual information: *Ambiguous* with the presence of ambiguous words, and *Unambiguous* in which the text context is self-contained for translation. To better model the common semantics shared across texts and videos, we introduce a contrastive learning method in the cross-modal encoder. Extensive experiments on the *BigVideo* shows that: a) Visual information consistently improves the NMT model in terms of BLEU, BLEURT and COMET on both Ambiguous and Unambiguous test sets. b) Visual information helps disambiguation, compared to the strong text baseline on terminology-targeted scores and human evaluation.

pdf bib
Only 5% Attention Is All You Need: Efficient Long-range Document-level Neural Machine Translation
Zihan Liu | Zewei Sun | Shanbo Cheng | Shujian Huang | Mingxuan Wang
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

2021

pdf bib
The Volctrans GLAT System: Non-autoregressive Translation Meets WMT21
Lihua Qian | Yi Zhou | Zaixiang Zheng | Yaoming Zhu | Zehui Lin | Jiangtao Feng | Shanbo Cheng | Lei Li | Mingxuan Wang | Hao Zhou
Proceedings of the Sixth Conference on Machine Translation

This paper describes the Volctrans’ submission to the WMT21 news translation shared task for German->English translation. We build a parallel (i.e., non-autoregressive) translation system using the Glancing Transformer, which enables fast and accurate parallel decoding in contrast to the currently prevailing autoregressive models. To the best of our knowledge, this is the first parallel translation system that can be scaled to such a practical scenario like WMT competition. More importantly, our parallel translation system achieves the best BLEU score (35.0) on German->English translation task, outperforming all strong autoregressive counterparts.

pdf bib
Learning Kernel-Smoothed Machine Translation with Retrieved Examples
Qingnan Jiang | Mingxuan Wang | Jun Cao | Shanbo Cheng | Shujian Huang | Lei Li
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

How to effectively adapt neural machine translation (NMT) models according to emerging cases without retraining? Despite the great success of neural machine translation, updating the deployed models online remains a challenge. Existing non-parametric approaches that retrieve similar examples from a database to guide the translation process are promising but are prone to overfit the retrieved examples. However, non-parametric methods are prone to overfit the retrieved examples. In this work, we propose to learn Kernel-Smoothed Translation with Example Retrieval (KSTER), an effective approach to adapt neural machine translation models online. Experiments on domain adaptation and multi-domain machine translation datasets show that even without expensive retraining, KSTER is able to achieve improvement of 1.1 to 1.5 BLEU scores over the best existing online adaptation methods. The code and trained models are released at https://github.com/jiangqn/KSTER.

pdf bib
Language Tags Matter for Zero-Shot Neural Machine Translation
Liwei Wu | Shanbo Cheng | Mingxuan Wang | Lei Li
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

2020

pdf bib
Language-aware Interlingua for Multilingual Neural Machine Translation
Changfeng Zhu | Heng Yu | Shanbo Cheng | Weihua Luo
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Multilingual neural machine translation (NMT) has led to impressive accuracy improvements in low-resource scenarios by sharing common linguistic information across languages. However, the traditional multilingual model fails to capture the diversity and specificity of different languages, resulting in inferior performance compared with individual models that are sufficiently trained. In this paper, we incorporate a language-aware interlingua into the Encoder-Decoder architecture. The interlingual network enables the model to learn a language-independent representation from the semantic spaces of different languages, while still allowing for language-specific specialization of a particular language-pair. Experiments show that our proposed method achieves remarkable improvements over state-of-the-art multilingual NMT baselines and produces comparable performance with strong individual models.

2018

pdf bib
Alibaba’s Neural Machine Translation Systems for WMT18
Yongchao Deng | Shanbo Cheng | Jun Lu | Kai Song | Jingang Wang | Shenglan Wu | Liang Yao | Guchun Zhang | Haibo Zhang | Pei Zhang | Changfeng Zhu | Boxing Chen
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

This paper describes the submission systems of Alibaba for WMT18 shared news translation task. We participated in 5 translation directions including English ↔ Russian, English ↔ Turkish in both directions and English → Chinese. Our systems are based on Google’s Transformer model architecture, into which we integrated the most recent features from the academic research. We also employed most techniques that have been proven effective during the past WMT years, such as BPE, back translation, data selection, model ensembling and reranking, at industrial scale. For some morphologically-rich languages, we also incorporated linguistic knowledge into our neural network. For the translation tasks in which we have participated, our resulting systems achieved the best case sensitive BLEU score in all 5 directions. Notably, our English → Russian system outperformed the second reranked system by 5 BLEU score.

2017

pdf bib
Sogou Neural Machine Translation Systems for WMT17
Yuguang Wang | Shanbo Cheng | Liyang Jiang | Jiajun Yang | Wei Chen | Muze Li | Lin Shi | Yanfeng Wang | Hongtao Yang
Proceedings of the Second Conference on Machine Translation

2016

pdf bib
PRIMT: A Pick-Revise Framework for Interactive Machine Translation
Shanbo Cheng | Shujian Huang | Huadong Chen | Xin-Yu Dai | Jiajun Chen
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies