Xi Zhou


2025

pdf bib
OpenForecast: A Large-Scale Open-Ended Event Forecasting Dataset
Zhen Wang | Xi Zhou | Yating Yang | Bo Ma | Lei Wang | Rui Dong | Azmat Anwar
Proceedings of the 31st International Conference on Computational Linguistics

Complex events generally exhibit unforeseen, multifaceted, and multi-step developments, and cannot be handled well by existing closed-ended event forecasting methods, which are constrained by a limited answer space. To accelerate research on complex event forecasting, we introduce OpenForecast, a large-scale open-ended dataset with two features: (1) OpenForecast defines three open-ended event forecasting tasks, enabling unforeseen, multifaceted, and multi-step forecasting. (2) OpenForecast collects and annotates a large-scale dataset from Wikipedia and news, comprising 43,419 complex events spanning 1950 to 2024. Notably, this annotation is completed automatically, without any manual annotation cost. We also introduce an automatic LLM-based Retrieval-Augmented Evaluation method (LRAE) for complex events, enabling OpenForecast to assess the complex event forecasting ability of large language models. Finally, we conduct comprehensive human evaluations to verify the quality and challenges of OpenForecast, and the consistency between the LRAE metric and human evaluation. OpenForecast and related code will be publicly released.

pdf bib
Low-Resource Language Expansion and Translation Capacity Enhancement for LLM: A Study on the Uyghur
Kaiwen Lu | Yating Yang | Fengyi Yang | Rui Dong | Bo Ma | Aihetamujiang Aihemaiti | Abibilla Atawulla | Lei Wang | Xi Zhou
Proceedings of the 31st International Conference on Computational Linguistics

Although large language models have significantly advanced natural language generation, their potential in low-resource machine translation has not yet been fully explored, especially for languages the translation models have not been trained on. In this study, we provide a detailed demonstration of how to efficiently add low-resource languages to large language models and significantly enhance the model's translation ability, using Uyghur as an example. The process involves four stages: collecting and pre-processing monolingual data; conducting continued pre-training on extensive monolingual data; fine-tuning on a smaller parallel corpus with translation supervision; and, building on this, proposing direct preference optimization based on translation self-evolution (DPOSE). Extensive experiments show that our strategy effectively expands the set of low-resource languages supported by large language models and significantly enhances the model's translation ability in Uyghur with limited parallel data. Our research provides detailed insights for adding other low-resource languages to large language models.

2021

pdf bib
Relation-aware Video Reading Comprehension for Temporal Language Grounding
Jialin Gao | Xin Sun | Mengmeng Xu | Xi Zhou | Bernard Ghanem
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Temporal language grounding in videos aims to localize the temporal span relevant to a given query sentence. Previous methods treat it either as a boundary regression task or a span extraction task. This paper formulates temporal language grounding as video reading comprehension and proposes a Relation-aware Network (RaNet) to address it. This framework selects a video moment choice from a predefined answer set with the aid of coarse-and-fine choice-query interaction and choice-choice relation construction. A choice-query interactor is proposed to match visual and textual information simultaneously at the sentence-moment and token-moment levels, yielding a coarse-and-fine cross-modal interaction. Moreover, a novel multi-choice relation constructor is introduced, leveraging graph convolution to capture dependencies among video moment choices for best-choice selection. Extensive experiments on ActivityNet-Captions, TACoS, and Charades-STA demonstrate the effectiveness of our solution. Code is available at https://github.com/Huntersxsx/RaNet.

2018

pdf bib
Toward Better Loanword Identification in Uyghur Using Cross-lingual Word Embeddings
Chenggang Mi | Yating Yang | Lei Wang | Xi Zhou | Tonghai Jiang
Proceedings of the 27th International Conference on Computational Linguistics

To enrich the vocabulary of low-resource settings, we propose a novel method that identifies loanwords in monolingual corpora. More specifically, we first use cross-lingual word embeddings as the core feature to generate semantically related candidates based on comparable corpora and a small bilingual lexicon; then, a log-linear model that combines several shallow features, such as pronunciation similarity and hybrid language model features, predicts the final results. In this paper, we use Uyghur as the recipient language and detect loanwords from four donor languages: Arabic, Chinese, Persian, and Russian. We conduct two groups of experiments to evaluate the effectiveness of our proposed approach: loanword identification and OOV translation in four language pairs and eight translation directions (Uyghur-Arabic, Arabic-Uyghur, Uyghur-Chinese, Chinese-Uyghur, Uyghur-Persian, Persian-Uyghur, Uyghur-Russian, and Russian-Uyghur). Experimental results on loanword identification show that our method significantly outperforms the baseline models. Neural machine translation models integrating the results of the loanword identification experiments achieve the best results on OOV translation (with 0.5-0.9 BLEU improvements).
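The two-stage pipeline described in the abstract (embedding-based candidate generation, then a log-linear scorer over shallow features) can be sketched as follows. This is a minimal illustration only: the feature set, the fixed weights, and the sigmoid link are assumptions for demonstration, not the paper's trained model or exact feature definitions.

```python
import math

def cosine(u, v):
    # similarity of two cross-lingual embedding vectors,
    # used to generate semantically related candidates
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv) if nu and nv else 0.0

def edit_distance(a, b):
    # standard Levenshtein distance via dynamic programming
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,
                           dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + cost)
    return dp[m][n]

def pronunciation_similarity(w1, w2):
    # normalized edit similarity over (romanized) word forms;
    # a crude stand-in for the paper's pronunciation feature
    d = edit_distance(w1, w2)
    return 1.0 - d / max(len(w1), len(w2), 1)

def loanword_score(emb_sim, pron_sim, weights=(1.0, 1.0), bias=-1.0):
    # log-linear combination of shallow features:
    # score = sigmoid(w1 * emb_sim + w2 * pron_sim + b)
    # (weights here are hand-set; the paper learns them)
    z = weights[0] * emb_sim + weights[1] * pron_sim + bias
    return 1.0 / (1.0 + math.exp(-z))
```

A candidate word pair with both high embedding similarity and high pronunciation similarity would then receive a high loanword probability, e.g. `loanword_score(cosine(u, v), pronunciation_similarity("radio", "radiyo"))`.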

pdf bib
A Neural Network Based Model for Loanword Identification in Uyghur
Chenggang Mi | Yating Yang | Lei Wang | Xi Zhou | Tonghai Jiang
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf bib
Log-linear Models for Uyghur Segmentation in Spoken Language Translation
Chenggang Mi | Yating Yang | Rui Dong | Xi Zhou | Lei Wang | Xiao Li | Tonghai Jiang
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

To alleviate data sparsity in spoken Uyghur machine translation, we propose a log-linear morphological segmentation approach. Instead of learning the model only from a monolingual annotated corpus, this approach optimizes Uyghur segmentation for spoken translation using both bilingual and monolingual corpora. Our approach relies on several features, including a traditional conditional random field (CRF) feature, a bilingual word alignment feature, and a monolingual suffix-word co-occurrence feature. Experimental results show that our proposed segmentation model for Uyghur spoken translation achieves a 1.6 BLEU improvement over the state-of-the-art baseline.

2016

pdf bib
A Bilingual Discourse Corpus and Its Applications
Yang Liu | Jiajun Zhang | Chengqing Zong | Yating Yang | Xi Zhou
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Existing discourse research focuses only on monolingual settings, and the inconsistency between languages limits the power of discourse theory in multilingual applications such as machine translation. To address this issue, we design and build a bilingual discourse corpus in which we define and annotate bilingual elementary discourse units (BEDUs). The BEDUs are then organized into hierarchical structures. Using this discourse scheme, we have annotated nearly 20K LDC sentences. Finally, we design a bilingual discourse-based method for machine translation evaluation and show the effectiveness of our bilingual discourse annotations.

pdf bib
Recurrent Neural Network Based Loanwords Identification in Uyghur
Chenggang Mi | Yating Yang | Xi Zhou | Lei Wang | Xiao Li | Tonghai Jiang
Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation: Oral Papers