Xiaosong Qiao


2024

HW-TSC 2024 Submission for the SemEval-2024 Task 1: Semantic Textual Relatedness (STR)
Mengyao Piao | Su Chang | Yuang Li | Xiaosong Qiao | Xiaofeng Zhao | Yinglu Li | Min Zhang | Hao Yang
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

The degree of semantic relatedness between two units of language has long been considered fundamental to understanding meaning. In this paper, we present the system of Huawei Translation Services Center (HW-TSC) for Task 1 of SemEval-2024, which aims to automatically measure the semantic relatedness of sentence pairs in African and Asian languages. The task dataset covers about 14 languages, which originate from five distinct language families and are predominantly spoken in Africa and Asia. For this shared task, we describe our proposed solutions, including the ideas and implementation steps, as well as the outcomes of each experiment on the development dataset. To enhance performance, we leverage these experimental outcomes to construct an ensemble system. Our results demonstrate that our system performs strongly on the test datasets of unsupervised Track B and ranked first for Punjabi.
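As a minimal sketch of the kind of ensembling described above (assumed here to be a simple average of per-model relatedness scores; the exact combination scheme is not stated in the abstract):

# Average relatedness scores from several models; the scorer functions are
# hypothetical placeholders for the individual systems built in the experiments.
from statistics import mean

def ensemble_relatedness(sentence_pair, scorers):
    """Average the [0, 1] relatedness scores produced by each model."""
    scores = [scorer(*sentence_pair) for scorer in scorers]
    return mean(scores)

# Usage with two hypothetical scorers, e.g. an embedding-cosine scorer and a
# fine-tuned cross-encoder:
# score = ensemble_relatedness(("sentence a", "sentence b"),
#                              [embedding_cosine_score, cross_encoder_score])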

HW-TSC at SemEval-2024 Task 5: Self-Eval? A Confident LLM System for Auto Prediction and Evaluation for the Legal Argument Reasoning Task
Xiaofeng Zhao | Xiaosong Qiao | Kaiwen Ou | Min Zhang | Su Chang | Mengyao Piao | Yuang Li | Yinglu Li | Ming Zhu | Yilun Liu
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

In this article, we present an effective system for SemEval-2024 Task 5. The task involves assessing the feasibility of a given solution in civil litigation cases based on relevant legal provisions, issues, solutions, and analysis, and it demands a high level of proficiency in U.S. law and natural language reasoning. For this task, we designed a self-eval LLM system that performs reasoning and self-assessment simultaneously. We created a confidence interval and a prompt instructing the LLM to output the answer to a question along with its confidence level, and we designed a series of experiments to demonstrate the effectiveness of the self-eval mechanism. To reduce the randomness of the results, the final prediction is obtained by voting over three outputs generated by GPT-4. Our submission was conducted under a zero-resource setting, and we achieved first place in the task with an F1-score of 0.8231 and an accuracy of 0.8673.
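A minimal sketch of the self-eval prompting and voting flow described above; the prompt wording is an illustrative assumption, and call_gpt4 is a hypothetical helper standing in for the actual API call:

# Self-eval prompting plus majority voting. The prompt text is illustrative,
# not the exact prompt used in the submission.
import re
from collections import Counter

PROMPT = (
    "You are a U.S. law expert. Given the case below, decide whether the "
    "proposed solution is correct (Yes/No) and report your confidence as an "
    "integer from 0 to 100.\n{case}\nAnswer: <Yes/No>, Confidence: <0-100>"
)

def predict_once(case, call_gpt4):
    reply = call_gpt4(PROMPT.format(case=case))
    match = re.search(r"(Yes|No).*?(\d{1,3})", reply, re.IGNORECASE | re.DOTALL)
    answer, confidence = match.group(1).capitalize(), int(match.group(2))
    return answer, confidence

def predict_with_voting(case, call_gpt4, n_votes=3):
    answers = [predict_once(case, call_gpt4)[0] for _ in range(n_votes)]
    return Counter(answers).most_common(1)[0][0]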

CB-Whisper: Contextual Biasing Whisper Using Open-Vocabulary Keyword-Spotting
Yuang Li | Yinglu Li | Min Zhang | Chang Su | Jiawei Yu | Mengyao Piao | Xiaosong Qiao | Miaomiao Ma | Yanqing Zhao | Hao Yang
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

End-to-end automatic speech recognition (ASR) systems often struggle to recognize rare named entities, such as personal names, organizations and terminologies that are not frequently encountered in the training data. This paper presents Contextual Biasing Whisper (CB-Whisper), a novel ASR system based on OpenAI’s Whisper model that can recognize user-defined named entities by performing open-vocabulary keyword-spotting (KWS) before the decoder. The KWS module leverages text-to-speech (TTS) techniques and a convolutional neural network (CNN) classifier to match the features of the entities against those of the utterances. To integrate the recognized entities into the Whisper decoder and avoid hallucinations, we carefully crafted multiple prompts with spoken-form hints. Experiments show that the KWS module based on the Whisper encoder’s features can recognize unseen user-defined keywords effectively. More importantly, the proposed CB-Whisper substantially improves the mixed error rate (MER) and entity recall compared to the original Whisper model on three internal datasets and two publicly available datasets, Aishell and ACL, which cover English-only, Chinese-only, and code-switching scenarios.
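A minimal sketch of the prompt-based biasing step, using the open-source whisper package; the prompt wording is an assumption, and spot_keywords stands in for the paper's KWS module (TTS features plus a CNN classifier):

# Bias Whisper toward user-defined entities by passing spotted entities as the
# decoder's initial prompt. spot_keywords is a hypothetical KWS stand-in.
import whisper

def transcribe_with_bias(audio_path, user_entities, spot_keywords):
    model = whisper.load_model("small")
    # KWS step (hypothetical): keep only entities actually spotted in the audio.
    spotted = spot_keywords(audio_path, user_entities)
    prompt = "The audio may mention: " + ", ".join(spotted) if spotted else None
    result = model.transcribe(audio_path, initial_prompt=prompt)
    return result["text"]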

HW-TSC at TextGraphs-17 Shared Task: Enhancing Inference Capabilities of LLMs with Knowledge Graphs
Wei Tang | Xiaosong Qiao | Xiaofeng Zhao | Min Zhang | Chang Su | Yuang Li | Yinglu Li | Yilun Liu | Feiyu Yao | Shimin Tao | Hao Yang | He Xianghui
Proceedings of TextGraphs-17: Graph-based Methods for Natural Language Processing

In this paper, we present an effective method for the TextGraphs-17 Shared Task. The task requires selecting, from a set of candidate entities, the entity that is relevant to a given question and answer; the selection is aided by the shortest-path graph in the knowledge graph connecting entities in the query to the candidate entity. The task thus explores how to enhance LLM outputs with KGs: although current LLMs have certain logical reasoning capabilities, they may not be certain about their own outputs, and the answers they produce may be correct by chance through incorrect paths. To address this, we introduce an LLM prompt design strategy based on self-ranking and emotion. Specifically, we let the large model score its own answer choices to reflect its confidence in each answer, and we add emotional incentives to the prompts to encourage the model to examine the questions carefully. Our submission was conducted under a zero-resource setting, and we achieved second place in the task with an F1-score of 0.8321.
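A minimal sketch of the self-ranking idea with an emotional incentive in the prompt; call_llm is a hypothetical helper, and the prompt wording is illustrative rather than the exact text used in the submission:

# Ask the model to score each candidate entity, then pick the highest-scoring one.
import re

def rank_candidates(question, answer, candidates, call_llm):
    scores = {}
    for entity in candidates:
        prompt = (
            "This decision is very important to my career, so please examine "
            "it carefully.\n"
            f"Question: {question}\nAnswer: {answer}\n"
            f"How relevant is the entity '{entity}' on a scale of 0-10? "
            "Reply with a single number."
        )
        reply = call_llm(prompt)
        scores[entity] = int(re.search(r"\d+", reply).group())
    return max(scores, key=scores.get)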

2023

SmartSpanNER: Making SpanNER Robust in Low Resource Scenarios
Min Zhang | Xiaosong Qiao | Yanqing Zhao | Shimin Tao | Hao Yang
Findings of the Association for Computational Linguistics: EMNLP 2023

Named Entity Recognition (NER) is one of the most fundamental tasks in natural language processing. Span-level prediction (SpanNER) is more naturally suited to nested NER than sequence labeling (SeqLab). However, according to our experiments, the SpanNER method is more sensitive to the amount of training data: the F1 score of SpanNER drops much more than that of SeqLab when the amount of training data shrinks. To improve the robustness of SpanNER in low-resource scenarios, we propose a simple and effective method, SmartSpanNER, which introduces a Named Entity Head (NEH) prediction task to SpanNER and performs multi-task learning together with the span classification task. Experimental results demonstrate that SmartSpanNER greatly improves the robustness of SpanNER in low-resource scenarios constructed on the CoNLL03, Few-NERD, GENIA and ACE05 standard benchmark datasets.
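An illustrative sketch of the multi-task objective combining span classification with NEH prediction; the head architectures and the loss weight are assumptions, not the paper's exact configuration:

# Multi-task heads: span classification plus token-level Named Entity Head (NEH)
# prediction, trained with a weighted sum of the two cross-entropy losses.
import torch
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    def __init__(self, hidden_size, num_span_labels, neh_weight=0.5):
        super().__init__()
        self.span_classifier = nn.Linear(2 * hidden_size, num_span_labels)
        self.neh_classifier = nn.Linear(hidden_size, 2)  # token is / is not an entity head
        self.neh_weight = neh_weight
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, token_repr, span_repr, span_labels, neh_labels):
        span_logits = self.span_classifier(span_repr)    # (num_spans, num_labels)
        neh_logits = self.neh_classifier(token_repr)     # (seq_len, 2)
        loss = (self.loss_fn(span_logits, span_labels)
                + self.neh_weight * self.loss_fn(neh_logits, neh_labels))
        return loss, span_logits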

Leveraging Multilingual Knowledge Graph to Boost Domain-specific Entity Translation of ChatGPT
Min Zhang | Limin Liu | Zhao Yanqing | Xiaosong Qiao | Su Chang | Xiaofeng Zhao | Junhao Zhu | Ming Zhu | Song Peng | Yinglu Li | Yilun Liu | Wenbing Ma | Mengyao Piao | Shimin Tao | Hao Yang | Yanfei Jiang
Proceedings of Machine Translation Summit XIX, Vol. 2: Users Track

Recently, ChatGPT has shown promising results for Machine Translation (MT) in general domains and is becoming a new paradigm for translation. In this paper, we focus on how to apply ChatGPT to domain-specific translation and propose to leverage a Multilingual Knowledge Graph (MKG) to help ChatGPT improve the translation quality of domain entities. To achieve this, we extract bilingual entity pairs from the MKG for the domain entities recognized in source sentences. We then introduce these pairs into translation prompts, instructing ChatGPT to use the correct translations of the domain entities. To evaluate the novel MKG method for ChatGPT, we conduct comparative experiments on three Chinese-English (zh-en) test datasets constructed from three specific domains: one from biomedical science and two from the Information and Communications Technology (ICT) industry, namely the Visible Light Communication (VLC) and wireless domains. Experimental results demonstrate that both the overall translation quality of ChatGPT (+6.21, +3.13 and +11.25 in BLEU scores) and the translation accuracy of domain entities (+43.2%, +30.2% and +37.9% absolute points) are significantly improved with MKG on the three test datasets.
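A minimal sketch of how MKG entity pairs can be injected into a translation prompt; the prompt template is an assumption, and entity recognition plus MKG lookup are represented by a pre-computed dictionary:

# Build a zh-en translation prompt that constrains domain entity translations.
def build_mt_prompt(source_sentence, entity_pairs):
    """entity_pairs: {zh_entity: en_translation} extracted from the MKG."""
    constraints = "; ".join(f"'{zh}' must be translated as '{en}'"
                            for zh, en in entity_pairs.items())
    return (
        "Translate the following Chinese sentence into English. "
        f"Use these term translations: {constraints}.\n"
        f"Chinese: {source_sentence}\nEnglish:"
    )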

Empowering a Metric with LLM-assisted Named Entity Annotation: HW-TSC’s Submission to the WMT23 Metrics Shared Task
Zhanglin Wu | Yilun Liu | Min Zhang | Xiaofeng Zhao | Junhao Zhu | Ming Zhu | Xiaosong Qiao | Jingfei Zhang | Ma Miaomiao | Zhao Yanqing | Song Peng | Shimin Tao | Hao Yang | Yanfei Jiang
Proceedings of the Eighth Conference on Machine Translation

This paper presents the submission of Huawei Translation Service Center (HW-TSC) to the WMT23 metrics shared task, in which we submit two metrics: KG-BERTScore and HWTSC-EE-Metric. KG-BERTScore is our primary submission for the reference-free metric and can provide both segment-level and system-level scoring, while HWTSC-EE-Metric is our primary submission for the reference-based metric and provides only system-level scoring. Overall, our metrics show relatively high correlations with MQM scores on the metrics tasks of previous years. Especially on system-level scoring tasks, our metrics achieve a new state of the art for many language pairs.

HW-TSC at SemEval-2023 Task 7: Exploring the Natural Language Inference Capabilities of ChatGPT and Pre-trained Language Model for Clinical Trial
Xiaofeng Zhao | Min Zhang | Miaomiao Ma | Chang Su | Yilun Liu | Minghan Wang | Xiaosong Qiao | Jiaxin Guo | Yinglu Li | Wenbing Ma
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

In this paper, we describe our multi-strategy system for SemEval-2023 Task 7. The task aims to determine whether a given statement is supported by one or two Clinical Trial reports, and to identify the evidence that supports the statement; it requires strong natural language inference capabilities. In Subtask 1, we compare our strategy based on prompt learning and ChatGPT with a baseline built on BERT in a zero-shot setting, and validate the effectiveness of our strategy. In Subtask 2, we fine-tune DeBERTaV3 for classification without relying on the results from Subtask 1, and we observe that early stopping effectively prevents the model from overfitting, which works well for this subtask. In addition, we did not use any ensemble strategies. Ultimately, we achieved 10th place in Subtask 1 and 2nd place in Subtask 2.
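A minimal fine-tuning sketch in the spirit of the Subtask 2 setup, using Hugging Face Transformers with early stopping; the dataset variables, model size and hyperparameters are illustrative assumptions:

# Fine-tune DeBERTaV3 for binary classification with early stopping on the dev loss.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments, EarlyStoppingCallback)

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-large")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-large", num_labels=2)

args = TrainingArguments(
    output_dir="deberta-clinical-nli",
    evaluation_strategy="epoch",        # evaluate once per epoch
    save_strategy="epoch",
    load_best_model_at_end=True,        # required for early stopping
    metric_for_best_model="eval_loss",
    num_train_epochs=10,
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,        # assumed pre-tokenized datasets
    eval_dataset=dev_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()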

2022

HW-TSC at SemEval-2022 Task 3: A Unified Approach Fine-tuned on Multilingual Pretrained Model for PreTENS
Yinglu Li | Min Zhang | Xiaosong Qiao | Minghan Wang
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

In this paper, we describe a unified system for Task 3 of SemEval-2022. The task aims to recognize the semantic structure of sentences given two nominal arguments and to evaluate the degree of their taxonomic relation. We adopt the strategy of adding a language prefix tag to each training example, which proves effective for the model, and we split the training set to prevent translation information from being learnt by the model. For the task, we propose a unified model fine-tuned on the multilingual pretrained model XLM-RoBERTa. The model performs well in subtask 1 (the binary classification subtask). To verify whether our model could also perform well in subtask 2 (the regression subtask), the ranking score is transformed into classification labels by an up-sampling strategy. With an ensemble strategy, the performance of our model can be further improved. As a result, the model obtained second place in both subtask 1 and subtask 2 in the competition evaluation.
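A minimal sketch of the language-prefix-tag idea: prepend a language marker to each example before tokenization with XLM-RoBERTa. The tag format shown is an assumption:

# Prepend a language tag before tokenizing with XLM-RoBERTa.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def encode_with_prefix(sentence, lang):
    tagged = f"[{lang.upper()}] {sentence}"     # e.g. "[IT] Mi piacciono i cani ..."
    return tokenizer(tagged, truncation=True, return_tensors="pt")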

HW-TSC at SemEval-2022 Task 7: Ensemble Model Based on Pretrained Models for Identifying Plausible Clarifications
Xiaosong Qiao | Yinglu Li | Min Zhang | Minghan Wang | Hao Yang | Shimin Tao | Qin Ying
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

This paper describes our system for the task of Identifying Plausible Clarifications of Implicit and Underspecified Phrases. The task is set up as an English cloze task, in which clarifications are presented as possible fillers and systems have to score how plausibly each filler fits a given context. For this shared task, we propose our own solutions, including supervised approaches and unsupervised approaches with pretrained models, and we then combine these models into an ensemble. Finally, we obtain the 2nd-best result in Subtask 1, a classification task, and the 3rd-best result in Subtask 2, a regression task.
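An illustrative unsupervised filler scorer of the kind such a cloze setup admits: rate each candidate by the language-model loss of the completed sentence. This is one possible pretrained-model approach, not necessarily the exact one used in the submission; the blank marker is assumed to be a run of underscores:

# Score each filler by how plausible the completed sentence looks to GPT-2.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def plausibility(context_with_blank, filler):
    sentence = context_with_blank.replace("______", filler)  # assumed blank marker
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return -loss.item()   # higher = more plausible

# Rank candidate fillers for one instance:
# best = max(fillers, key=lambda f: plausibility(context, f))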

The HW-TSC’s Offline Speech Translation System for IWSLT 2022 Evaluation
Yinglu Li | Minghan Wang | Jiaxin Guo | Xiaosong Qiao | Yuxia Wang | Daimeng Wei | Chang Su | Yimeng Chen | Min Zhang | Shimin Tao | Hao Yang | Ying Qin
Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022)

This paper describes HW-TSC’s Offline Speech Translation System submitted to the IWSLT 2022 evaluation. We explored both cascade and end-to-end systems on three language tracks (en-de, en-zh and en-ja), and we chose the cascade system as our primary submission. For the automatic speech recognition (ASR) part of the cascade system, we trained three ASR models (Conformer, S2T-Transformer and U2) on a mixture of five datasets. During inference, transcripts are generated with the help of a domain-controlled generation strategy, and context-aware reranking together with an ensemble-based anti-interference strategy is proposed to produce better ASR outputs. For the machine translation part, we pretrained three translation models on the WMT21 dataset and fine-tuned them on in-domain corpora. Our cascade system shows competitive performance compared with the known offline systems in industry and academia.
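One way to realize context-aware reranking of ASR candidates is to score each beam hypothesis with a language model conditioned on the previously accepted transcript; the sketch below is an illustration of that idea with GPT-2, not the paper's exact reranker:

# Rerank ASR beam candidates by their LM score given the preceding transcript.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def rerank(previous_context, beam_candidates):
    def score(candidate):
        ids = tokenizer(previous_context + " " + candidate,
                        return_tensors="pt").input_ids
        with torch.no_grad():
            return -lm(ids, labels=ids).loss.item()   # higher = more fluent in context
    return max(beam_candidates, key=score)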

The HW-TSC’s Simultaneous Speech Translation System for IWSLT 2022 Evaluation
Minghan Wang | Jiaxin Guo | Yinglu Li | Xiaosong Qiao | Yuxia Wang | Zongyao Li | Chang Su | Yimeng Chen | Min Zhang | Shimin Tao | Hao Yang | Ying Qin
Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022)

This paper presents our work in the IWSLT 2022 simultaneous speech translation evaluation. For the text-to-text (T2T) track, we participate in three language pairs and build a wait-k based simultaneous MT (SimulMT) model for the task. The model was pretrained on WMT21 news corpora and further improved with in-domain fine-tuning and self-training. For the speech-to-text (S2T) track, we designed both cascade and end-to-end systems for three language pairs. The cascade system is composed of a chunking-based streaming ASR model and the SimulMT model used in the T2T track. The end-to-end system is a simultaneous speech translation (SimulST) model based on the wait-k strategy, trained directly on a synthetic corpus produced by translating all texts of the ASR corpora into the target language with an offline MT model. It also uses a heuristic sentence-breaking strategy that prevents it from finishing the translation before the end of the speech. We evaluate our systems on the MuST-C tst-COMMON dataset and show that the end-to-end system is competitive with the cascade one. Meanwhile, we also demonstrate that the SimulMT model can be efficiently optimized by these approaches, resulting in improvements of 1-2 BLEU points.
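A minimal sketch of the wait-k read/write policy underlying such SimulMT/SimulST systems: read k source tokens first, then alternate one write per read, and flush the remaining target tokens once the source ends. The decoding step translate_next is a hypothetical stub for the model:

# Simulate wait-k decoding over a token stream.
def wait_k_decode(source_stream, translate_next, k=3, max_len=200):
    source, target = [], []
    for token in source_stream:                 # READ one source token at a time
        source.append(token)
        if len(source) >= k:                    # after the initial wait of k tokens
            target.append(translate_next(source, target))   # WRITE one target token
    while len(target) < max_len and (not target or target[-1] != "</s>"):
        target.append(translate_next(source, target))       # flush once source ends
    return target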

The HW-TSC’s Speech to Speech Translation System for IWSLT 2022 Evaluation
Jiaxin Guo | Yinglu Li | Minghan Wang | Xiaosong Qiao | Yuxia Wang | Hengchao Shang | Chang Su | Yimeng Chen | Min Zhang | Shimin Tao | Hao Yang | Ying Qin
Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022)

This paper presents HW-TSC’s pipeline and results for the Offline Speech-to-Speech Translation task at IWSLT 2022. We design a cascade system consisting of an ASR model, a machine translation model and a TTS model to convert speech from one language into another (en-de). For the ASR part, we find that better performance can be obtained by ensembling multiple heterogeneous ASR models and reranking the beam candidates. We also find that combining a context-aware reranking strategy with an MT model fine-tuned on the in-domain dataset helps improve performance, because it mitigates the inconsistencies in transcripts caused by the lack of context. Finally, we use the officially provided VITS model to synthesize audio from the translation hypotheses.
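A structural sketch of the speech-to-speech cascade described above (ASR, then MT, then TTS); all three components are hypothetical callables standing in for the ensembled ASR models, the fine-tuned MT model, and the official VITS model:

# Chain the three cascade stages for one English utterance.
def speech_to_speech(audio_en, asr, mt, tts_vits):
    transcript = asr(audio_en)        # English transcript (ensemble + reranking)
    translation = mt(transcript)      # German translation
    return tts_vits(translation)      # German waveform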

Partial Could Be Better than Whole. HW-TSC 2022 Submission for the Metrics Shared Task
Yilun Liu | Xiaosong Qiao | Zhanglin Wu | Su Chang | Min Zhang | Yanqing Zhao | Song Peng | Shimin Tao | Hao Yang | Ying Qin | Jiaxin Guo | Minghan Wang | Yinglu Li | Peng Li | Xiaofeng Zhao
Proceedings of the Seventh Conference on Machine Translation (WMT)

In this paper, we present the contribution of HW-TSC to the WMT 2022 Metrics Shared Task. We propose one reference-based metric, HWTSC-EE-BERTScore*, and four reference-free metrics: HWTSC-Teacher-Sim, HWTSC-TLM, KG-BERTScore and CROSS-QE. Among these metrics, HWTSC-Teacher-Sim and CROSS-QE are supervised, whereas HWTSC-EE-BERTScore*, HWTSC-TLM and KG-BERTScore are unsupervised. We use these metrics in both the segment-level and system-level tracks. Overall, our systems achieve strong results for all language pairs on previous test sets and a new state of the art in many system-level cases.