Nam Le Hai

Also published as: Nam Le Hai


2025

On the Impacts of Contexts on Repository-Level Code Generation
Nam Le Hai | Dung Manh Nguyen | Nghi D. Q. Bui
Findings of the Association for Computational Linguistics: NAACL 2025

CodeLLMs are widely used for code generation, yet their ability to handle repository-level dependencies remains underexplored. We introduce RepoExec, a benchmark for evaluating repository-level code generation, focusing on executability, functional correctness, and dependency utilization. Our study evaluates 18 models, revealing that retaining full dependency context yields the best performance, while smaller context sizes can be misleading. Pretrained LLMs excel in correctness but often reimplement dependencies, while instruction-tuned models better utilize dependencies but sometimes introduce unnecessary complexity. We propose an instruction-tuning dataset that improves dependency handling and introduce a new metric, Dependency Invocation Rate (DIR), to measure context utilization. Experiments show that instruction-tuned models improve DIR by over 10%, and multi-round debugging further enhances both correctness and dependency use. RepoExec provides a comprehensive framework to advance CodeLLMs for real-world applications. The dataset and source code are available at https://github.com/FSoft-AI4Code/RepoExec.
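As an illustration of what a dependency-utilization metric such as DIR might measure, here is a minimal sketch. The abstract does not state the formula, so this assumes DIR is the fraction of provided dependencies that the generated code actually references. The function names `invoked_names` and `dependency_invocation_rate`, and the sample inputs, are hypothetical and are not taken from the RepoExec code base.

```python
import ast


def invoked_names(source: str) -> set[str]:
    """Collect identifiers referenced in the code: bare names and attribute tails."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            names.add(node.id)
        elif isinstance(node, ast.Attribute):
            names.add(node.attr)
    return names


def dependency_invocation_rate(generated_code: str, dependencies: list[str]) -> float:
    """Assumed definition: share of the provided dependencies that appear in the code."""
    unique_deps = set(dependencies)
    if not unique_deps:
        return 0.0
    used = invoked_names(generated_code)
    hits = sum(1 for dep in unique_deps if dep in used)
    return hits / len(unique_deps)


# Hypothetical example: three dependencies are offered, two are used.
generated = "def area(r):\n    return PI * square(r)\n"
rate = dependency_invocation_rate(generated, ["PI", "square", "clamp"])
print(f"DIR = {rate:.2f}")
```

This sketch matches names only. It would also count a dependency that the model reimplemented locally under the same name, so a real evaluation would need to resolve where each name is defined.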

Enhancing Discriminative Representation in Similar Relation Clusters for Few-Shot Continual Relation Extraction
Anh Duc Le | Nam Le Hai | Thanh Xuan Nguyen | Linh Ngo Van | Nguyen Thi Ngoc Diep | Sang Dinh | Thien Huu Nguyen
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Few-shot Continual Relation Extraction (FCRE) has emerged as a significant challenge in information extraction, requiring relation extraction (RE) systems to sequentially identify new relations from limited labeled samples. While existing studies have demonstrated promising results in FCRE, they often overlook the issue of similar relations, which is a critical factor contributing to catastrophic forgetting. In this work, we propose Sirus, a novel method that utilizes relation descriptions and dynamic clustering on these descriptions to identify similar relations. Leveraging this information, we introduce loss functions specifically designed to enhance the distinction between relations, with a focus on learning to differentiate similar ones. Experimental results show that our approach can effectively mitigate the problem of catastrophic forgetting and outperforms state-of-the-art methods by a large margin. Additionally, we explore the potential of Large Language Model Embeddings (LLMEs) with representation learning and embedding capabilities, demonstrating their promise for advancing FCRE systems.

Improving Vietnamese-English Cross-Lingual Retrieval for Legal and General Domains
Toan Ngoc Nguyen | Nam Le Hai | Nguyen Doan Hieu | Dai An Nguyen | Linh Ngo Van | Thien Huu Nguyen | Sang Dinh
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)

Document retrieval plays a crucial role in numerous question-answering systems, yet research has concentrated on the general knowledge domain and resource-rich languages like English. In contrast, it remains largely underexplored in low-resource languages and in cross-lingual scenarios within specialized domains such as law. We present a novel dataset designed for cross-lingual retrieval between Vietnamese and English, which not only covers the general domain but also extends to the legal field. Additionally, we propose an auxiliary loss function and a symmetrical training strategy that significantly enhance the performance of state-of-the-art models on these retrieval tasks. Our contributions offer a significant resource and methodology aimed at improving cross-lingual retrieval in both legal and general QA settings, facilitating further advancements in document retrieval research across multiple languages and a broader spectrum of specialized domains. All the resources related to our work can be accessed at huggingface.co/datasets/bkai-foundation-models/crosslingual.

2024

Preserving Generalization of Language models in Few-shot Continual Relation Extraction
Quyen Tran | Nguyen Xuan Thanh | Nguyen Hoang Anh | Nam Le Hai | Trung Le | Linh Van Ngo | Thien Huu Nguyen
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Few-shot Continual Relation Extraction (FCRE) is an emerging and dynamic area of study in which models sequentially integrate knowledge of new relations from limited labeled data while avoiding catastrophic forgetting and preserving prior knowledge from pre-trained backbones. In this work, we introduce a novel method that leverages often-discarded language model heads. By employing these components via a mutual information maximization strategy, our approach helps maintain prior knowledge from the pre-trained backbone and strategically aligns the primary classification head, thereby enhancing model performance. Furthermore, we explore the potential of Large Language Models (LLMs), renowned for their wealth of knowledge, in addressing FCRE challenges. Our comprehensive experimental results underscore the efficacy of the proposed method and offer valuable insights for future work.

2023

CoSPLADE : Adaptation d’un Modèle Neuronal Basé sur des Représentations Parcimonieuses pour la Recherche d’Information Conversationnelle
Nam Le Hai | Thomas Gerald | Thibault Formal | Jian-Yun Nie | Benjamin Piwowarski | Laure Soulier
Actes de CORIA-TALN 2023. Actes de la 18e Conférence en Recherche d'Information et Applications (CORIA)

Conversational search is a task that aims to retrieve documents given the user's current question together with the complete conversation history. Most previous methods rely on a multi-stage approach built on question reformulation. This reformulation step is critical, because it can lead to a suboptimal ranking of documents. Other approaches have tried to rank documents directly, but most of them rely on a dataset containing pseudo-labels. In this work, we propose a training technique that is both "lightweight" and innovative for a contextualized ranking model based on SPLADE. Building on SPLADE's sparse representations, we show that our model, when combined with the T5Mono re-ranking model, achieves results that are competitive with those obtained by participants in the TREC CAsT 2020 and 2021 evaluation campaigns. The source code is available at https://github.com/anonymous.

The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation
Dung Nguyen Manh | Nam Le Hai | Anh T. V. Dau | Anh Minh Nguyen | Khanh Nghiem | Jin Guo | Nghi D. Q. Bui
Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)