2024
Findings of the Association for Computational Linguistics: NAACL 2024
Kevin Duh | Helena Gomez | Steven Bethard
iimasNLP at SemEval-2024 Task 8: Unveiling structure-aware language models for automatic generated text identification
Andric Valdez | Fernando Márquez | Jorge Pantaleón | Helena Gómez | Gemma Bel-Enguix
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
Large language models (LLMs) are artificial intelligence systems that can generate text, translate languages, and answer questions in a human-like way. While these advances are impressive, there is concern that LLMs could also be used to generate fake or misleading content. In this work, as part of our participation in SemEval-2024 Task 8, we investigate the ability of LLMs to identify whether a given text was written by a human or by a specific AI. We believe that human and machine writing style patterns differ from each other, so integrating features at different language levels can help in this classification task. For this reason, we evaluate several LLMs that aim to extract valuable multilevel information (such as lexical, semantic, and syntactic) from the text during their training process. Our best scores on Subtask A (monolingual) and Subtask B were 71.5% and 38.2% in accuracy, respectively (both using the ConvBERT LLM); for both subtasks, the baseline (RoBERTa) achieved an accuracy of 74%.
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Kevin Duh | Helena Gomez | Steven Bethard
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)
Kevin Duh | Helena Gomez | Steven Bethard
2019
A Parallel Corpus Mixtec-Spanish
Cynthia Montaño | Gerardo Sierra Martínez | Gemma Bel-Enguix | Helena Gomez
Proceedings of the 2019 Workshop on Widening NLP
This work describes the compilation of a parallel corpus of Spanish-Mixtec documents. There are not many Spanish-Mixtec parallel texts, and most of the sources are non-digital books. As a result, we have to deal with errors introduced when digitizing the sources and with difficulties in sentence alignment, as well as with the lack of a standard orthography. Our parallel corpus consists of sixty texts drawn from books and digital repositories. These documents belong to different domains: history, traditional stories, didactic material, recipes, ethnographical descriptions of each town, and instruction manuals for disease prevention. We have classified this material into five major categories: didactic (6 texts), educative (6 texts), interpretative (7 texts), narrative (39 texts), and poetic (2 texts). The final total of tokens is 49,814 Spanish words and 47,774 Mixtec words. The texts come from the states of Oaxaca (48 texts), Guerrero (9 texts), and Puebla (3 texts). These figures show that the corpus is unbalanced with respect to the representation of the different territories. While 55% of speakers are in Oaxaca, 80% of the texts come from this region. Guerrero has 30% of the speakers and 15% of the texts, and Puebla, with 15% of the speakers, accounts for 5% of the corpus.
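The imbalance figures in this abstract can be checked with a little arithmetic. The sketch below is only an illustration and is not taken from the paper: the function names `text_share` and `representation_gap` are hypothetical, and the only inputs are the text counts and speaker percentages quoted in the abstract.

```python
# Text counts per state, as reported in the abstract.
texts_per_state = {"Oaxaca": 48, "Guerrero": 9, "Puebla": 3}
# Share of Mixtec speakers per state (percent), as reported in the abstract.
speaker_share = {"Oaxaca": 55, "Guerrero": 30, "Puebla": 15}


def text_share(counts):
    """Return each state's share of the texts as a percentage of the total."""
    total = sum(counts.values())
    return {state: 100 * n / total for state, n in counts.items()}


def representation_gap(counts, speakers):
    """Text share minus speaker share, in percentage points.

    A positive value means the state is over-represented in the corpus.
    """
    shares = text_share(counts)
    return {state: shares[state] - speakers[state] for state in counts}


shares = text_share(texts_per_state)
gaps = representation_gap(texts_per_state, speaker_share)
for state in texts_per_state:
    print(
        f"{state}: {shares[state]:.0f}% of texts, "
        f"{speaker_share[state]}% of speakers, gap {gaps[state]:+.0f} points"
    )
```

With the quoted counts, the 60 texts split 80% / 15% / 5%, matching the percentages in the abstract, so Oaxaca is over-represented by 25 points while Guerrero and Puebla are under-represented by 15 and 10 points.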
2017
Discriminating between Similar Languages Using a Combination of Typed and Untyped Character N-grams and Words
Helena Gomez | Ilia Markov | Jorge Baptista | Grigori Sidorov | David Pinto
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)
This paper presents the cic_ualg system that took part in the Discriminating between Similar Languages (DSL) shared task, held at the VarDial 2017 Workshop. This year’s task aims at identifying 14 languages across 6 language groups using a corpus of excerpts of journalistic texts. Two classification approaches were compared: a single-step (all languages) approach and a two-step (language group and then languages within the group) approach. Features exploited include lexical features (unigrams of words) and character n-grams. Besides traditional (untyped) character n-grams, we introduce typed character n-grams in the DSL task. Experiments were carried out with different feature representation methods (binary and raw term frequency), frequency threshold values, and machine-learning algorithms: Support Vector Machines (SVM) and Multinomial Naive Bayes (MNB). Our best run in the DSL task achieved 91.46% accuracy.
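As a rough illustration of two of the feature representations named in this abstract, the sketch below extracts untyped character n-grams and encodes them either as raw term frequencies or as binary presence flags. It is a minimal sketch and not the cic_ualg system: the function names `char_ngrams` and `featurize` are hypothetical, and typed n-grams, frequency thresholds, and the SVM and MNB classifiers are left out.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`, in order.

    These are untyped n-grams: no distinction is made by position or content.
    """
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def featurize(text, n=3, binary=False):
    """Map a text to a dict of character n-gram features.

    binary=False gives raw term frequency (how often each n-gram occurs);
    binary=True gives presence only (every observed n-gram maps to 1).
    """
    counts = Counter(char_ngrams(text, n))
    if binary:
        return {gram: 1 for gram in counts}
    return dict(counts)


# Toy input chosen only for illustration.
print(featurize("banana", n=2))               # raw term frequency
print(featurize("banana", n=2, binary=True))  # binary presence
```

For the toy string "banana" with n=2, the bigrams "an" and "na" each occur twice under raw term frequency and collapse to 1 under the binary representation, which is the difference between the two encodings compared in the abstract.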
2016
CICBUAPnlp at SemEval-2016 Task 4-A: Discovering Twitter Polarity using Enhanced Embeddings
Helena Gomez | Darnes Vilariño | Grigori Sidorov | David Pinto Avendaño
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)
2015
CICBUAPnlp: Graph-Based Approach for Answer Selection in Community Question Answering Task
Helena Gomez | Darnes Vilariño | David Pinto | Grigori Sidorov
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)
2013
BUAP: N-gram based Feature Evaluation for the Cross-Lingual Textual Entailment Task
Darnes Vilariño | David Pinto | Saúl León | Yuridiana Alemán | Helena Gómez
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)