2024
pdf
bib
abs
PCICUNAM at WASSA 2024: Cross-lingual Emotion Detection Task with Hierarchical Classification and Weighted Loss Functions
Jesús Vázquez-Osorio
|
Gerardo Sierra
|
Helena Gómez-Adorno
|
Gemma Bel-Enguix
Proceedings of the 14th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis
This paper addresses the shared task of multi-lingual emotion detection in tweets, presented at the Workshop on Computational Approaches to Subjectivity, Sentiment, and Social Media Analysis (WASSA) co-located with the ACL 2024 conference. The task involves predicting emotions from six classes in tweets from five different languages using only English for model training. Our approach focuses on addressing class imbalance through data augmentation, hierarchical classification, and the application of focal loss and weighted cross-entropy loss functions. These methods enhance our transformer-based model’s ability to transfer emotion detection capabilities across languages, resulting in improved performance despite the constraints of limited computational resources.
pdf
bib
abs
PCIC at SMM4H 2024: Enhancing Reddit Post Classification on Social Anxiety Using Transformer Models and Advanced Loss Functions
Leon Hecht
|
Victor Pozos
|
Helena Gomez Adorno
|
Gibran Fuentes-Pineda
|
Gerardo Sierra
|
Gemma Bel-Enguix
Proceedings of The 9th Social Media Mining for Health Research and Applications (SMM4H 2024) Workshop and Shared Tasks
We present our approach to solving the task of identifying the effect of outdoor activities on social anxiety based on reddit posts. We employed state-of-the-art transformer models enhanced with a combination of advanced loss functions. Data augmentation techniques were also used to address class imbalance within the training set. Our method achieved a macro-averaged F1-score of 0.655 on the test data, surpassing the workshop’s mean F1-Score of 0.519. These findings suggest that integrating weighted loss functions improves the performance of transformer models in classifying unbalanced text data, while data augmentation can improve the model’s ability to generalize.
pdf
bib
abs
MBZUAI-UNAM at SemEval-2024 Task 1: Sentence-CROBI, a Simple Cross-Bi-Encoder-Based Neural Network Architecture for Semantic Textual Relatedness
Jesus German Ortiz Barajas
|
Gemma Bel-enguix
|
Helena Goméz-adorno
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
The Semantic Textual Relatedness (STR) shared task aims at detecting the degree of semantic relatedness between pairs of sentences on low-resource languages from Afroasiatic, Indoeuropean, Austronesian, Dravidian, and Nigercongo families. We use the Sentence-CROBI architecture to tackle this problem. The model is adapted from its original purpose of paraphrase detection to explore its capacities in a related task with limited resources and in multilingual and monolingual settings. Our approach combines the vector representation of cross-encoders and bi-encoders and possesses high adaptable capacity by combining several pre-trained models. Our system obtained good results on the low-resource languages of the dataset using a multilingual fine-tuning approach.
2023
pdf
bib
abs
HOMO-MEX: A Mexican Spanish Annotated Corpus for LGBT+phobia Detection on Twitter
Juan Vásquez
|
Scott Andersen
|
Gemma Bel-enguix
|
Helena Gómez-adorno
|
Sergio-luis Ojeda-trueba
The 7th Workshop on Online Abuse and Harms (WOAH)
In the past few years, the NLP community has actively worked on detecting LGBT+Phobia in online spaces, using textual data publicly available Most of these are for the English language and its variants since it is the most studied language by the NLP community. Nevertheless, efforts towards creating corpora in other languages are active worldwide. Despite this, the Spanish language is an understudied language regarding digital LGBT+Phobia. The only corpus we found in the literature was for the Peninsular Spanish dialects, which use LGBT+phobic terms different than those in the Mexican dialect. For this reason, we present Homo-MEX, a novel corpus for detecting LGBT+Phobia in Mexican Spanish. In this paper, we describe our data-gathering and annotation process. Also, we present a classification benchmark using various traditional machine learning algorithms and two pre-trained deep learning models to showcase our corpus classification potential.
2020
pdf
bib
abs
Enhancing Job Searches in Mexico City with Language Technologies
Gerardo Sierra Martínez
|
Gemma Bel-Enguix
|
Helena Gómez-Adorno
|
Juan Manuel Torres Moreno
|
Tonatiuh Hernández-García
|
Julio V Guadarrama-Olvera
|
Jesús-Germán Ortiz-Barajas
|
Ángela María Rojas
|
Tomas Damerau
|
Soledad Aragón Martínez
Proceedings of the 1st Workshop on Language Technologies for Government and Public Administration (LT4Gov)
In this paper, we show the enhancing of the Demanded Skills Diagnosis (DiCoDe: Diagnóstico de Competencias Demandadas), a system developed by Mexico City’s Ministry of Labor and Employment Promotion (STyFE: Secretaría de Trabajo y Fomento del Empleo de la Ciudad de México) that seeks to reduce information asymmetries between job seekers and employers. The project uses webscraping techniques to retrieve job vacancies posted on private job portals on a daily basis and with the purpose of informing training and individual case management policies as well as labor market monitoring. For this purpose, a collaboration project between STyFE and the Language Engineering Group (GIL: Grupo de Ingeniería Lingüística) was established in order to enhance DiCoDe by applying NLP models and semantic analysis. By this collaboration, DiCoDe’s job vacancies system’s macro-structure and its geographic referencing at the city hall (municipality) level were improved. More specifically, dictionaries were created to identify demanded competencies, skills and abilities (CSA) and algorithms were developed for dynamic classifying of vacancies and identifying terms for searches on free text, in order to improve the results and processing time of queries.
pdf
bib
abs
Automatic Word Association Norms (AWAN)
Jorge Reyes-Magaña
|
Gerardo Sierra Martínez
|
Gemma Bel-Enguix
|
Helena Gomez-Adorno
Proceedings of the Workshop on the Cognitive Aspects of the Lexicon
Word Association Norms (WAN) are collections that present stimuli words and the set of their associated responses. The corpus is widely used in diverse areas of expertise. In order to reduce the effort to have a good quality resource that can be reproduced in many languages with minimum sources, a methodology to build Automatic Word Association Norms is proposed (AWAN). The methodology has an input of two simple elements: a) dictionary, and b) pre-processed Word Embeddings. This new kind of WAN is evaluated in two ways: i) learning word embeddings based on the node2vec algorithm and comparing them with human annotated benchmarks, and ii) performing a lexical search for a reverse dictionary. Both evaluations are done in a weighted graph with the AWAN lexical elements. The results showed that the methodology produces good quality AWANs.
pdf
bib
abs
MineriaUNAM at SemEval-2020 Task 3: Predicting Contextual WordSimilarity Using a Centroid Based Approach and Word Embeddings
Helena Gomez-Adorno
|
Gemma Bel-Enguix
|
Jorge Reyes-Magaña
|
Benjamín Moreno
|
Ramón Casillas
|
Daniel Vargas
Proceedings of the Fourteenth Workshop on Semantic Evaluation
This paper presents our systems to solve Task 3 of Semeval-2020, which aims to predict the effect that context has on human perception of similarity of words. The task consists of two subtasks in English, Croatian, Finnish, and Slovenian: (1) predicting the change of similarity and (2) predicting the human scores of similarity, both of them for a pair of words within two different contexts. We tackled the problem by developing two systems, the first one uses a centroid approach and word vectors. The second one uses the ELMo language model, which is trained for each pair of words with the given context. Our approach achieved the highest score in subtask 2 for the English language.
2019
pdf
bib
abs
MineriaUNAM at SemEval-2019 Task 5: Detecting Hate Speech in Twitter using Multiple Features in a Combinatorial Framework
Luis Enrique Argota Vega
|
Jorge Carlos Reyes-Magaña
|
Helena Gómez-Adorno
|
Gemma Bel-Enguix
Proceedings of the 13th International Workshop on Semantic Evaluation
This paper presents our approach to the Task 5 of Semeval-2019, which aims at detecting hate speech against immigrants and women in Twitter. The task consists of two sub-tasks, in Spanish and English: (A) detection of hate speech and (B) classification of hateful tweets as aggressive or not, and identification of the target harassed as individual or group. We used linguistically motivated features and several types of n-grams (words, characters, functional words, punctuation symbols, POS, among others). For task A, we trained a Support Vector Machine using a combinatorial framework, whereas for task B we followed a multi-labeled approach using the Random Forest classifier. Our approach achieved the highest F1-score in sub-task A for the Spanish language.