Abinew Ayele


2024

SemRel2024: A Collection of Semantic Textual Relatedness Datasets for 13 Languages
Nedjma Ousidhoum | Shamsuddeen Muhammad | Mohamed Abdalla | Idris Abdulmumin | Ibrahim Ahmad | Sanchit Ahuja | Alham Aji | Vladimir Araujo | Abinew Ayele | Pavan Baswani | Meriem Beloucif | Chris Biemann | Sofia Bourhim | Christine Kock | Genet Dekebo | Oumaima Hourrane | Gopichand Kanumolu | Lokesh Madasu | Samuel Rutunda | Manish Shrivastava | Thamar Solorio | Nirmal Surange | Hailegnaw Tilaye | Krishnapriya Vishnubhotla | Genta Winata | Seid Yimam | Saif Mohammad
Findings of the Association for Computational Linguistics ACL 2024

Exploring and quantifying semantic relatedness is central to representing language and holds significant implications across various NLP tasks. While earlier NLP research primarily focused on semantic similarity, often within the English language context, we instead investigate the broader phenomenon of semantic relatedness. In this paper, we present SemRel, a new semantic relatedness dataset collection annotated by native speakers across 13 languages: Afrikaans, Algerian Arabic, Amharic, English, Hausa, Hindi, Indonesian, Kinyarwanda, Marathi, Moroccan Arabic, Modern Standard Arabic, Spanish, and Telugu. These languages originate from five distinct language families and are predominantly spoken in Africa and Asia – regions characterised by a relatively limited availability of NLP resources. Each instance in the SemRel datasets is a sentence pair associated with a score that represents the degree of semantic textual relatedness between the two sentences. The scores are obtained using a comparative annotation framework. We describe the data collection and annotation processes, challenges when building the datasets, baseline experiments, and their impact and utility in NLP.

2023

AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages
Shamsuddeen Muhammad | Idris Abdulmumin | Abinew Ayele | Nedjma Ousidhoum | David Adelani | Seid Yimam | Ibrahim Ahmad | Meriem Beloucif | Saif Mohammad | Sebastian Ruder | Oumaima Hourrane | Alipio Jorge | Pavel Brazdil | Felermino Ali | Davis David | Salomey Osei | Bello Shehu-Bello | Falalu Lawan | Tajuddeen Gwadabe | Samuel Rutunda | Tadesse Destaw Belay | Wendimu Messelle | Hailu Balcha | Sisay Chala | Hagos Gebremichael | Bernard Opoku | Stephen Arthur
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Africa is home to over 2,000 languages from over six language families and has the highest linguistic diversity among all continents. This includes 75 languages with at least one million speakers each. Yet, there is little NLP research conducted on African languages. Crucial in enabling such research is the availability of high-quality annotated datasets. In this paper, we introduce AfriSenti, a sentiment analysis benchmark that contains more than 110,000 tweets in 14 African languages (Amharic, Algerian Arabic, Hausa, Igbo, Kinyarwanda, Moroccan Arabic, Mozambican Portuguese, Nigerian Pidgin, Oromo, Swahili, Tigrinya, Twi, Xitsonga, and Yoruba) from four language families. The tweets were annotated by native speakers and used in the AfriSenti-SemEval shared task (with over 200 participants, see website: https://afrisenti-semeval.github.io). We describe the data collection methodology, annotation process, and the challenges we dealt with when curating each dataset. We further report baseline experiments conducted on the AfriSenti datasets and discuss their usefulness.

2022

Question Answering Classification for Amharic Social Media Community Based Questions
Tadesse Destaw Belay | Seid Muhie Yimam | Abinew Ayele | Chris Biemann
Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages

In this work, we build a Question Answering (QA) classification dataset from a social media platform, namely the public Telegram channel @AskAnythingEthiopia. The channel has more than 78k subscribers and has existed since May 31, 2019. The platform allows asking questions that belong to various domains, such as politics, economics, health, and education. Since the questions are code-mixed, we apply different strategies to pre-process the dataset. Questions are posted in Amharic, English, or Amharic written in Latin script. As part of the pre-processing tools, we build a Latin-to-Ethiopic script transliteration tool. We collect 8k Amharic and 24k transliterated questions and develop deep learning-based question answering classifiers that attain an F-score of up to 57.29 across 20 question classes or categories. The datasets and pre-processing scripts are open-sourced to facilitate further research on Amharic community-based question answering.

2021

The Development of Pre-processing Tools and Pre-trained Embedding Models for Amharic
Tadesse Destaw Belay | Abinew Ayele | Seid Muhie Yimam
Proceedings of the Fifth Workshop on Widening Natural Language Processing

Amharic is the second most spoken Semitic language after Arabic and serves as the official working language of Ethiopia. While Amharic NLP research has recently received wider attention, the main bottleneck is that resources and related tools are not publicly released, which keeps it a low-resource language. For this reason, we observe that different researchers repeat the same NLP research again and again. In this work, we investigate the existing approaches in Amharic NLP and take the first step to publicly release tools, datasets, and models to advance Amharic NLP research. We build Python-based preprocessing tools for Amharic (tokenizer, sentence segmenter, and text cleaner) that can easily be used and integrated for the development of NLP applications. Furthermore, we compile the first moderately large-scale Amharic text corpus (6.8m sentences) along with word2vec, fastText, RoBERTa, and FLAIR embedding models. Finally, we compile benchmark datasets and build classification models for the named entity recognition task.

2020

Exploring Amharic Sentiment Analysis from Social Media Texts: Building Annotation Tools and Classification Models
Seid Muhie Yimam | Hizkiel Mitiku Alemayehu | Abinew Ayele | Chris Biemann
Proceedings of the 28th International Conference on Computational Linguistics

This paper presents a study of sentiment analysis for Amharic social media texts. As the number of social media users is ever-increasing, social media platforms would like to understand the latent meaning and sentiments of a text to enhance decision-making procedures. However, low-resource languages such as Amharic have received less attention for several reasons, such as the lack of well-annotated datasets, the unavailability of computing resources, and few or no expert researchers in the area. This research addresses three main research questions. We first explore the suitability of existing tools for the sentiment analysis task. Annotation tools that support large-scale annotation tasks in Amharic are scarce. Also, the existing crowdsourcing platforms do not support Amharic text annotation. Hence, we build a social-network-friendly annotation tool called ‘ASAB’ using a Telegram bot. We collect 9.4k tweets, where each tweet is annotated by three Telegram users. Moreover, we explore the suitability of machine learning approaches for Amharic sentiment analysis. The FLAIR deep learning text classifier, based on network embeddings that are computed from a distributional thesaurus, outperforms other supervised classifiers. We further investigate the challenges in building a sentiment analysis system for Amharic and find that the widespread use of sarcasm and figurative speech is the main issue in dealing with the problem. To advance sentiment analysis research in Amharic and other related low-resource languages, we release the dataset, the annotation tool, source code, and models publicly under a permissive license.