Tharindu Ranasinghe


2024

DARES: Dataset for Arabic Readability Estimation of School Materials
Mo El-Haj | Sultan Almujaiwel | Damith Premasiri | Tharindu Ranasinghe | Ruslan Mitkov
Proceedings of the Workshop on DeTermIt! Evaluating Text Difficulty in a Multilingual Context @ LREC-COLING 2024

This research introduces DARES, a dataset for assessing the readability of Arabic text in Saudi school materials. DARES comprises 13,335 instances from textbooks used in 2021 and contains two subtasks: (a) coarse-grained readability assessment, where the text is classified into different educational levels such as primary and secondary, and (b) fine-grained readability assessment, where the text is classified into individual grades. We fine-tuned five transformer models that support Arabic and found that CAMeLBERTmix performed the best in all input settings. Evaluation results showed high performance for the coarse-grained readability assessment task, achieving a weighted F1 score of 0.91 and a macro F1 score of 0.79. The fine-grained task achieved a weighted F1 score of 0.68 and a macro F1 score of 0.55. These findings demonstrate the potential of our approach for advancing Arabic text readability assessment in education, with implications for future innovations in the field.
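
The gap between the weighted and macro F1 scores reported in this abstract is typical of imbalanced label sets: weighted F1 scales each class's F1 by its support, whereas macro F1 counts every class equally. The following is a minimal sketch in plain Python; the labels and predictions are invented toy data (not drawn from DARES), and the function names are illustrative only.

```python
from collections import Counter

def per_class_f1(gold, pred):
    """Return {label: F1}, computed one-vs-rest for every label in gold."""
    scores = {}
    for label in set(gold):
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(gold, pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(gold, pred)
    return sum(scores.values()) / len(scores)

def weighted_f1(gold, pred):
    """Mean of the per-class F1 scores, weighted by class support."""
    scores = per_class_f1(gold, pred)
    support = Counter(gold)
    return sum(scores[label] * support[label] for label in scores) / len(gold)

# Hypothetical toy data: one frequent class and one rare class.
gold = ["primary"] * 8 + ["secondary"] * 2
pred = ["primary"] * 8 + ["primary", "secondary"]

print(round(macro_f1(gold, pred), 4), round(weighted_f1(gold, pred), 4))
```

Because the rare class is the one that is misclassified here, the macro score falls below the weighted score, the same direction as the gap in the reported results.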

DORE: A Dataset for Portuguese Definition Generation
Anna Beatriz Dimas Furtado | Tharindu Ranasinghe | Frederic Blain | Ruslan Mitkov
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Definition modelling (DM) is the task of automatically generating a dictionary definition of a specific word. Computational systems that are capable of DM can have numerous applications benefiting a wide range of audiences. As DM is considered a supervised natural language generation problem, these systems require large annotated datasets to train the machine learning (ML) models. Several DM datasets have been released for English and other high-resource languages. While Portuguese is considered a mid/high-resource language in most natural language processing tasks and is spoken by more than 200 million native speakers, there is no DM dataset available for Portuguese. In this research, we fill this gap by introducing DORE, the first dataset for Definition MOdelling for PoRtuguEse, containing more than 100,000 definitions. We also evaluate several deep learning-based DM models on DORE and report the results. The dataset and the findings of this paper will facilitate research and study of Portuguese in wider contexts.

Guided Distant Supervision for Multilingual Relation Extraction Data: Adapting to a New Language
Alistair Plum | Tharindu Ranasinghe | Christoph Purschke
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Relation extraction is essential for extracting and understanding biographical information in the context of digital humanities and related subjects. There is a growing interest in the community to build datasets capable of training machine learning models to extract relationships. However, annotating such datasets can be expensive and time-consuming, in addition to being limited to English. This paper applies guided distant supervision to create a large biographical relationship extraction dataset for German. Our dataset, composed of more than 80,000 instances for nine relationship types, is the largest biographical German relationship extraction dataset. We also create a manually annotated dataset with 2,000 instances to evaluate the models and release it together with the dataset compiled using guided distant supervision. We train several state-of-the-art machine learning models on the automatically created dataset and release them as well. Furthermore, we conduct multilingual and cross-lingual zero-shot experiments that could benefit many low-resource languages.

MentalHelp: A Multi-Task Dataset for Mental Health in Social Media
Nishat Raihan | Sadiya Sayara Chowdhury Puspo | Shafkat Farabi | Ana-Maria Bucur | Tharindu Ranasinghe | Marcos Zampieri
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Early detection of mental health disorders is an essential step in treating and preventing mental health conditions. Computational approaches have been applied to users’ social media profiles in an attempt to identify various mental health conditions such as depression, PTSD, schizophrenia, and eating disorders. The interest in this topic has motivated the creation of various depression detection datasets. However, annotating such datasets is expensive and time-consuming, limiting their size and scope. To overcome this limitation, we present MentalHelp, a large-scale semi-supervised mental disorder detection dataset containing 14 million instances. The corpus was collected from Reddit and labeled in a semi-supervised way using an ensemble of three separate models: flan-T5, Disor-BERT, and Mental-BERT.

NSina: A News Corpus for Sinhala
Hansi Hettiarachchi | Damith Premasiri | Lasitha Randunu Chandrakantha Uyangodage | Tharindu Ranasinghe
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

The introduction of large language models (LLMs) has advanced natural language processing (NLP), but their effectiveness is largely dependent on pre-training resources. This is especially evident in low-resource languages, such as Sinhala, which face two primary challenges: the lack of substantial training data and limited benchmarking datasets. In response, this study introduces NSina, a comprehensive news corpus of over 500,000 articles from popular Sinhala news websites, along with three NLP tasks: news media identification, news category prediction, and news headline generation. The release of NSina aims to provide a solution to challenges in adapting LLMs to Sinhala, offering valuable resources and benchmarks for improving NLP in the Sinhala language. NSina is the largest news corpus available for Sinhala to date.

A Federated Learning Approach to Privacy Preserving Offensive Language Identification
Marcos Zampieri | Damith Premasiri | Tharindu Ranasinghe
Proceedings of the Fourth Workshop on Threat, Aggression & Cyberbullying @ LREC-COLING-2024

The spread of various forms of offensive speech online is an important concern in social media. While platforms have been investing heavily in ways of coping with this problem, the question of privacy remains largely unaddressed. Models trained to detect offensive language on social media are trained and/or fine-tuned using large amounts of data often stored in centralized servers. Since most social media data originates from end users, we propose a privacy-preserving decentralized architecture for identifying offensive language online by introducing Federated Learning (FL) in the context of offensive language identification. FL is a decentralized architecture that allows multiple models to be trained locally without the need for data sharing, hence preserving users’ privacy. We propose a model fusion approach to perform FL. We trained multiple deep learning models on four publicly available English benchmark datasets (AHSD, HASOC, HateXplain, OLID) and evaluated their performance in detail. We also present initial cross-lingual experiments in English and Spanish. We show that the proposed model fusion approach outperforms baselines on all the datasets while preserving privacy.
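
As a rough illustration of how locally trained models can be combined without sharing data, the sketch below averages parameter values across clients, optionally weighted by local dataset size. This is plain parameter averaging in the style of federated averaging, assumed here for illustration; the abstract does not specify the fusion procedure, so it should not be read as the paper's method. The function name, parameter names, and values are all hypothetical.

```python
def fuse_models(client_weights, client_sizes=None):
    """Average parameters from several locally trained models.

    client_weights: list of dicts mapping a parameter name to a list of floats.
    client_sizes: optional list of local dataset sizes used as averaging weights.
    Only parameters leave each client; the training data never does.
    """
    if client_sizes is None:
        client_sizes = [1] * len(client_weights)
    total = sum(client_sizes)
    fused = {}
    for name, first_values in client_weights[0].items():
        fused[name] = [
            sum(w[name][i] * s for w, s in zip(client_weights, client_sizes)) / total
            for i in range(len(first_values))
        ]
    return fused

# Hypothetical parameters from two clients that trained on private data.
clients = [
    {"layer.weight": [0.0, 2.0], "layer.bias": [1.0]},
    {"layer.weight": [2.0, 4.0], "layer.bias": [3.0]},
]
fused = fuse_models(clients)
print(fused)
```

Passing `client_sizes` gives clients with more local data a proportionally larger influence on the fused parameters.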

The BEA 2024 Shared Task on the Multilingual Lexical Simplification Pipeline
Matthew Shardlow | Fernando Alva-Manchego | Riza Batista-Navarro | Stefan Bott | Saul Calderon Ramirez | Rémi Cardon | Thomas François | Akio Hayakawa | Andrea Horbach | Anna Hülsing | Yusuke Ide | Joseph Marvin Imperial | Adam Nohejl | Kai North | Laura Occhipinti | Nelson Peréz Rojas | Nishat Raihan | Tharindu Ranasinghe | Martin Solis Salazar | Sanja Štajner | Marcos Zampieri | Horacio Saggion
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)

We report the findings of the 2024 Multilingual Lexical Simplification Pipeline shared task. We released a new dataset comprising 5,927 instances of lexical complexity prediction and lexical simplification on common contexts across 10 languages, split into trial (300) and test (5,627). Ten teams participated across 2 tracks and 10 languages with 233 runs evaluated across all systems. Five teams participated in all languages for the lexical complexity prediction task and four teams participated in all languages for the lexical simplification task. Teams employed a range of strategies, making use of open and closed source large language models for lexical simplification, as well as feature-based approaches for lexical complexity prediction. The highest scoring team on the combined multilingual data was able to obtain a Pearson’s correlation of 0.6241 and an ACC@1@Top1 of 0.3772, both demonstrating that there is still room for improvement on two difficult sub-tasks of the lexical simplification pipeline.

An Extensible Massively Multilingual Lexical Simplification Pipeline Dataset using the MultiLS Framework
Matthew Shardlow | Fernando Alva-Manchego | Riza Batista-Navarro | Stefan Bott | Saul Calderon Ramirez | Rémi Cardon | Thomas François | Akio Hayakawa | Andrea Horbach | Anna Hülsing | Yusuke Ide | Joseph Marvin Imperial | Adam Nohejl | Kai North | Laura Occhipinti | Nelson Peréz Rojas | Nishat Raihan | Tharindu Ranasinghe | Martin Solis Salazar | Marcos Zampieri | Horacio Saggion
Proceedings of the 3rd Workshop on Tools and Resources for People with REAding DIfficulties (READI) @ LREC-COLING 2024

We present preliminary findings on the MultiLS dataset, developed in support of the 2024 Multilingual Lexical Simplification Pipeline (MLSP) Shared Task. This dataset currently comprises 300 instances of lexical complexity prediction and lexical simplification across 10 languages. In this paper, we (1) describe the annotation protocol in support of the contribution of future datasets and (2) present summary statistics on the existing data that we have gathered. Multilingual lexical simplification can be used to support low-ability readers to engage with otherwise difficult texts in their native, often low-resourced, languages.

2023

Offensive Language Identification in Transliterated and Code-Mixed Bangla
Md Nishat Raihan | Umma Tanmoy | Anika Binte Islam | Kai North | Tharindu Ranasinghe | Antonios Anastasopoulos | Marcos Zampieri
Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)

Identifying offensive content in social media is vital to create safe online communities. Several recent studies have addressed this problem by creating datasets for various languages. In this paper, we explore offensive language identification in texts with transliterations and code-mixing, linguistic phenomena common in multilingual societies, and a known challenge for NLP systems. We introduce TB-OLID, a transliterated Bangla offensive language dataset containing 5,000 manually annotated comments. We train and fine-tune machine learning models on TB-OLID, and we evaluate their results on this dataset. Our results show that English pre-trained transformer-based models, such as fBERT and HateBERT, achieve the best performance on this dataset.

Target-Based Offensive Language Identification
Marcos Zampieri | Skye Morgan | Kai North | Tharindu Ranasinghe | Austin Simmmons | Paridhi Khandelwal | Sara Rosenthal | Preslav Nakov
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

We present TBO, a new dataset for Target-based Offensive language identification. TBO contains post-level annotations regarding the harmfulness of an offensive post and token-level annotations comprising the target and the offensive argument expression. Popular offensive language identification datasets for social media focus on annotation taxonomies only at the post level, and more recently, some datasets have been released that feature only token-level annotations. TBO is an important resource that bridges the gap between post-level and token-level annotation datasets by introducing a single comprehensive unified annotation taxonomy. We use the TBO taxonomy to annotate post-level and token-level offensive language on English Twitter posts. We release an initial dataset of over 4,500 instances collected from Twitter and we carry out multiple experiments to compare the performance of different models trained and tested on TBO.

Teacher and Student Models of Offensive Language in Social Media
Tharindu Ranasinghe | Marcos Zampieri
Findings of the Association for Computational Linguistics: ACL 2023

State-of-the-art approaches to identifying offensive language online make use of large pre-trained transformer models. However, the inference time, disk, and memory requirements of these transformer models present challenges for their wide usage in the real world. Even the distilled transformer models remain prohibitively large for many usage scenarios. To cope with these challenges, in this paper, we propose transferring knowledge from transformer models to much smaller neural models to make predictions at the token level and at the post level. We show that this approach leads to lightweight offensive language identification models that perform on par with large transformers but with 100 times fewer parameters and much less memory usage.

A Multi-task Learning Framework for Quality Estimation
Sourabh Deoghare | Paramveer Choudhary | Diptesh Kanojia | Tharindu Ranasinghe | Pushpak Bhattacharyya | Constantin Orăsan
Findings of the Association for Computational Linguistics: ACL 2023

Quality Estimation (QE) is the task of evaluating machine translation output in the absence of reference translation. Conventional approaches to QE involve training separate models at different levels of granularity, viz., word-level, sentence-level, and document-level, which sometimes lead to inconsistent predictions for the same input. To overcome this limitation, we focus on jointly training a single model for sentence-level and word-level QE tasks in a multi-task learning framework. Using two multi-task learning-based QE approaches, we show that multi-task learning improves the performance of both tasks. We evaluate these approaches by performing experiments in different settings, viz., single-pair, multi-pair, and zero-shot. We compare the multi-task learning-based approach with baseline QE models trained on single tasks and observe an improvement of up to 4.28% in Pearson’s correlation (r) at sentence-level and 8.46% in F1-score at word-level, in the single-pair setting. In the multi-pair setting, we observe improvements of up to 3.04% at sentence-level and 13.74% at word-level, while in the zero-shot setting, we also observe improvements of up to 5.26% and 3.05%, respectively. We make the models proposed in this paper publicly available.

Quality Estimation-Assisted Automatic Post-Editing
Sourabh Deoghare | Diptesh Kanojia | Fred Blain | Tharindu Ranasinghe | Pushpak Bhattacharyya
Findings of the Association for Computational Linguistics: EMNLP 2023

Automatic Post-Editing (APE) systems are prone to over-correction of the Machine Translation (MT) outputs. While a word-level Quality Estimation (QE) system can provide a way to curtail the over-correction, a significant performance gain has not been observed thus far by utilizing existing APE and QE combination strategies. In this paper, we propose joint training of a model on APE and QE tasks to improve the APE. Our proposed approach utilizes a multi-task learning (MTL) methodology, which shows significant improvement while treating both tasks as a ‘bargaining game’ during training. Moreover, we investigate various existing combination strategies and show that our approach achieves state-of-the-art performance for a ‘distant’ language pair, viz., English-Marathi. We observe an improvement of 1.09 TER and 1.37 BLEU points over a baseline QE-Unassisted APE system for English-Marathi, while also observing 0.46 TER and 0.62 BLEU points for English-German. Further, we discuss the results qualitatively and show how our approach helps reduce over-correction, thereby improving the APE performance. We also observe that the degree of integration between QE and APE directly correlates with the APE performance gain. We release our code and models publicly.

A Text-to-Text Model for Multilingual Offensive Language Identification
Tharindu Ranasinghe | Marcos Zampieri
Findings of the Association for Computational Linguistics: IJCNLP-AACL 2023 (Findings)

ALEXSIS+: Improving Substitute Generation and Selection for Lexical Simplification with Information Retrieval
Kai North | Alphaeus Dmonte | Tharindu Ranasinghe | Matthew Shardlow | Marcos Zampieri
Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)

Lexical simplification (LS) automatically replaces words that are deemed difficult to understand for a given target population with simpler alternatives, whilst preserving the meaning of the original sentence. The TSAR-2022 shared task on LS provided participants with a multilingual lexical simplification test set. It contained nearly 1,200 complex words in English, Portuguese, and Spanish and presented multiple candidate substitutions for each complex word. The competition did not make training data available; therefore, teams had to use either off-the-shelf pre-trained large language models (LLMs) or out-of-domain data to develop their LS systems. As such, participants were unable to fully explore the capabilities of LLMs by re-training and/or fine-tuning them on in-domain data. To address this important limitation, we present ALEXSIS+, a multilingual dataset in the aforementioned three languages, and ALEXSIS++, an English monolingual dataset, which together contain more than 50,000 unique sentences retrieved from news corpora and annotated with cosine similarities to the original complex word and sentence. Using these additional contexts, we are able to generate new high-quality candidate substitutions that improve LS performance on the TSAR-2022 test set regardless of the language or model.

SurreyAI 2023 Submission for the Quality Estimation Shared Task
Archchana Sindhujan | Diptesh Kanojia | Constantin Orasan | Tharindu Ranasinghe
Proceedings of the Eighth Conference on Machine Translation

Quality Estimation (QE) systems are important in situations where it is necessary to assess the quality of translations, but there is no reference available. This paper describes the approach adopted by the SurreyAI team for addressing the Sentence-Level Direct Assessment shared task in WMT23. The proposed approach builds upon the TransQuest framework, exploring various autoencoder pre-trained language models within the MonoTransQuest architecture using single and ensemble settings. The autoencoder pre-trained language models employed in the proposed systems are XLMV, InfoXLM-large, and XLMR-large. The evaluation utilizes Spearman and Pearson correlation coefficients, assessing the relationship between machine-predicted quality scores and human judgments for 5 language pairs (English-Gujarati, English-Hindi, English-Marathi, English-Tamil and English-Telugu). The MonoTQ-InfoXLM-large approach emerges as a robust strategy, surpassing all other individual models proposed in this study by significantly improving over the baseline for the majority of the language pairs.

Proceedings of the 24th Annual Conference of the European Association for Machine Translation
Mary Nurminen | Judith Brenner | Maarit Koponen | Sirkku Latomaa | Mikhail Mikhailov | Frederike Schierl | Tharindu Ranasinghe | Eva Vanmassenhove | Sergi Alvarez Vidal | Nora Aranberri | Mara Nunziatini | Carla Parra Escartín | Mikel Forcada | Maja Popovic | Carolina Scarton | Helena Moniz
Proceedings of the 24th Annual Conference of the European Association for Machine Translation

Deep Learning Approaches to Detecting Safeguarding Concerns in Schoolchildren’s Online Conversations
Emma Franklin | Tharindu Ranasinghe
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

For school teachers and Designated Safeguarding Leads (DSLs), computers and other school-owned communication devices are both indispensable and deeply worrisome. For their education, children require access to the Internet, as well as a standard institutional ICT infrastructure, including e-mail and other forms of online communication technology. Given the sheer volume of data being generated and shared on a daily basis within schools, most teachers and DSLs can no longer monitor the safety and wellbeing of their students without the use of specialist safeguarding software. In this paper, we experiment with the use of state-of-the-art neural network models on the modelling of a dataset of almost 9,000 anonymised child-generated chat messages on the Microsoft Teams platform. The data was manually classified into eight fine-grained classes of safeguarding concerns (or false alarms) that a monitoring program would be interested in, and these were further split into two binary classes: true positives (real safeguarding concerns) and false positives (false alarms). For the fine-grained classification, our models achieved a macro F1 score of 73.56, while for the binary classification, we achieved a macro F1 score of 87.32. This first experiment into the use of Deep Learning for detecting safeguarding concerns represents an important step towards achieving high-accuracy and reliable monitoring information for busy teachers and safeguarding leads.

Explainable Event Detection with Event Trigger Identification as Rationale Extraction
Hansi Hettiarachchi | Tharindu Ranasinghe
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

Most event detection methods act at the sentence level and focus on identifying sentences related to a particular event. However, identifying certain parts of a sentence that act as event triggers is also important and more challenging, especially when dealing with limited training data. Previous event detection attempts have considered these two tasks separately and have developed different methods. We hypothesise that, similar to humans, successful sentence-level event detection models rely on event triggers to predict sentence-level labels. By exploring feature attribution methods that assign relevance scores to the inputs to explain model predictions, we study the behaviour of state-of-the-art sentence-level event detection models and show that explanations (i.e. rationales) extracted from these models can indeed be used to detect event triggers. We, therefore, (i) introduce a novel weakly-supervised method for event trigger detection; and (ii) propose to use event triggers as an explainable measure in sentence-level event detection. To the best of our knowledge, this is the first explainable machine learning approach to event trigger identification.

Can Model Fusing Help Transformers in Long Document Classification? An Empirical Study
Damith Premasiri | Tharindu Ranasinghe | Ruslan Mitkov
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

Text classification is an area of research which has been studied over the years in Natural Language Processing (NLP). Adapting NLP to multiple domains has introduced many new challenges for text classification and one of them is long document classification. While state-of-the-art transformer models provide excellent results in text classification, most of them have limitations in the maximum sequence length of the input sequence. The majority of the transformer models are limited to 512 tokens, and therefore, they struggle with long document classification problems. In this research, we explore employing Model Fusing for long document classification while comparing the results with well-known BERT and Longformer architectures.
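
To make the 512-token limit concrete, the sketch below splits a long token sequence into overlapping windows that each fit within the limit. This is a common generic workaround, shown here only to illustrate the constraint; it is not the Model Fusing approach the paper explores. The function name, the stride value, and the integer token IDs are assumptions for illustration.

```python
def split_into_windows(token_ids, max_len=512, stride=256):
    """Split a token sequence into overlapping windows of at most max_len tokens.

    Consecutive windows start `stride` tokens apart, so neighbouring windows
    overlap by max_len - stride tokens when stride < max_len.
    """
    if max_len <= 0 or stride <= 0:
        raise ValueError("max_len and stride must be positive")
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # the current window already reaches the end of the document
        start += stride
    return windows

# Hypothetical document of 1,000 token IDs.
document = list(range(1000))
windows = split_into_windows(document)
print([len(w) for w in windows])
```

Each window can then be classified separately and the per-window predictions combined, for example by averaging or voting; how best to combine them is a design choice outside the scope of this sketch.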

Deep Learning Methods for Identification of Multiword Flower and Plant Names
Damith Premasiri | Amal Haddad Haddad | Tharindu Ranasinghe | Ruslan Mitkov
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

Multiword Terms (MWTs) are domain-specific Multiword Expressions (MWE) where two or more lexemes converge to form a new unit of meaning. The task of processing MWTs is crucial in many Natural Language Processing (NLP) applications, including Machine Translation (MT) and terminology extraction. However, the automatic detection of those terms is a difficult task and more research is still required to give more insightful and useful results in this field. In this study, we seek to fill this gap using state-of-the-art transformer models. We evaluate both BERT-like discriminative transformer models and generative pre-trained transformer (GPT) models on this task, and we show that discriminative models perform better than current GPT models on the multiword term identification task for flower and plant names in English and Spanish. The best discriminative models achieve F1 scores of 94.3127% and 82.1733% on English and Spanish data, respectively, while ChatGPT achieves only 63.3183% and 47.7925%, respectively.

Publish or Hold? Automatic Comment Moderation in Luxembourgish News Articles
Tharindu Ranasinghe | Alistair Plum | Christoph Purschke | Marcos Zampieri
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

Recently, the internet has emerged as the primary platform for accessing news. In the majority of these news platforms, the users now have the ability to post comments on news articles and engage in discussions on various social media. While these features promote healthy conversations among users, they also serve as a breeding ground for spreading fake news, toxic discussions and hate speech. Moderating or removing such content is paramount to avoid unwanted consequences for the readers. However, apart from a few notable exceptions, most research on automatic moderation of news article comments has dealt with English and other high-resource languages. This leaves under-represented or low-resource languages at a loss. Addressing this gap, we perform the first large-scale qualitative analysis of more than one million Luxembourgish comments posted over the course of 14 years. We evaluate the performance of state-of-the-art transformer models in Luxembourgish news article comment moderation. Furthermore, we analyse how the language of Luxembourgish news article comments has changed over time. We observe that machine learning models trained on old comments do not perform well on recent data. The findings in this work will be beneficial in building news comment moderation systems for many low-resource languages.

Vicarious Offense and Noise Audit of Offensive Speech Classifiers: Unifying Human and Machine Disagreement on What is Offensive
Tharindu Weerasooriya | Sujan Dutta | Tharindu Ranasinghe | Marcos Zampieri | Christopher Homan | Ashiqur KhudaBukhsh
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Offensive speech detection is a key component of content moderation. However, what is offensive can be highly subjective. This paper investigates how machine and human moderators disagree on what is offensive when it comes to real-world social web political discourse. We show that (1) there is extensive disagreement among the moderators (humans and machines); and (2) human and large-language-model classifiers are unable to predict how other human raters will respond, based on their political leanings. For (1), we conduct a noise audit at an unprecedented scale that combines both machine and human responses. For (2), we introduce a first-of-its-kind dataset of vicarious offense. Our noise audit reveals that moderation outcomes vary wildly across different machine moderators. Our experiments with human moderators suggest that political leanings combined with sensitive issues affect both first-person and vicarious offense. The dataset is available through https://github.com/Homan-Lab/voiced.

2022

GMU-WLV at TSAR-2022 Shared Task: Evaluating Lexical Simplification Models
Kai North | Alphaeus Dmonte | Tharindu Ranasinghe | Marcos Zampieri
Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022)

This paper describes the GMU-WLV team's submission to the TSAR shared task on multilingual lexical simplification. The goal of the task is to automatically provide a set of candidate substitutions for complex words in context. The organizers provided participants with ALEXSIS, a manually annotated dataset with instances split between a small trial set with a dozen instances in each of the three languages of the competition (English, Portuguese, Spanish) and a test set with over 300 instances in the three aforementioned languages. To cope with the lack of training data, participants had to either use alternative data sources or pre-trained language models. We experimented with monolingual models: BERTimbau, ELECTRA, and RoBERTA-largeBNE. Our best system achieved 1st place out of sixteen systems for Portuguese, 8th out of thirty-three systems for English, and 6th out of twelve systems for Spanish.

ALEXSIS-PT: A New Resource for Portuguese Lexical Simplification
Kai North | Marcos Zampieri | Tharindu Ranasinghe
Proceedings of the 29th International Conference on Computational Linguistics

Lexical simplification (LS) is the task of automatically replacing complex words with easier ones, making texts more accessible to various target populations (e.g. individuals with low literacy, individuals with learning disabilities, second language learners). To train and test models, LS systems usually require corpora that feature complex words in context along with their potential substitutions. To continue improving the performance of LS systems, we introduce ALEXSIS-PT, a novel multi-candidate dataset for Brazilian Portuguese LS containing 9,605 candidate substitutions for 387 complex words. ALEXSIS-PT has been compiled following the ALEXSIS-ES protocol for Spanish, opening exciting new avenues for cross-lingual models. ALEXSIS-PT is the first LS multi-candidate dataset that contains Brazilian newspaper articles. We evaluated three models for substitute generation on this dataset, namely mBERT, XLM-R, and BERTimbau. The latter achieved the highest performance across all evaluation metrics.

DTW at Qur’an QA 2022: Utilising Transfer Learning with Transformers for Question Answering in a Low-resource Domain
Damith Premasiri | Tharindu Ranasinghe | Wajdi Zaghouani | Ruslan Mitkov
Proceedings of the 5th Workshop on Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur'an QA and Fine-Grained Hate Speech Detection

The task of machine reading comprehension (MRC) is a useful benchmark to evaluate the natural language understanding of machines. It has gained popularity in the natural language processing (NLP) field mainly due to the large number of datasets released for many languages. However, MRC has been understudied in several domains, including religious texts. The goal of the Qur’an QA 2022 shared task is to fill this gap by producing state-of-the-art question answering and reading comprehension research on the Qur’an. This paper describes the DTW entry to the Qur’an QA 2022 shared task. Our methodology uses transfer learning to take advantage of available Arabic MRC data. We further improve the results using various ensemble learning strategies. Our approach provided a partial Reciprocal Rank (pRR) score of 0.49 on the test set, proving its strong performance on the task.

2021

pdf bib
An Exploratory Analysis of Multilingual Word-Level Quality Estimation with Cross-Lingual Transformers
Tharindu Ranasinghe | Constantin Orasan | Ruslan Mitkov
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Most studies on word-level Quality Estimation (QE) of machine translation focus on language-specific models. The obvious disadvantages of these approaches are the need for labelled data for each language pair and the high cost required to maintain several language-specific models. To overcome these problems, we explore different approaches to multilingual, word-level QE. We show that multilingual QE models perform on par with the current language-specific models. In the cases of zero-shot and few-shot QE, we demonstrate that it is possible to accurately predict word-level quality for any given new language pair from models trained on other language pairs. Our findings suggest that the word-level QE models based on powerful pre-trained transformers that we propose in this paper generalise well across languages, making them more useful in real-world scenarios.

pdf bib
MUDES: Multilingual Detection of Offensive Spans
Tharindu Ranasinghe | Marcos Zampieri
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations

The interest in offensive content identification in social media has grown substantially in recent years. Previous work has dealt mostly with post-level annotations. However, identifying offensive spans is useful in many ways. To help cope with this important challenge, we present MUDES, a multilingual system to detect offensive spans in texts. MUDES features pre-trained models, a Python API for developers, and a user-friendly web-based interface. A detailed description of MUDES’ components is presented in this paper.

pdf bib
TransWiC at SemEval-2021 Task 2: Transformer-based Multilingual and Cross-lingual Word-in-Context Disambiguation
Hansi Hettiarachchi | Tharindu Ranasinghe
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

Identifying whether a word carries the same meaning or a different meaning in two contexts is an important research area in natural language processing which plays a significant role in many applications such as question answering, document summarisation, information retrieval and information extraction. Most of the previous work in this area relies on language-specific resources, making it difficult to generalise across languages. Considering this limitation, our approach to SemEval-2021 Task 2 is based only on pretrained transformer models and does not use any language-specific processing and resources. Despite that, our best model achieves 0.90 accuracy for the English-English subtask, which is very competitive compared to the best result of the subtask, 0.93 accuracy. Our approach also achieves satisfactory results in other monolingual and cross-lingual language pairs.

pdf bib
WLV-RIT at SemEval-2021 Task 5: A Neural Transformer Framework for Detecting Toxic Spans
Tharindu Ranasinghe | Diptanu Sarkar | Marcos Zampieri | Alexander Ororbia
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

In recent years, the widespread use of social media has led to an increase in the generation of toxic and offensive content on online platforms. In response, social media platforms have worked on developing automatic detection methods and employing human moderators to cope with this deluge of offensive content. While various state-of-the-art statistical models have been applied to detect toxic posts, there are only a few studies that focus on detecting the words or expressions that make a post offensive. This motivates the organization of the SemEval-2021 Task 5: Toxic Spans Detection competition, which has provided participants with a dataset containing toxic span annotations in English posts. In this paper, we present the WLV-RIT entry for the SemEval-2021 Task 5. Our best performing neural transformer model achieves an F1-score of 0.68. Furthermore, we develop an open-source framework for multilingual detection of offensive spans, i.e., MUDES, based on neural transformers that detect toxic spans in texts.

pdf bib
Transformers to Fight the COVID-19 Infodemic
Lasitha Uyangodage | Tharindu Ranasinghe | Hansi Hettiarachchi
Proceedings of the Fourth Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda

The massive spread of false information on social media has become a global risk, especially in a global pandemic situation like COVID-19. False information detection has thus become a surging research topic in recent months. The NLP4IF-2021 shared task on fighting the COVID-19 infodemic has been organised to strengthen the research in false information detection, where the participants are asked to predict seven different binary labels regarding false information in a tweet. The shared task has been organised in three languages: Arabic, Bulgarian and English. In this paper, we present our approach to tackle the task objective using transformers. Overall, our approach achieves a 0.707 mean F1 score in Arabic, 0.578 mean F1 score in Bulgarian and 0.864 mean F1 score in English, ranking 4th place in all the languages.

pdf bib
fBERT: A Neural Transformer for Identifying Offensive Content
Diptanu Sarkar | Marcos Zampieri | Tharindu Ranasinghe | Alexander Ororbia
Findings of the Association for Computational Linguistics: EMNLP 2021

Transformer-based models such as BERT, XLNET, and XLM-R have achieved state-of-the-art performance across various NLP tasks including the identification of offensive language and hate speech, an important problem in social media. In this paper, we present fBERT, a BERT model retrained on SOLID, the largest English offensive language identification corpus available with over 1.4 million offensive instances. We evaluate fBERT’s performance on identifying offensive content on multiple English datasets and we test several thresholds for selecting instances from SOLID. The fBERT model will be made freely available to the community.

pdf bib
Discovering Black Lives Matter Events in the United States: Shared Task 3, CASE 2021
Salvatore Giorgi | Vanni Zavarella | Hristo Tanev | Nicolas Stefanovitch | Sy Hwang | Hansi Hettiarachchi | Tharindu Ranasinghe | Vivek Kalyan | Paul Tan | Shaun Tan | Martin Andrews | Tiancheng Hu | Niklas Stoehr | Francesco Ignazio Re | Daniel Vegh | Dennis Atzenhofer | Brenda Curtis | Ali Hürriyetoğlu
Proceedings of the 4th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2021)

Evaluating the state-of-the-art event detection systems on determining spatio-temporal distribution of the events on the ground is performed infrequently. However, the ability to both (1) extract events “in the wild” from text and (2) properly evaluate event detection systems has the potential to support a wide variety of tasks such as monitoring the activity of socio-political movements, examining media coverage and public support of these movements, and informing policy decisions. Therefore, we study the performance of the best event detection systems on detecting Black Lives Matter (BLM) events from tweets and news articles. The murder of George Floyd, an unarmed Black man, at the hands of police officers received global attention throughout the second half of 2020. Protests against police violence emerged worldwide and the BLM movement, which was once mostly relegated to the United States, was now seeing activity globally. This shared task asks participants to identify BLM related events from large unstructured data sources, using systems pretrained to extract socio-political events from text. We evaluate several metrics, assessing each system’s ability to identify protest events both temporally and spatially. Results show that identifying daily protest counts is an easier task than classifying spatial and temporal protest trends simultaneously, with maximum performance of 0.745 and 0.210 (Pearson r), respectively. Additionally, all baselines and participant systems suffered from low recall, with a maximum recall of 5.08.

pdf bib
Pushing the Right Buttons: Adversarial Evaluation of Quality Estimation
Diptesh Kanojia | Marina Fomicheva | Tharindu Ranasinghe | Frédéric Blain | Constantin Orăsan | Lucia Specia
Proceedings of the Sixth Conference on Machine Translation

Current Machine Translation (MT) systems achieve very good results on a growing variety of language pairs and datasets. However, they are known to produce fluent translation outputs that can contain important meaning errors, thus undermining their reliability in practice. Quality Estimation (QE) is the task of automatically assessing the performance of MT systems at test time. Thus, in order to be useful, QE systems should be able to detect such errors. However, this ability is yet to be tested in the current evaluation practices, where QE systems are assessed only in terms of their correlation with human judgements. In this work, we bridge this gap by proposing a general methodology for adversarial testing of QE for MT. First, we show that despite a high correlation with human judgements achieved by the recent SOTA, certain types of meaning errors are still problematic for QE to detect. Second, we show that on average, the ability of a given model to discriminate between meaning-preserving and meaning-altering perturbations is predictive of its overall performance, thus potentially allowing for comparing QE systems without relying on manual quality annotation.

pdf bib
Comparing Approaches to Dravidian Language Identification
Tommi Jauhiainen | Tharindu Ranasinghe | Marcos Zampieri
Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects

This paper describes the submissions by team HWR to the Dravidian Language Identification (DLI) shared task organized at the VarDial 2021 workshop. The DLI training set includes 16,674 YouTube comments written in Roman script containing code-mixed text with English and one of the three South Dravidian languages: Kannada, Malayalam, and Tamil. We submitted results generated using two models: a Naive Bayes classifier with adaptive language models, which has been shown to obtain competitive performance in many language and dialect identification tasks, and a transformer-based model, which is widely regarded as the state-of-the-art in a number of NLP tasks. Our first submission was sent in the closed submission track using only the training set provided by the shared task organisers, whereas the second submission is considered to be open as it used a pretrained model trained with external data. Our team attained shared second position in the shared task with the submission based on Naive Bayes. Our results reinforce the idea that deep learning methods are not as competitive in language identification related tasks as they are in many other text classification tasks.

pdf bib
Cross-lingual Offensive Language Identification for Low Resource Languages: The Case of Marathi
Saurabh Sampatrao Gaikwad | Tharindu Ranasinghe | Marcos Zampieri | Christopher Homan
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

The widespread presence of offensive language on social media motivated the development of systems capable of recognizing such content automatically. Apart from a few notable exceptions, most research on automatic offensive language identification has dealt with English. To address this shortcoming, we introduce MOLD, the Marathi Offensive Language Dataset. MOLD is the first dataset of its kind compiled for Marathi, thus opening a new domain for research in low-resource Indo-Aryan languages. We present results from several machine learning experiments on this dataset, including zero-shot and other transfer learning experiments on state-of-the-art cross-lingual transformers from existing data in Bengali, English, and Hindi.

pdf bib
Can Multilingual Transformers Fight the COVID-19 Infodemic?
Lasitha Uyangodage | Tharindu Ranasinghe | Hansi Hettiarachchi
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

The massive spread of false information on social media has become a global risk especially in a global pandemic situation like COVID-19. False information detection has thus become a surging research topic in recent months. In recent years, supervised machine learning models have been used to automatically identify false information in social media. However, most of these machine learning models focus only on the language they were trained on. Given the fact that social media platforms are being used in different languages, managing machine learning models for each and every language separately would be chaotic. In this research, we experiment with multilingual models to identify false information in social media by using two recently released multilingual false information detection datasets. We show that multilingual models perform on par with the monolingual models and sometimes even better than the monolingual models to detect false information in social media making them more useful in real-world scenarios.

pdf bib
WLV-RIT at GermEval 2021: Multitask Learning with Transformers to Detect Toxic, Engaging, and Fact-Claiming Comments
Skye Morgan | Tharindu Ranasinghe | Marcos Zampieri
Proceedings of the GermEval 2021 Shared Task on the Identification of Toxic, Engaging, and Fact-Claiming Comments

This paper addresses the identification of toxic, engaging, and fact-claiming comments on social media. We used the dataset made available by the organizers of the GermEval2021 shared task containing over 3,000 manually annotated Facebook comments in German. Considering the relatedness of the three tasks, we approached the problem using large pre-trained transformer models and multitask learning. Our results indicate that multitask learning achieves performance superior to the more common single task learning approach in all three tasks. We submit our best systems to GermEval-2021 under the team name WLV-RIT.

2020

pdf bib
Multilingual Offensive Language Identification with Cross-lingual Embeddings
Tharindu Ranasinghe | Marcos Zampieri
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Offensive content is pervasive in social media and a reason for concern to companies and government organizations. Several studies have been recently published investigating methods to detect the various forms of such content (e.g. hate speech, cyberbullying, and cyberaggression). The clear majority of these studies deal with English, partially because most annotated datasets available contain English data. In this paper, we take advantage of English data available by applying cross-lingual contextual word embeddings and transfer learning to make predictions in languages with fewer resources. We project predictions on comparable data in Bengali, Hindi, and Spanish and we report results of 0.8415 F1 macro for Bengali, 0.8568 F1 macro for Hindi, and 0.7513 F1 macro for Spanish. Finally, we show that our approach compares favorably to the best systems submitted to recent shared tasks on these three languages, confirming the robustness of cross-lingual contextual embeddings and transfer learning for this task.

pdf bib
TransQuest at WMT2020: Sentence-Level Direct Assessment
Tharindu Ranasinghe | Constantin Orasan | Ruslan Mitkov
Proceedings of the Fifth Conference on Machine Translation

This paper presents the team TransQuest’s participation in the Sentence-Level Direct Assessment shared task in WMT 2020. We introduce a simple QE framework based on cross-lingual transformers, and we use it to implement and evaluate two different neural architectures. The proposed methods achieve state-of-the-art results surpassing the results obtained by OpenKiwi, the baseline used in the shared task. We further fine-tune the QE framework by performing ensemble and data augmentation. Our approach is the winning solution in all of the language pairs according to the WMT 2020 official results.

pdf bib
Intelligent Translation Memory Matching and Retrieval with Sentence Encoders
Tharindu Ranasinghe | Constantin Orasan | Ruslan Mitkov
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

Matching and retrieving previously translated segments from the Translation Memory is a key functionality in Translation Memory systems. However, this matching and retrieving process is still limited to algorithms based on edit distance, which we have identified as a major drawback in Translation Memory systems. In this paper, we introduce sentence encoders to improve the matching and retrieving process in Translation Memory systems, an effective and efficient solution to replace edit distance-based algorithms.

pdf bib
TransQuest: Translation Quality Estimation with Cross-lingual Transformers
Tharindu Ranasinghe | Constantin Orasan | Ruslan Mitkov
Proceedings of the 28th International Conference on Computational Linguistics

Recent years have seen big advances in the field of sentence-level quality estimation (QE), largely as a result of using neural-based architectures. However, the majority of these methods work only on the language pair they are trained on and need retraining for new language pairs. This process can prove difficult from a technical point of view and is usually computationally expensive. In this paper we propose a simple QE framework based on cross-lingual transformers, and we use it to implement and evaluate two different neural architectures. Our evaluation shows that the proposed methods achieve state-of-the-art results outperforming current open-source quality estimation frameworks when trained on datasets from WMT. In addition, the framework proves very useful in transfer learning settings, especially when dealing with low-resourced languages, allowing us to obtain very competitive results.

pdf bib
InfoMiner at WNUT-2020 Task 2: Transformer-based Covid-19 Informative Tweet Extraction
Hansi Hettiarachchi | Tharindu Ranasinghe
Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)

Identifying informative tweets is an important step when building information extraction systems based on social media. WNUT-2020 Task 2 was organised to recognise informative tweets from noise tweets. In this paper, we present our approach to tackle the task objective using transformers. Overall, our approach achieves 10th place in the final rankings scoring 0.9004 F1 score for the test set.

pdf bib
Offensive Language Identification in Greek
Zesis Pitenis | Marcos Zampieri | Tharindu Ranasinghe
Proceedings of the Twelfth Language Resources and Evaluation Conference

As offensive language has become a rising issue for online communities and social media platforms, researchers have been investigating ways of coping with abusive content and developing systems to detect its different types: cyberbullying, hate speech, aggression, etc. With a few notable exceptions, most research on this topic so far has dealt with English. This is mostly due to the availability of language resources for English. To address this shortcoming, this paper presents the first Greek annotated dataset for offensive language identification: the Offensive Greek Tweet Dataset (OGTD). OGTD is a manually annotated dataset containing 4,779 posts from Twitter annotated as offensive and not offensive. Along with a detailed description of the dataset, we evaluate several computational models trained and tested on this data.

pdf bib
BRUMS at SemEval-2020 Task 3: Contextualised Embeddings for Predicting the (Graded) Effect of Context in Word Similarity
Hansi Hettiarachchi | Tharindu Ranasinghe
Proceedings of the Fourteenth Workshop on Semantic Evaluation

This paper presents the team BRUMS submission to SemEval-2020 Task 3: Graded Word Similarity in Context. The system utilises state-of-the-art contextualised word embeddings, which have some task-specific adaptations, including stacked embeddings and average embeddings. Overall, the approach achieves good evaluation scores across all the languages, while maintaining simplicity. In the final rankings, our approach is ranked within the top 5 solutions for each language and holds 1st position in Finnish subtask 2.

pdf bib
RGCL at SemEval-2020 Task 6: Neural Approaches to Definition Extraction
Tharindu Ranasinghe | Alistair Plum | Constantin Orasan | Ruslan Mitkov
Proceedings of the Fourteenth Workshop on Semantic Evaluation

This paper presents the RGCL team submission to SemEval 2020 Task 6: DeftEval, subtasks 1 and 2. The system classifies definitions at the sentence and token levels. It utilises state-of-the-art neural network architectures, which have some task-specific adaptations, including an automatically extended training set. Overall, the approach achieves acceptable evaluation scores, while maintaining flexibility in architecture selection.

pdf bib
BRUMS at SemEval-2020 Task 12: Transformer Based Multilingual Offensive Language Identification in Social Media
Tharindu Ranasinghe | Hansi Hettiarachchi
Proceedings of the Fourteenth Workshop on Semantic Evaluation

In this paper, we describe the team BRUMS entry to OffensEval 2: Multilingual Offensive Language Identification in Social Media in SemEval-2020. The OffensEval organizers provided participants with annotated datasets containing posts from social media in Arabic, Danish, English, Greek and Turkish. We present a multilingual deep learning model to identify offensive language in social media. Overall, the approach achieves acceptable evaluation scores, while maintaining flexibility between languages.

2019

pdf bib
RGCL-WLV at SemEval-2019 Task 12: Toponym Detection
Alistair Plum | Tharindu Ranasinghe | Pablo Calleja | Constantin Orăsan | Ruslan Mitkov
Proceedings of the 13th International Workshop on Semantic Evaluation

This article describes the system submitted by the RGCL-WLV team to the SemEval 2019 Task 12: Toponym resolution in scientific papers. The system detects toponyms using a bootstrapped machine learning (ML) approach which classifies names identified using gazetteers extracted from the GeoNames geographical database. The paper evaluates the performance of several ML classifiers, as well as how the gazetteers influence the accuracy of the system. Several runs were submitted. The highest precision achieved for one of the submissions was 89%, albeit at a relatively low recall of 49%.

pdf bib
Emoji Powered Capsule Network to Detect Type and Target of Offensive Posts in Social Media
Hansi Hettiarachchi | Tharindu Ranasinghe
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

This paper describes a novel research approach to detect the type and target of offensive posts in social media using a capsule network. The input to the network was character embeddings combined with emoji embeddings. The approach was evaluated on all three subtasks in Task 6 - SemEval 2019: OffensEval: Identifying and Categorizing Offensive Language in Social Media. The evaluation also showed that even though capsule networks have not been used commonly in natural language processing tasks, they can outperform existing state-of-the-art solutions for offensive language detection in social media.

pdf bib
Toponym Detection in the Bio-Medical Domain: A Hybrid Approach with Deep Learning
Alistair Plum | Tharindu Ranasinghe | Constantin Orasan
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

This paper compares how different machine learning classifiers can be used together with simple string matching and named entity recognition to detect locations in texts. We compare five different state-of-the-art machine learning classifiers in order to predict whether a sentence contains a location or not. Following this classification task, we use a string matching algorithm with a gazetteer to identify the exact index of a toponym within the sentence. We evaluate different approaches in terms of machine learning classifiers, text pre-processing and location extraction on the SemEval-2019 Task 12 dataset, compiled for toponym resolution in the bio-medical domain. Finally, we compare the results with our system that was previously submitted to the SemEval-2019 task evaluation.

pdf bib
Enhancing Unsupervised Sentence Similarity Methods with Deep Contextualised Word Representations
Tharindu Ranasinghe | Constantin Orasan | Ruslan Mitkov
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Calculating Semantic Textual Similarity (STS) plays a significant role in many applications such as question answering, document summarisation, information retrieval and information extraction. All modern state-of-the-art STS methods rely on word embeddings one way or another. The recently introduced contextualised word embeddings have proved more effective than standard word embeddings in many natural language processing tasks. This paper evaluates the impact of several contextualised word embeddings on unsupervised STS methods and compares it with the existing supervised/unsupervised STS methods for different datasets in different languages and different domains.

pdf bib
Semantic Textual Similarity with Siamese Neural Networks
Tharindu Ranasinghe | Constantin Orasan | Ruslan Mitkov
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Calculating the Semantic Textual Similarity (STS) is an important research area in natural language processing which plays a significant role in many applications such as question answering, document summarisation, information retrieval and information extraction. This paper evaluates Siamese recurrent architectures, a special type of neural networks, which are used here to measure STS. Several variants of the architecture are compared with existing methods.