2024
Proceedings of the First Workshop on Holocaust Testimonies as Language Resources (HTRes) @ LREC-COLING 2024
Isuri Anuradha, Martin Wynne, Francesca Frontini, Alistair Plum
Guided Distant Supervision for Multilingual Relation Extraction Data: Adapting to a New Language
Alistair Plum, Tharindu Ranasinghe, Christoph Purschke
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Relation extraction is essential for extracting and understanding biographical information in the context of digital humanities and related subjects. There is growing interest in the community in building datasets capable of training machine learning models to extract relationships. However, annotating such datasets can be expensive and time-consuming, and the available datasets are limited to English. This paper applies guided distant supervision to create a large biographical relationship extraction dataset for German. Our dataset, composed of more than 80,000 instances for nine relationship types, is the largest biographical German relationship extraction dataset. We also create a manually annotated dataset with 2,000 instances to evaluate the models and release it together with the dataset compiled using guided distant supervision. We train several state-of-the-art machine learning models on the automatically created dataset and release them as well. Furthermore, we run multilingual and cross-lingual zero-shot experiments that could benefit many low-resource languages.
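As a rough illustration of the labelling idea behind distant supervision, the sketch below pairs sentences with knowledge-base triples whose two entities both appear in the text. It shows only the plain, unguided heuristic and is not the paper's guided pipeline. The function name `label_sentences`, the relation name `birthplace`, and the example data are hypothetical.

```python
def label_sentences(sentences, triples):
    """Pair each sentence with every KB triple whose two entities it mentions.

    Plain distant supervision: if both entities of a known triple occur in a
    sentence, assume the sentence expresses that relation. This assumption is
    noisy, which is why guided variants add further filtering.
    """
    instances = []
    for sentence in sentences:
        for head, relation, tail in triples:
            if head in sentence and tail in sentence:
                instances.append(
                    {"text": sentence, "head": head, "tail": tail, "relation": relation}
                )
    return instances


# Hypothetical knowledge-base triple and German example sentences.
triples = [("Albert Einstein", "birthplace", "Ulm")]
sentences = [
    "Albert Einstein wurde in Ulm geboren.",
    "Ulm liegt an der Donau.",
]

instances = label_sentences(sentences, triples)
```

Only the first sentence mentions both entities, so it is the only one labelled. A sentence that mentions both entities without expressing the relation would be labelled too, which is the kind of noise the guidance step is meant to reduce.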
2023
Publish or Hold? Automatic Comment Moderation in Luxembourgish News Articles
Tharindu Ranasinghe, Alistair Plum, Christoph Purschke, Marcos Zampieri
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing
Recently, the internet has emerged as the primary platform for accessing news. On the majority of these news platforms, users now have the ability to post comments on news articles and engage in discussions on various social media. While these features promote healthy conversations among users, they also serve as a breeding ground for spreading fake news, toxic discussions and hate speech. Moderating or removing such content is paramount to avoid unwanted consequences for the readers. However, apart from a few notable exceptions, most research on automatic moderation of news article comments has dealt with English and other high-resource languages. This leaves under-represented or low-resource languages at a loss. Addressing this gap, we perform the first large-scale qualitative analysis of more than one million Luxembourgish comments posted over the course of 14 years. We evaluate the performance of state-of-the-art transformer models in Luxembourgish news article comment moderation. Furthermore, we analyse how the language of Luxembourgish news article comments has changed over time. We observe that machine learning models trained on old comments do not perform well on recent data. The findings in this work will be beneficial in building news comment moderation systems for many low-resource languages.
2021
Text Preprocessing and its Implications in a Digital Humanities Project
Maria Kunilovskaya, Alistair Plum
Proceedings of the Student Research Workshop Associated with RANLP 2021
This paper focuses on data cleaning as part of a preprocessing procedure applied to text data retrieved from the web. Although the importance of this early stage in a project using NLP methods is often highlighted by researchers, the details, general principles and techniques are usually left out due to space considerations. At best, they are dismissed with a comment such as “The usual data cleaning and preprocessing procedures were applied”. More coverage is usually given to automatic text annotation such as lemmatisation, part-of-speech tagging and parsing, which is often included in preprocessing. In the literature, the term ‘preprocessing’ is used to refer to a wide range of procedures, from filtering and cleaning to data transformation such as stemming and numeric representation, which might create confusion. We argue that text preprocessing might skew the original data distribution with regard to the metadata, such as types, locations and times of registered datapoints. In this paper we describe a systematic approach to cleaning text data mined by a data-providing company for a Digital Humanities (DH) project focused on cultural analytics. We reveal the types and amount of noise in the data coming from various web sources and estimate the changes in the size of the data associated with preprocessing. We also compare the results of a text classification experiment run on the raw and preprocessed data. We hope that our experience and approaches will help the DH community to diagnose the quality of textual data collected from the web and prepare it for further natural language processing.
2020
RGCL at SemEval-2020 Task 6: Neural Approaches to Definition Extraction
Tharindu Ranasinghe, Alistair Plum, Constantin Orasan, Ruslan Mitkov
Proceedings of the Fourteenth Workshop on Semantic Evaluation
This paper presents the RGCL team submission to SemEval 2020 Task 6: DeftEval, subtasks 1 and 2. The system classifies definitions at the sentence and token levels. It utilises state-of-the-art neural network architectures, which have some task-specific adaptations, including an automatically extended training set. Overall, the approach achieves acceptable evaluation scores, while maintaining flexibility in architecture selection.
2019
RGCL-WLV at SemEval-2019 Task 12: Toponym Detection
Alistair Plum, Tharindu Ranasinghe, Pablo Calleja, Constantin Orăsan, Ruslan Mitkov
Proceedings of the 13th International Workshop on Semantic Evaluation
This article describes the system submitted by the RGCL-WLV team to the SemEval 2019 Task 12: Toponym resolution in scientific papers. The system detects toponyms using a bootstrapped machine learning (ML) approach which classifies names identified using gazetteers extracted from the GeoNames geographical database. The paper evaluates the performance of several ML classifiers, as well as how the gazetteers influence the accuracy of the system. Several runs were submitted. The highest precision achieved for one of the submissions was 89%, albeit at a relatively low recall of 49%.
Toponym Detection in the Bio-Medical Domain: A Hybrid Approach with Deep Learning
Alistair Plum, Tharindu Ranasinghe, Constantin Orasan
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)
This paper compares how different machine learning classifiers can be used together with simple string matching and named entity recognition to detect locations in texts. We compare five different state-of-the-art machine learning classifiers in order to predict whether a sentence contains a location or not. Following this classification task, we use a string matching algorithm with a gazetteer to identify the exact index of a toponym within the sentence. We evaluate different approaches in terms of machine learning classifiers, text pre-processing and location extraction on the SemEval-2019 Task 12 dataset, compiled for toponym resolution in the bio-medical domain. Finally, we compare the results with our system that was previously submitted to the SemEval-2019 task evaluation.
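The second stage described above, locating the exact index of a toponym with a gazetteer, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `find_toponyms` and the tiny gazetteer are hypothetical, and the sentence-level classifier from the first stage is assumed to have already flagged the sentence as containing a location.

```python
import re


def find_toponyms(sentence, gazetteer):
    """Return sorted (start, end, name) character spans of gazetteer entries."""
    spans = []
    # Try longer names first so that "New York" wins over the nested "York".
    for name in sorted(gazetteer, key=len, reverse=True):
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, sentence):
            start, end = match.span()
            # Keep the match only if it does not overlap an accepted span.
            if all(end <= s or start >= e for s, e, _ in spans):
                spans.append((start, end, name))
    return sorted(spans)


# Hypothetical gazetteer and a sentence assumed to have passed the classifier.
gazetteer = {"York", "New York"}
sentence = "Samples were collected in New York and in York."
spans = find_toponyms(sentence, gazetteer)
```

Matching longest names first and rejecting overlapping spans is one simple way to handle nested gazetteer entries. A real gazetteer such as GeoNames is far larger and would likely call for a trie or similar index rather than one regular expression per entry.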