Mario Mina


2024

pdf bib
A CURATEd CATalog: Rethinking the Extraction of Pretraining Corpora for Mid-Resourced Languages
Jorge Palomar-Giner | Jose Javier Saiz | Ferran Espuña | Mario Mina | Severino Da Dalt | Joan Llop | Malte Ostendorff | Pedro Ortiz Suarez | Georg Rehm | Aitor Gonzalez-Agirre | Marta Villegas
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We present and describe two language resources in this paper: CATalog 1.0, the largest text corpus in Catalan to date, and CURATE (Corpus Utility for RAting TExt), a modular, parallelizable pipeline used for processing and scoring documents based on text quality that we have optimised to run in High Performance Cluster (HPC) environments. In the coming sections we describe our data preprocessing pipeline at length; traditional pipelines usually implement a set of binary filters such that a given document is either in or out. In our experience with Catalan, in lower-resource settings it is more practical to instead assign a document a soft score to allow for more flexible decision-making. We describe how the document score is calculated and highlight its interpretability by showing that it is significantly correlated with human judgements as obtained from a comparative judgement experiment. We additionally describe the different subcorpora that make up CATalog 1.0.

pdf bib
Extending Off-the-shelf NER Systems to Personal Information Detection in Dialogues with a Virtual Agent: Findings from a Real-Life Use Case
Mario Mina | Carlos Rodríguez | Aitor Gonzalez-Agirre | Marta Villegas
Proceedings of the Workshop on Computational Approaches to Language Data Pseudonymization (CALD-pseudo 2024)

We present the findings and results of our pseudonymisation system, which has been developed for a real-life use-case involving users and an informative chatbot in the context of the COVID-19 pandemic. Message exchanges between the two involve the former group providing information about themselves and their residential area, which could easily allow for their re-identification. We create a modular pipeline to detect PIIs and perform basic deidentification such that the data can be stored while mitigating any privacy concerns. The use-case presents several challenging aspects, the most difficult of which is the logistic challenge of not being able to directly view or access the data due to the very privacy issues we aim to resolve. Nevertheless, our system achieves a high recall of 0.99, correctly identifying almost all instances of personal data. However, this comes at the expense of precision, which only reaches 0.64. We describe the sensitive information identification in detail, explaining the design principles behind our decisions. We additionally highlight the particular challenges we’ve encountered.

2019

pdf bib
Saarland at MRP 2019: Compositional parsing across all graphbanks
Lucia Donatelli | Meaghan Fowlie | Jonas Groschwitz | Alexander Koller | Matthias Lindemann | Mario Mina | Pia Weißenhorn
Proceedings of the Shared Task on Cross-Framework Meaning Representation Parsing at the 2019 Conference on Natural Language Learning

We describe the Saarland University submission to the shared task on Cross-Framework Meaning Representation Parsing (MRP) at the 2019 Conference on Computational Natural Language Learning (CoNLL).