2024
Teanga Data Model for Linked Corpora
John P. McCrae | Priya Rani | Adrian Doyle | Bernardo Stearns
Proceedings of the 9th Workshop on Linked Data in Linguistics @ LREC-COLING 2024
Corpus data is the main source of data for natural language processing applications; however, no standard or model for corpus data has become predominant in the field. Linguistic linked data aims to provide methods by which data can be made findable, accessible, interoperable and reusable (FAIR). However, current attempts to create a linked data format for corpora have been unsuccessful due to the verbose and specialised formats that they use. In this work, we present the Teanga data model, which uses a layered annotation model to capture all NLP-relevant annotations. We present the YAML serialization of the model, which is concise and uses a widely-deployed format, and we describe how this can be interpreted as RDF. Finally, we demonstrate three examples of the use of the Teanga data model for syntactic annotation, literary analysis and multilingual corpora.
2023
The Cardamom Workbench for Historical and Under-Resourced Languages
Adrian Doyle | Theodorus Fransen | Bernardo Stearns | John P. McCrae | Oksana Dereza | Priya Rani
Proceedings of the 4th Conference on Language, Data and Knowledge
A new learner language data set for the study of English for Specific Purposes at university
Cyriel Mallart | Nicolas Ballier | Jen-Yu Li | Andrew Simpkin | Bernardo Stearns | Rémi Venant | Thomas Gaillat
Proceedings of the 4th Conference on Language, Data and Knowledge
Exploring a New Grammatico-functional Type of Measure as Part of a Language Learning Expert System
Cyriel Mallart | Andrew Simpkin | Rémi Venant | Nicolas Ballier | Bernardo Stearns | Jen Yu Li | Thomas Gaillat
Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)
This paper explores the use of L2-specific grammatical microsystems as elements of the domain knowledge of an Intelligent Computer-assisted Language Learning (ICALL) system. We report on the design of new grammatico-functional measures and their association with proficiency. We illustrate the approach with the design of the IT, THIS, THAT proform microsystem. The measures rely on the paradigmatic relations between words of the same linguistic functions. They are operationalised with one frequency-based and two probabilistic methods, i.e., the relative proportions of the forms and their likelihood of occurrence. Ordinal regression models show that the measures are significant in terms of association with CEFR levels, paving the way for their introduction in a specific proform microsystem expert model.
Findings of the SIGTYP 2023 Shared task on Cognate and Derivative Detection For Low-Resourced Languages
Priya Rani | Koustava Goswami | Adrian Doyle | Theodorus Fransen | Bernardo Stearns | John P. McCrae
Proceedings of the 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP
This paper describes the structure and findings of the SIGTYP 2023 shared task on cognate and derivative detection for low-resourced languages, broken down into a supervised and an unsupervised sub-task. Participants were asked to submit their final predictions on the test data. A total of nine teams registered for the shared task, of which seven registered for both sub-tasks. Only two participants ended up submitting system descriptions, with only one submitting systems for both sub-tasks. While all systems show rather promising performance, all remain within the baseline score for the supervised sub-task. However, the system submitted for the unsupervised sub-task outperforms the baseline score.
2022
Towards Classification of Legal Pharmaceutical Text using GAN-BERT
Tapan Auti | Rajdeep Sarkar | Bernardo Stearns | Atul Kr. Ojha | Arindam Paul | Michaela Comerford | Jay Megaro | John Mariano | Vall Herard | John P. McCrae
Proceedings of the First Computing Social Responsibility Workshop within the 13th Language Resources and Evaluation Conference
Pharmaceutical text classification is an important area of research for commercial and research institutions working in the pharmaceutical domain. Addressing this task is challenging due to the need for expert-verified labelled data, which can be expensive and time-consuming to obtain. Towards this end, we leverage predictive coding methods for the task as they have been shown to generalise well for sentence classification. Specifically, we utilise the GAN-BERT architecture to classify pharmaceutical texts. To capture the domain specificity, we propose to utilise the BioBERT model as our BERT model in the GAN-BERT framework. We conduct extensive evaluation to show the efficacy of our approach over baselines on multiple metrics.
2020
Un prototype en ligne pour la prédiction du niveau de compétence en anglais des productions écrites (A prototype for web-based prediction of English proficiency levels in writings)
Thomas Gaillat | Nicolas Ballier | Annanda Sousa | Manon Bouyé | Andrew Simpkin | Bernardo Stearns | Manel Zarrouk
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 4 : Démonstrations et résumés d'articles internationaux
This article describes a prototype focused on predicting the proficiency level of learners of English. The system relies on a supervised learning model coupled with a web interface.
From Linguistic Research Projects to Language Technology Platforms: A Case Study in Learner Data
Annanda Sousa | Nicolas Ballier | Thomas Gaillat | Bernardo Stearns | Manel Zarrouk | Andrew Simpkin | Manon Bouyé
Proceedings of the 1st International Workshop on Language Technology Platforms
This paper describes the workflow and architecture adopted by a linguistic research project. We report our experience and present the research outputs turned into resources that we wish to share with the community. We discuss the current limitations and the next steps that could be taken for the scaling and development of our research project. Allying NLP and language-centric AI, we discuss similar projects and possible ways to start collaborating towards potential platform interoperability.
2019
Multilingual Multimodal Machine Translation for Dravidian Languages utilizing Phonetic Transcription
Bharathi Raja Chakravarthi | Ruba Priyadharshini | Bernardo Stearns | Arun Jayapal | Sridevy S | Mihael Arcan | Manel Zarrouk | John P McCrae
Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages
2018
Implicit and Explicit Aspect Extraction in Financial Microblogs
Thomas Gaillat | Bernardo Stearns | Gopal Sridhar | Ross McDermott | Manel Zarrouk | Brian Davis
Proceedings of the First Workshop on Economics and Natural Language Processing
This paper focuses on aspect extraction, which is a sub-task of Aspect-based Sentiment Analysis. The goal is to report an extraction method of financial aspects in microblog messages. Our approach uses a stock-investment taxonomy for the identification of explicit and implicit aspects. We compare supervised and unsupervised methods to assign predefined categories at message level. Results on 7 aspect classes show 0.71 accuracy, while the 32-class classification gives 0.82 accuracy for messages containing explicit aspects and 0.35 for implicit aspects.