2024
Compilation of a Synthetic Judeo-French Corpus
Iglika Nikolova-Stoupak | Gaël Lejeune | Eva Schaeffer-Lacroix
Proceedings of the 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2024)
This short paper describes the derivation of synthetic Judeo-French text. Judeo-French is one of a number of rare languages used in speech and writing by Jewish communities within a particular temporal and geographical frame (in this case, 11th- to 14th-century France). Resources in the language are very limited, and its presence in contemporary Natural Language Processing (NLP) is practically non-existent. This work outlines the compilation of a synthetic Judeo-French corpus. To this end, a pipeline of transformations is applied to Old French text from the same general time period, yielding text that is as reliable as possible in terms of the phonological, morphological and lexical characteristics attested in Judeo-French. Ultimately, the goal is for this synthetic corpus to be used as data augmentation in standard NLP tasks, such as Neural Machine Translation (NMT).
LLM-Generated Contexts to Practice Specialised Vocabulary: Corpus Presentation and Comparison
Iglika Nikolova-Stoupak | Serge Bibauw | Amandine Dumont | Françoise Stas | Patrick Watrin | Thomas François
Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 1 : articles longs et prises de position
This project evaluates the potential of LLMs and dynamic corpora to generate contexts aimed at the practice and acquisition of specialised English vocabulary. We compared reference contexts (handpicked by expert teachers) for a specialised vocabulary list to contexts generated by three recent large language models (LLMs) of different sizes (Mistral-7B-Instruct, Vicuna-13B, and Gemini 1.0 Pro) and to contexts extracted from articles web-crawled from specialised websites. The comparison uses a representative set of length-based, morphosyntactic, semantic, and discourse-related textual characteristics. We conclude that the LLM-based corpora can be combined effectively with a web-crawled one to form an academic corpus characterised by appropriate complexity and textual variety.
2022
Filtering of Noisy Web-Crawled Parallel Corpus: the Japanese-Bulgarian Language Pair
Iglika Nikolova-Stoupak | Shuichiro Shimizu | Chenhui Chu | Sadao Kurohashi
Proceedings of the 5th International Conference on Computational Linguistics in Bulgaria (CLIB 2022)
One of the main challenges within the rapidly developing field of neural machine translation is its application to low-resource languages. Recent attempts to provide large parallel corpora in rare language pairs include the generation of web-crawled corpora, which may be vast but are, unfortunately, excessively noisy. The corpus used to train machine translation models in this study is CCMatrix, provided by OPUS. First, the corpus is cleaned based on a number of heuristic rules. Then, parts of it are selected in three distinct ways: at random, based on the “margin distance” metric that is native to the CCMatrix dataset, and based on scores derived through the application of a state-of-the-art classifier model (Acarcicek et al., 2020) used in a thematic WMT shared task. The performance of the resulting models is evaluated and compared. The classifier-based model does not perform as well as its margin-based counterpart, opening a discussion of ways for further improvement. Still, BLEU scores surpass those of Acarcicek et al.’s (2020) paper by over 15 points.
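The clean-then-select procedure in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not list the heuristic rules, so the ones below (empty sides, exact duplicates, a character-length ratio cap) are assumptions, as are the names `Pair`, `heuristic_clean`, `select_random`, `select_top` and the toy sentence pairs. A classifier-based selection would reuse `select_top` with a trained model's scoring function in place of the margin score.

```python
import random
from typing import Callable, List, NamedTuple


class Pair(NamedTuple):
    src: str       # source-side sentence (e.g. Japanese)
    tgt: str       # target-side sentence (e.g. Bulgarian)
    margin: float  # alignment score assumed to ship with each pair


def heuristic_clean(pairs: List[Pair], max_len_ratio: float = 3.0) -> List[Pair]:
    """Drop empty, duplicate and badly length-mismatched pairs.

    These three rules are illustrative assumptions, not the paper's rules.
    """
    seen = set()
    kept = []
    for p in pairs:
        if not p.src.strip() or not p.tgt.strip():
            continue  # one side is empty
        if (p.src, p.tgt) in seen:
            continue  # exact duplicate
        ratio = max(len(p.src), len(p.tgt)) / min(len(p.src), len(p.tgt))
        if ratio > max_len_ratio:
            continue  # lengths differ too much to be a plausible translation
        seen.add((p.src, p.tgt))
        kept.append(p)
    return kept


def select_random(pairs: List[Pair], k: int, seed: int = 0) -> List[Pair]:
    """Baseline: pick k pairs uniformly at random (seeded for repeatability)."""
    return random.Random(seed).sample(pairs, min(k, len(pairs)))


def select_top(pairs: List[Pair], k: int, score: Callable[[Pair], float]) -> List[Pair]:
    """Pick the k highest-scoring pairs under any scoring function."""
    return sorted(pairs, key=score, reverse=True)[:k]


# Toy data, invented for illustration only.
pairs = [
    Pair("こんにちは", "Здравейте", 1.20),
    Pair("", "празно", 0.90),                                 # empty source
    Pair("こんにちは", "Здравейте", 1.20),                    # duplicate
    Pair("猫", "Това е много дълго изречение за котка", 1.10),  # length mismatch
    Pair("ありがとう", "Благодаря", 1.05),
]

cleaned = heuristic_clean(pairs)
by_margin = select_top(cleaned, 1, score=lambda p: p.margin)
at_random = select_random(cleaned, 1)
```

Keeping the scoring function as a parameter means the margin-based and classifier-based selections differ only in the callable passed to `select_top`, which makes the three strategies directly comparable on the same cleaned corpus.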
2020
A Natural Language for Bulgarian Primary and Secondary Education
Iglika Nikolova-Stoupak
Proceedings of the 4th International Conference on Computational Linguistics in Bulgaria (CLIB 2020)
This paper examines the qualities and applicability of a provisional programming language designed specifically for beginner-level students in Bulgarian primary and secondary schools. The necessity for such a language is investigated. Then, relevant features are defined, as inspired by various programming languages (notably, languages used in education and characterised by non-English syntax) and by general trends related to the achievement of natural language in software development. A survey is conducted to test young students’ interaction with the language, and the language’s advantages and limitations are listed and discussed.