2024
AI ‘News’ Content Farms Are Easy to Make and Hard to Detect: A Case Study in Italian
Giovanni Puccetti | Anna Rogers | Chiara Alzetta | Felice Dell’Orletta | Andrea Esuli
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Models (LLMs) are increasingly used as ‘content farm’ models (CFMs) to generate synthetic text that could pass for real news articles. This is already happening even for languages that do not have high-quality monolingual LLMs. We show that fine-tuning Llama (v1), mostly trained on English, on as little as 40K Italian news articles is sufficient for producing news-like texts that native speakers of Italian struggle to identify as synthetic. We investigate three LLMs and three methods of detecting synthetic texts (log-likelihood, DetectGPT, and supervised classification), finding that they all perform better than human raters, but they are all impractical in the real world (requiring either access to token likelihood information or a large dataset of CFM texts). We also explore the possibility of creating a proxy CFM: an LLM fine-tuned on a dataset similar to the one used by the real ‘content farm’. We find that even a small amount of fine-tuning data suffices for creating a successful detector, but we need to know which base LLM is used, which is a major challenge. Our results suggest that there are currently no practical methods for detecting synthetic news-like texts ‘in the wild’, while generating them is all too easy. We highlight the urgency of more NLP research on this problem.
2019
Linguistically-Driven Strategy for Concept Prerequisites Learning on Italian
Alessio Miaschi | Chiara Alzetta | Franco Alberto Cardillo | Felice Dell’Orletta
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications
We present a new concept prerequisite learning method for Learning Object (LO) ordering that exploits only linguistic features extracted from textual educational resources. The method was tested in cross- and in-domain scenarios for both Italian and English. Additionally, we performed experiments based on an incremental training strategy to study the impact of the training set size on classifier performance. The paper also introduces ITA-PREREQ, to the best of our knowledge the first Italian dataset annotated with prerequisite relations between pairs of educational concepts, and describes the automatic strategy devised to build it.
2018
Universal Dependencies and Quantitative Typological Trends. A Case Study on Word Order
Chiara Alzetta | Felice Dell’Orletta | Simonetta Montemagni | Giulia Venturi
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
Assessing the Impact of Incremental Error Detection and Correction. A Case Study on the Italian Universal Dependency Treebank
Chiara Alzetta | Felice Dell’Orletta | Simonetta Montemagni | Maria Simi | Giulia Venturi
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)
Detection and correction of errors and inconsistencies in “gold treebanks” are becoming increasingly central topics in corpus annotation. The paper illustrates a new incremental method for enhancing treebanks, with particular emphasis on the extension of error patterns across different textual genres and registers. The impact and role of the corrections were assessed in a dependency parsing experiment carried out with four different parsers, with promising results. On both evaluation datasets, parser performance increases in terms of the standard LAS and UAS measures, in terms of a more focused measure that considers only relations involved in error patterns, and at the level of individual dependencies.
2017
Dangerous Relations in Dependency Treebanks
Chiara Alzetta | Felice Dell’Orletta | Simonetta Montemagni | Giulia Venturi
Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories