Éric Villemonte De La Clergerie

Also published as: Eric Villemonte de la Clergerie, Eric de La Clergerie, Eric de la Clergerie, Éric Villemonte de La Clergerie, Éric Villemonte de la Clergerie, Éric de La Clergerie, Éric de la Clergerie


2024

CamemBERT-bio: Leveraging Continual Pre-training for Cost-Effective Models on French Biomedical Data
Rian Touchent | Éric de la Clergerie
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Clinical data in hospitals are increasingly accessible for research through clinical data warehouses. However these documents are unstructured and it is therefore necessary to extract information from medical reports to conduct clinical studies. Transfer learning with BERT-like models such as CamemBERT has allowed major advances for French, especially for named entity recognition. However, these models are trained for plain language and are less efficient on biomedical data. Addressing this gap, we introduce CamemBERT-bio, a dedicated French biomedical model derived from a new public French biomedical dataset. Through continual pre-training of the original CamemBERT, CamemBERT-bio achieves an improvement of 2.54 points of F1-score on average across various biomedical named entity recognition tasks, reinforcing the potential of continual pre-training as an equally proficient yet less computationally intensive alternative to training from scratch. Additionally, we highlight the importance of using a standard evaluation protocol that provides a clear view of the current state-of-the-art for French biomedical models.

On the Scaling Laws of Geographical Representation in Language Models
Nathan Godey | Éric de la Clergerie | Benoît Sagot
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Language models have long been shown to embed geographical information in their hidden representations. This line of work has recently been revisited by extending this result to Large Language Models (LLMs). In this paper, we propose to fill the gap between well-established and recent literature by observing how geographical knowledge evolves when scaling language models. We show that geographical knowledge is observable even for tiny models, and that it scales consistently as we increase the model size. Notably, we observe that larger language models cannot mitigate the geographical bias that is inherent to the training data.

2022

MUSS: Multilingual Unsupervised Sentence Simplification by Mining Paraphrases
Louis Martin | Angela Fan | Éric de la Clergerie | Antoine Bordes | Benoît Sagot
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Progress in sentence simplification has been hindered by a lack of labeled parallel simplification data, particularly in languages other than English. We introduce MUSS, a Multilingual Unsupervised Sentence Simplification system that does not require labeled simplification data. MUSS uses a novel approach to sentence simplification that trains strong models using sentence-level paraphrase data instead of proper simplification data. These models leverage unsupervised pretraining and controllable generation mechanisms to flexibly adjust attributes such as length and lexical complexity at inference time. We further present a method to mine such paraphrase data in any language from Common Crawl using semantic sentence embeddings, thus removing the need for labeled data. We evaluate our approach on English, French, and Spanish simplification benchmarks and closely match or outperform the previous best supervised results, despite not using any labeled simplification data. We push the state of the art further by incorporating labeled simplification data.

MANTa: Efficient Gradient-Based Tokenization for End-to-End Robust Language Modeling
Nathan Godey | Roman Castagné | Éric de la Clergerie | Benoît Sagot
Findings of the Association for Computational Linguistics: EMNLP 2022

Static subword tokenization algorithms have been an essential component of recent works on language modeling. However, their static nature results in important flaws that degrade the models’ downstream performance and robustness. In this work, we propose MANTa, a Module for Adaptive Neural TokenizAtion. MANTa is a differentiable tokenizer trained end-to-end with the language model. The resulting system offers a trade-off between the expressiveness of byte-level models and the speed of models trained using subword tokenization. In addition, our tokenizer is highly explainable since it produces an explicit segmentation of sequences into blocks. We evaluate our pre-trained model on several English datasets from different domains as well as on synthetic noise. We find that MANTa improves robustness to character perturbations and out-of-domain data. We then show that MANTa performs comparably to other models on the general-domain GLUE benchmark. Finally, we show that it is considerably faster than strictly byte-level models.

2021

Building A Corporate Corpus For Threads Constitution
Lionel Tadonfouet Tadjou | Fabrice Bourge | Tiphaine Marie | Laurent Romary | Éric de la Clergerie
Proceedings of the Student Research Workshop Associated with RANLP 2021

In this paper we describe the process of building a corporate corpus that will be used as a reference for modelling and computing threads from conversations generated using communication and collaboration tools. The overall goal of the reconstruction of threads is to be able to provide value to the collaborator in various use cases, such as highlighting the important parts of a running discussion, reviewing the upcoming commitments or deadlines, etc. Since, to our knowledge, there is no available corporate corpus for the French language which could allow us to address this problem of thread constitution, we present here a method for building such corpora including different aspects and steps which allowed the creation of a pipeline to pseudo-anonymise data. Such a pipeline is a response to the constraints induced by the General Data Protection Regulation (GDPR) in Europe and the compliance to the secrecy of correspondence.

2020

CamemBERT: a Tasty French Language Model
Louis Martin | Benjamin Muller | Pedro Javier Ortiz Suárez | Yoann Dupont | Laurent Romary | Éric de la Clergerie | Djamé Seddah | Benoît Sagot
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available models have either been trained on English data or on the concatenation of data in multiple languages. This makes practical use of such models –in all languages except English– very limited. In this paper, we investigate the feasibility of training monolingual Transformer-based language models for other languages, taking French as an example and evaluating our language models on part-of-speech tagging, dependency parsing, named entity recognition and natural language inference tasks. We show that the use of web crawled data is preferable to the use of Wikipedia data. More surprisingly, we show that a relatively small web crawled dataset (4GB) leads to results that are as good as those obtained using larger datasets (130+GB). Our best performing model CamemBERT reaches or improves the state of the art in all four downstream tasks.

French Contextualized Word-Embeddings with a sip of CaBeRnet: a New French Balanced Reference Corpus
Murielle Popa-Fabre | Pedro Javier Ortiz Suárez | Benoît Sagot | Éric de la Clergerie
Proceedings of the 8th Workshop on Challenges in the Management of Large Corpora

This paper investigates the impact of different types and sizes of training corpora on language models. By asking the fundamental question of quality versus quantity, we compare four French corpora by pre-training four different ELMos and evaluating them on dependency parsing, POS-tagging and Named Entities Recognition downstream tasks. We present and assess the relevance of a new balanced French corpus, CaBeRnet, that features a representative range of language usage, including a balanced variety of genres (oral transcriptions, newspapers, popular magazines, technical reports, fiction, academic texts), in oral and written styles. We hypothesize that a linguistically representative corpus will allow the language models to be more efficient, and therefore yield better evaluation scores on different evaluation sets and tasks. This paper offers three main contributions: (1) two newly built corpora: (a) CaBeRnet, a French Balanced Reference Corpus and (b) CBT-fr, a domain-specific corpus having both oral and written style in youth literature, (2) five versions of ELMo pre-trained on differently built corpora, and (3) a whole array of computational results on downstream tasks that deepen our understanding of the effects of corpus balance and register in NLP evaluation.

Controllable Sentence Simplification
Louis Martin | Éric de la Clergerie | Benoît Sagot | Antoine Bordes
Proceedings of the Twelfth Language Resources and Evaluation Conference

Text simplification aims at making a text easier to read and understand by simplifying grammar and structure while keeping the underlying information identical. It is often considered an all-purpose generic task where the same simplification is suitable for all; however multiple audiences can benefit from simplified text in different ways. We adapt a discrete parametrization mechanism that provides explicit control on simplification systems based on Sequence-to-Sequence models. As a result, users can condition the simplifications returned by a model on attributes such as length, amount of paraphrasing, lexical complexity and syntactic complexity. We also show that carefully chosen values of these attributes allow out-of-the-box Sequence-to-Sequence models to outperform their standard counterparts on simplification benchmarks. Our model, which we call ACCESS (as shorthand for AudienCe-CEntric Sentence Simplification), establishes the state of the art at 41.87 SARI on the WikiLarge test set, a +1.42 improvement over the best previously reported score.

Les modèles de langue contextuels Camembert pour le français : impact de la taille et de l’hétérogénéité des données d’entrainement (CamemBERT Contextual Language Models for French: Impact of Training Data Size and Heterogeneity)
Louis Martin | Benjamin Muller | Pedro Javier Ortiz Suárez | Yoann Dupont | Laurent Romary | Éric Villemonte de la Clergerie | Benoît Sagot | Djamé Seddah
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 2 : Traitement Automatique des Langues Naturelles

Contextual neural language models are now ubiquitous in natural language processing. Until recently, most available models were trained either on English data or on the concatenation of data in multiple languages. The practical use of these models, in all languages except English, was therefore limited. The recent release of several monolingual BERT-based models (Devlin et al., 2019), notably for French, has demonstrated the value of such models by improving the state of the art on all evaluated tasks. In this paper, based on experiments conducted on CamemBERT (Martin et al., 2019), we show that using highly variable data is preferable to using more uniform data. More surprisingly, we show that a relatively small set of web-crawled data (4 GB) gives results as good as those obtained from datasets two orders of magnitude larger (138 GB).

2019

INRIA at SemEval-2019 Task 9: Suggestion Mining Using SVM with Handcrafted Features
Ilia Markov | Eric Villemonte de la Clergerie
Proceedings of the 13th International Workshop on Semantic Evaluation

We present the INRIA approach to the suggestion mining task at SemEval 2019. The task consists of two subtasks: suggestion mining under single-domain (Subtask A) and cross-domain (Subtask B) settings. We used the Support Vector Machines algorithm trained on handcrafted features, function words, sentiment features, digits, and verbs for Subtask A, and handcrafted features for Subtask B. Our best run achieved an F1-score of 51.18% on Subtask A, and ranked in the top ten of the submissions for Subtask B with a 73.30% F1-score.

Challenges of language change and variation: towards an extended treebank of Medieval French
Mathilde Regnault | Sophie Prévost | Eric Villemonte de la Clergerie
Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019)

2018

Reference-less Quality Estimation of Text Simplification Systems
Louis Martin | Samuel Humeau | Pierre-Emmanuel Mazaré | Éric de La Clergerie | Antoine Bordes | Benoît Sagot
Proceedings of the 1st Workshop on Automatic Text Adaptation (ATA)

ELMoLex: Connecting ELMo and Lexicon Features for Dependency Parsing
Ganesh Jawahar | Benjamin Muller | Amal Fethi | Louis Martin | Éric Villemonte de la Clergerie | Benoît Sagot | Djamé Seddah
Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

In this paper, we present the details of the neural dependency parser and the neural tagger submitted by our team ‘ParisNLP’ to the CoNLL 2018 Shared Task on parsing from raw text to Universal Dependencies. We augment the deep Biaffine (BiAF) parser (Dozat and Manning, 2016) with novel features to perform competitively: we utilize an in-domain version of ELMo features (Peters et al., 2018), which provide context-dependent word representations; we utilize disambiguated, embedded, morphosyntactic features from lexicons (Sagot, 2018), which complement the existing feature set. Henceforth, we call our system ‘ELMoLex’. In addition to incorporating character embeddings, ELMoLex benefits from pre-trained word vectors, ELMo and morphosyntactic features (whenever available) to correctly handle rare or unknown words which are prevalent in languages with complex morphology. ELMoLex ranked 11th by Labeled Attachment Score metric (70.64%), Morphology-aware LAS metric (55.74%) and ranked 9th by Bilexical dependency metric (60.70%).

ANCOR-AS: Enriching the ANCOR Corpus with Syntactic Annotations
Loïc Grobol | Isabelle Tellier | Éric de la Clergerie | Marco Dinarelli | Frédéric Landragin
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Cheating a Parser to Death: Data-driven Cross-Treebank Annotation Transfer
Djamé Seddah | Eric de la Clergerie | Benoît Sagot | Héctor Martínez Alonso | Marie Candito
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Apports des analyses syntaxiques pour la détection automatique de mentions dans un corpus de français oral (Experiences in using deep and shallow parsing to detect entity mentions in oral French)
Loïc Grobol | Isabelle Tellier | Éric de La Clergerie | Marco Dinarelli | Frédéric Landragin
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 2 - Articles courts

This paper presents three mention detection experiments on a corpus of spoken French: ANCOR. These experiments use pre-existing syntactic parsing tools for French and methods drawn from work on coreference, anaphora, and named entity detection. Although these tools are not optimized for processing speech, the quality of the mention detection we obtain is comparable to the state of the art of systems designed for written text in other languages. We conclude by proposing directions for improving our results and for building an end-to-end system for which our experiments can serve as a working basis.

The ParisNLP entry at the ConLL UD Shared Task 2017: A Tale of a #ParsingTragedy
Éric de La Clergerie | Benoît Sagot | Djamé Seddah
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

We present the ParisNLP entry at the UD CoNLL 2017 parsing shared task. In addition to the UDpipe models provided, we built our own data-driven tokenization models, sentence segmenter and lexicon-based morphological analyzers. All of these were used with a range of different parsing models (neural or not, feature-rich or not, transition or graph-based, etc.) and the best combination for each language was selected. Unfortunately, a glitch in the shared task’s Matrix led our model selector to run generic, weakly lexicalized models, tailored for surprise languages, instead of our dataset-specific models. Because of this #ParsingTragedy, we officially ranked 27th, whereas our real models finally unofficially ranked 6th.

2016

Accurate Deep Syntactic Parsing of Graphs: The Case of French
Corentin Ribeyre | Eric Villemonte de la Clergerie | Djamé Seddah
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Parsing predicate-argument structures in a deep syntax framework requires graphs to be predicted. Argument structures represent a higher level of abstraction than the syntactic ones and are thus more difficult to predict even for highly accurate parsing models on surface syntax. In this paper we investigate deep syntax parsing, using a French data set (Ribeyre et al., 2014a). We demonstrate that the use of topologically different types of syntactic features, such as dependencies, tree fragments, spines or syntactic paths, brings a much needed context to the parser. Our higher-order parsing model, thus gaining up to 4 points, establishes the state of the art for parsing French deep syntactic structures.

2015

Because Syntax Does Matter: Improving Predicate-Argument Structures Parsing with Syntactic Features
Corentin Ribeyre | Eric Villemonte de la Clergerie | Djamé Seddah
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2014

Playing with parsers (Jouer avec des analyseurs syntaxiques) [in French]
Éric Villemonte de la Clergerie
Proceedings of TALN 2014 (Volume 1: Long Papers)

Alpage: Transition-based Semantic Graph Parsing with Syntactic Features
Corentin Ribeyre | Eric Villemonte de la Clergerie | Djamé Seddah
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

Deep Syntax Annotation of the Sequoia French Treebank
Marie Candito | Guy Perrier | Bruno Guillaume | Corentin Ribeyre | Karën Fort | Djamé Seddah | Éric de la Clergerie
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We define a deep syntactic representation scheme for French, which abstracts away from surface syntactic variation and diathesis alternations, and describe the annotation of deep syntactic representations on top of the surface dependency trees of the Sequoia corpus. The resulting deep-annotated corpus, named deep-sequoia, is freely available, and hopefully useful for corpus linguistics studies and for training deep analyzers to prepare semantic analysis.

Towards an environment for the production and the validation of lexical semantic resources
Mikaël Morardo | Éric Villemonte de la Clergerie
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present the components of a processing chain for the creation, visualization, and validation of lexical resources (formed of terms and relations between terms). The core of the chain is a component for building lexical networks relying on Harris’ distributional hypothesis applied on the syntactic dependencies produced by the French parser FRMG on large corpora. Another important aspect concerns the use of an online interface for the visualization and collaborative validation of the resulting resources.

2013

Exploring beam-based shift-reduce dependency parsing with DyALog: Results from the SPMRL 2013 shared task
Éric de la Clergerie
Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically-Rich Languages

Overview of the SPMRL 2013 Shared Task: A Cross-Framework Evaluation of Parsing Morphologically Rich Languages
Djamé Seddah | Reut Tsarfaty | Sandra Kübler | Marie Candito | Jinho D. Choi | Richárd Farkas | Jennifer Foster | Iakes Goenaga | Koldo Gojenola Galletebeitia | Yoav Goldberg | Spence Green | Nizar Habash | Marco Kuhlmann | Wolfgang Maier | Joakim Nivre | Adam Przepiórkowski | Ryan Roth | Wolfgang Seeker | Yannick Versley | Veronika Vincze | Marcin Woliński | Alina Wróblewska | Eric Villemonte de la Clergerie
Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically-Rich Languages

Improving a symbolic parser through partially supervised learning
Éric de la Clergerie
Proceedings of the 13th International Conference on Parsing Technologies (IWPT 2013)

2012

Evaluating and improving syntactic lexica by plugging them within a parser
Elsa Tolone | Benoît Sagot | Éric Villemonte de La Clergerie
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present some evaluation results for four French syntactic lexica, obtained through their conversion to the Alexina format used by the Lefff lexicon, and their integration within the large-coverage TAG-based FRMG parser. The evaluations are run on two test corpora, annotated with two distinct annotation formats, namely EASy/Passage chunks and relations and CoNLL dependencies. The information provided by the evaluation results provides valuable feedback about the four lexica. Moreover, when coupled with error mining techniques, they allow us to identify how these lexica might be improved.

Boosting the Coverage of a Semantic Lexicon by Automatically Extracted Event Nominalizations
Kata Gábor | Marianna Apidianaki | Benoît Sagot | Éric Villemonte de La Clergerie
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this article, we present a distributional analysis method for extracting nominalization relations from monolingual corpora. The acquisition method makes use of distributional and morphological information to select nominalization candidates. We explain how the learning is performed on a dependency annotated corpus and describe the nominalization results. Furthermore, we show how these results served to enrich an existing lexical resource, the WOLF (Wordnet Libre du Français). We present the techniques that we developed in order to integrate the new information into WOLF, based on both its structure and content. Finally, we evaluate the validity of the automatically obtained information and the correctness of its integration into the semantic resource. The method proved to be useful for boosting the coverage of WOLF and presents the advantage of filling verbal synsets, which are particularly difficult to handle due to the high level of verbal polysemy.

A linguistically-motivated 2-stage Tree to Graph Transformation
Corentin Ribeyre | Djamé Seddah | Eric Villemonte de la Clergerie
Proceedings of the 11th International Workshop on Tree Adjoining Grammars and Related Formalisms (TAG+11)

2011

Traitement Automatique des Langues, Volume 52, Numéro 1 : Varia [Varia]
Éric Villemonte de La Clergerie | Béatrice Daille | Yves Lepage | François Yvon
Traitement Automatique des Langues, Volume 52, Numéro 1 : Varia [Varia]

2010

Building factorized TAGs with meta-grammars
Éric Villemonte de la Clergerie
Proceedings of the 10th International Workshop on Tree Adjoining Grammar and Related Frameworks (TAG+10)

The Second Evaluation Campaign of PASSAGE on Parsing of French
Patrick Paroubek | Olivier Hamon | Eric de La Clergerie | Cyril Grouin | Anne Vilnat
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

PASSAGE Syntactic Representation: a Minimal Common Ground for Evaluation
Anne Vilnat | Patrick Paroubek | Eric Villemonte de la Clergerie | Gil Francopoulo | Marie-Laure Guénot
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The current PASSAGE syntactic representation is the result of 9 years of constant evolution with the aim of providing a common ground for evaluating parsers of French whatever their type and supporting theory. In this paper we present the latest developments concerning the formalism and show first, through a review of basic linguistic phenomena, that it is a plausible minimal common ground for representing French syntax in the context of generic black box quantitative objective evaluation. For the phenomena reviewed, which include the notion of syntactic head, apposition, control and coordination, we explain how the PASSAGE representation relates to other syntactic representation schemes for French and English, slightly extending the annotation to address English when needed. Second, we describe the XML format chosen for PASSAGE and show that it is compliant with the latest propositions in terms of linguistic annotation standards. We conclude by discussing the influence that corpus-based evaluation has on the characteristics of syntactic representation when seeking to assess the performance of any kind of parser.

Convertir des dérivations TAG en dépendances
Éric Villemonte De La Clergerie
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Syntactic dependency structures are important and well suited as a starting point for various applications. Within the framework of the TAG parser FRMG, we present the details of a process for converting shared derivation forests into shared dependency forests. Some information is provided on a disambiguation algorithm over these dependency forests.

Exploitation de résultats d’analyse syntaxique pour extraction semi-supervisée des chemins de relations
Yayoi Nakamura-Delloye | Éric Villemonte De La Clergerie
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

This paper describes ongoing work on the acquisition of patterns of relations between named entities from syntactic parsing results. Without any predefined pattern, our method provides syntactic paths likely to represent a given relation, starting from a few examples of named entity pairs that hold the relation in question.

2009

Towards Efficient Production of Linguistic Resources: the Victoria Project
Lionel Nicolas | Miguel A. Molinero | Benoît Sagot | Elena Trigo | Éric de La Clergerie | Miguel Alonso Pardo | Jacques Farré | Joan Miquel Vergés
Proceedings of the International Conference RANLP-2009

Proceedings of the 11th International Conference on Parsing Technologies (IWPT’09)
Harry Bunt | Éric Villemonte de la Clergerie
Proceedings of the 11th International Conference on Parsing Technologies (IWPT’09)

Trouver et confondre les coupables : un processus sophistiqué de correction de lexique
Lionel Nicolas | Benoît Sagot | Miguel A. Molinero | Jacques Farré | Éric Villemonte De La Clergerie
Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

The coverage of a syntactic parser depends above all on the grammar and the lexicon on which it relies. Developing a complete and accurate lexicon is an arduous, long-term task, especially once the lexicon reaches a certain level of quality and coverage. In this paper, we present a process capable of automatically detecting missing or incomplete entries in a lexicon and of suggesting corrections for these entries. Detection is performed by means of two techniques relying either on a statistical model or on the information provided by a syntactic tagger. Correction hypotheses for the detected lexical entries are generated by studying the modifications that improve the parse rate of the sentences in which these entries occur. The overall process brings together several techniques using various tools such as taggers, syntactic parsers, and entropy classifiers. Its application to the Lefff, a wide-coverage morphological and syntactic lexicon of French, has already allowed us to make notable improvements.

2008

Large Scale Production of Syntactic Annotations to Move Forward
Anne Vilnat | Gil Francopoulo | Olivier Hamon | Sylvain Loiseau | Patrick Paroubek | Eric Villemonte de la Clergerie
Coling 2008: Proceedings of the workshop on Cross-Framework and Cross-Domain Parser Evaluation

Fouille d’erreurs sur des sorties d’analyseurs syntaxiques [Error Mining on Syntactic Parser Output]
Benoît Sagot | Éric Villemonte de la Clergerie
Traitement Automatique des Langues, Volume 49, Numéro 1 : Varia [Varia]

Computer Aided Correction and Extension of a Syntactic Wide-Coverage Lexicon
Lionel Nicolas | Benoît Sagot | Miguel A. Molinero | Jacques Farré | Éric de la Clergerie
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

PASSAGE: from French Parser Evaluation to Large Sized Treebank
Éric Villemonte de la Clergerie | Olivier Hamon | Djamel Mostefa | Christelle Ayache | Patrick Paroubek | Anne Vilnat
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper we present the PASSAGE project, which aims at automatically building a large French treebank by combining the output of several parsers, using the EASY annotation scheme. We also present the results of the first evaluation campaign of the project and the preliminary results we have obtained with our ROVER procedure for combining parsers automatically.

2007

Confondre le coupable : corrections d’un lexique suggérées par une grammaire
Lionel Nicolas | Jacques Farré | Éric Villemonte De La Clergerie
Actes de la 14ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

The success of parsing a sentence depends on the quality of the underlying grammar but also on that of the lexicon used. A first step in improving lexicons consists in identifying potentially erroneous lexical entries, for example by using error mining techniques on corpora (Sagot & Villemonte de La Clergerie, 2006). Here we explore the next step: suggesting corrections for the identified entries. This goal is achieved by reparsing the sentences rejected in the previous step, after modifying the information carried by the suspected entries. A statistical computation over the new results then makes it possible to highlight the most relevant corrections.

2006

Trouver le coupable : Fouille d’erreurs sur des sorties d’analyseurs syntaxiques
Benoît Sagot | Éric Villemonte De La Clergerie
Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

We present an error mining method for automatically detecting errors in the resources used by syntactic parsing systems. We applied this method to the output of parsing several million words with two different parsing systems that nevertheless share the syntactic lexicon and the pre-syntactic processing chain. We were thus able to identify inaccuracies and gaps in the resources used. In particular, comparing the results obtained on the outputs of the two parsers on the same corpus allowed us to separate the problems stemming from the shared resources from those stemming from the grammars.

pdf bib
The Lefff 2 syntactic lexicon for French: architecture, acquisition, use
Benoît Sagot | Lionel Clément | Éric Villemonte de La Clergerie | Pierre Boullier
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this paper, we introduce a new lexical resource for French which is freely available as the second version of the Lefff (Lexique des formes fléchies du français - Lexicon of French inflected forms). It is a wide-coverage morphosyntactic and syntactic lexicon whose architecture relies on property inheritance, which makes it more compact and more easily maintainable and allows lexical entries to be described independently of the formalisms it is used for. For these two reasons, we define it as a meta-lexicon. We describe its architecture, several automatic or semi-automatic approaches we use to acquire, correct and/or enrich such a lexicon, as well as the way it is used both with an LFG parser and with a TAG parser based on a meta-grammar, so as to build two large-coverage parsers for French. The web site of the Lefff is http://www.lefff.net/.

pdf bib
Error Mining in Parsing Results
Benoît Sagot | Éric de la Clergerie
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

2005

pdf bib
From metagrammars to factorized TAG/TIG parsers
Éric Villemonte de la Clergerie
Proceedings of the Ninth International Workshop on Parsing Technology

pdf bib
Comment obtenir plus des Méta-Grammaires
François Thomasset | Éric Villemonte De La Clergerie
Actes de la 12ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

This article presents a development environment for meta-grammars (MG), used to quickly design a wide-coverage yet very compact tree adjoining grammar (TAG) of French, thanks to tree factorizations. Exploiting the features provided by the DYALOG system, this grammar made it possible to build a hybrid TAG/TIG parser used in the EASY parsing evaluation campaign.

pdf bib
Chaînes de traitement syntaxique
Pierre Boullier | Lionel Clément | Benoît Sagot | Éric Villemonte De La Clergerie
Actes de la 12ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

This article describes the full set of tools we deployed for the EASy parsing evaluation campaign. We begin with an overview of the morphological and syntactic lexicon used. We then briefly describe the properties of our pre-syntactic processing chain, which can handle unrestricted corpora. Next we present the two parsing systems we used, a TAG parser derived from a meta-grammar and an LFG parser. We compare these two systems, pointing out what they have in common, such as the intensive use of computation sharing and compact representations of information, but also their differences in terms of formalisms, grammars and parsers. We then describe the post-processing step, which allowed us to extract from our analyses the information required by the EASy campaign. We conclude with a quantitative evaluation of our architectures.

2004

pdf bib
Towards an International Standard on Feature Structure Representation
Kiyong Lee | Lou Burnard | Laurent Romary | Eric de la Clergerie | Thierry Declerck | Syd Bauman | Harry Bunt | Lionel Clément | Tomaž Erjavec | Azim Roussanaly | Claude Roux
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

2003

pdf bib
International Standard for a Linguistic Annotation Framework
Nancy Ide | Laurent Romary | Eric de la Clergerie
Proceedings of the HLT-NAACL 2003 Workshop on Software Engineering and Architecture of Language Technology Systems (SEALTS)

2002

pdf bib
Parsing MCS languages with Thread Automata
Éric Villemonte de la Clergerie
Proceedings of the Sixth International Workshop on Tree Adjoining Grammar and Related Frameworks (TAG+6)

pdf bib
Construire des analyseurs avec DyALog
Éric Villemonte De La Clergerie
Actes de la 9ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

This article surveys the features offered by the DyALog system for building tabular parsers. Offering the richness of a logic programming environment, DyALog makes grammar writing easier, covers several formalisms and allows parsing strategies to be parameterized.

pdf bib
Parsing Mildly Context-Sensitive Languages with Thread Automata
Éric Villemonte de la Clergerie
COLING 2002: The 19th International Conference on Computational Linguistics

2001

pdf bib
Guided Parsing of Range Concatenation Languages
François Barthélemy | Pierre Boullier | Philippe Deschamp | Éric Villemonte de la Clergerie
Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics

bib
Refining Tabular Parsers for TAGs
Éric Villemonte de la Clergerie
Second Meeting of the North American Chapter of the Association for Computational Linguistics

pdf bib
Atelier ATOLL pour les grammaires d’arbres adjoints
François Barthélemy | Pierre Boullier | Philippe Deschamp | Linda Kaouane | Éric Villemonte De La Clergerie
Actes de la 8ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

This article presents the working environment we are developing within the ATOLL team for tree adjoining grammars. This environment comprises several tools and resources based on the XML markup language. This language facilitates the formatting and exchange of linguistic resources.

pdf bib
Tools and resources for Tree Adjoining Grammars
François Barthélemy | Pierre Boullier | Philippe Deschamp | Linda Kaouane | Éric Villemonte de la Clergerie
Proceedings of the ACL 2001 Workshop on Sharing Tools and Resources

2000

pdf bib
New Tabular Algorithms for Parsing
Miguel A. Alonso | Jorge Graña | Manuel Vilares | Eric de la Clergerie
Proceedings of the Sixth International Workshop on Parsing Technologies

We develop a set of new tabular parsing algorithms for Linear Indexed Grammars, including bottom-up algorithms and Earley-like algorithms with and without the valid prefix property, creating a continuum in which one algorithm can in turn be derived from another. The output of these algorithms is a shared forest in the form of a context-free grammar that encodes all possible derivations for a given input string.

pdf bib
A redefinition of Embedded Push-Down Automata
Miguel A. Alonso | Éric Villemonte de la Clergerie | Manuel Vilares
Proceedings of the Fifth International Workshop on Tree Adjoining Grammar and Related Frameworks (TAG+5)

pdf bib
Practical aspects in compiling tabular TAG parsers
Miguel A. Alonso | Djamé Seddah | Éric Villemonte de la Clergerie
Proceedings of the Fifth International Workshop on Tree Adjoining Grammar and Related Frameworks (TAG+5)

1999

pdf bib
Tabular Algorithms for TAG Parsing
Miguel A. Alonso | David Cabrero | Eric de la Clergerie | Manuel Vilares
Ninth Conference of the European Chapter of the Association for Computational Linguistics

1998

pdf bib
A tabular interpretation of a class of 2-Stack Automata
Eric Villemonte de la Clergerie | Miguel Alonso Pardo
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 2

pdf bib
A tabular interpretation of a class of 2-Stack Automata
Eric Villemonte de la Clergerie | Miguel Alonso Pardo
COLING 1998 Volume 2: The 17th International Conference on Computational Linguistics

pdf bib
A tabular interpretation of bottom-up automata for TAG
Eric de la Clergerie | Miguel A. Alonso Pardo | David Cabrero Souto
Proceedings of the Fourth International Workshop on Tree Adjoining Grammars and Related Frameworks (TAG+4)
