2018
Automatic Identification of Maghreb Dialects Using a Dictionary-Based Approach
Houda Saâdane
|
Hosni Seffih
|
Christian Fluhr
|
Khalid Choukri
|
Nasredine Semmar
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2017
Une approche linguistique pour la détection des dialectes arabes (A linguistic approach for the detection of Arabic dialects)
Houda Saâdane
|
Damien Nouvel
|
Hosni Seffih
|
Christian Fluhr
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 2 - Articles courts
In this article, we present a process for automatically identifying the dialectal origin of Arabic texts written in Arabic script or in Latin script (Arabizi). We describe the annotation process for the resources we built and the transliteration system we adopted. Two language identification approaches are compared: the first is linguistic and relies on dictionaries; the second is statistical and relies on traditional machine learning methods (n-grams). The evaluation of these approaches shows that the linguistic method gives satisfactory results without depending on training corpora.
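To make the dictionary-based side of this comparison concrete, here is a minimal sketch of how such an identifier could work: each token is looked up in per-dialect word lists and the dialect with the most matches wins. The lexicon entries, the dialect labels and the function name `identify_dialect` are assumptions chosen for illustration; they are not the resources or the system described in the paper.

```python
from collections import Counter

# Toy Arabizi word lists: illustrative entries only, NOT the paper's dictionaries.
DIALECT_LEXICONS = {
    "moroccan": {"bezzaf", "daba", "wakha"},
    "algerian": {"wesh", "bezzaf", "dork"},
    "tunisian": {"barcha", "tawa", "famma"},
}

def identify_dialect(text, lexicons=DIALECT_LEXICONS):
    """Return (best_dialect, scores); best_dialect is None when no word matches."""
    scores = Counter()
    for token in text.lower().split():
        for dialect, words in lexicons.items():
            if token in words:
                # One vote per dictionary hit; shared words vote for every dialect.
                scores[dialect] += 1
    if not scores:
        return None, {}
    best = max(scores, key=scores.get)
    return best, dict(scores)
```

Because the lookup needs only word lists, nothing is trained, which is the property the abstract highlights; a realistic system would also need transliteration between Arabic script and Arabizi before the lookup.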
2012
Using Arabic Transliteration to Improve Word Alignment from French-Arabic Parallel Corpora
Houda Saadane
|
Ouafa Benterki
|
Nasredine Semmar
|
Christian Fluhr
Fourth Workshop on Computational Approaches to Arabic-Script-based Languages
In this paper, we focus on the use of Arabic transliteration to improve the results of a linguistics-based word alignment approach from parallel text corpora. This approach uses, on the one hand, a bilingual lexicon, named entities, cognates and grammatical tags to align single words, and on the other hand, syntactic dependency relations to align compound words. We have evaluated the word aligner integrating Arabic transliteration using two methods: A manual evaluation of the alignment quality and an evaluation of the impact of this alignment on the translation quality by using the Moses statistical machine translation system. The obtained results show that Arabic transliteration improves the quality of both alignment and translation.
2007
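Utilisation d’une approche basée sur la recherche cross-lingue d’information pour l’alignement de phrases à partir de textes bilingues Arabe-Français (Using a cross-language information retrieval approach for sentence alignment from Arabic-French bilingual texts)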
Nasredine Semmar
|
Christian Fluhr
Actes de la 14ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs
Sentence alignment from bilingual texts consists in identifying the sentences that are translations of one another. This article presents a new approach to aligning the sentences of a parallel corpus. The approach is based on cross-language information retrieval: it builds a database of the sentences of the target text and treats each sentence of the source text as a query to that database. Cross-language retrieval uses a linguistic analyzer and a search engine. The linguistic analyzer processes both the documents to be indexed and the queries, and produces a set of normalized lemmas, a set of named entities and a set of compound words with their morpho-syntactic tags. The search engine builds inverted files of the documents from their linguistic analysis and retrieves the relevant documents from their indexes. The sentence aligner was evaluated on an Arabic-French parallel corpus, and the results show that 97% of the sentences were correctly aligned.
Arabic to French Sentence Alignment: Exploration of A Cross-language Information Retrieval Approach
Nasredine Semmar
|
Christian Fluhr
Proceedings of the 2007 Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources
2006
Using Cross-language Information Retrieval for Sentence Alignment
Nasredine Semmar
|
Meriama Laib
|
Christian Fluhr
Proceedings of the International Conference on the Challenge of Arabic for NLP/MT
Cross-language information retrieval consists in providing a query in one language and searching documents in different languages. Retrieved documents are ordered by the probability of being relevant to the user's request with the highest ranked being considered the most relevant document. The LIC2M cross-language information retrieval system is a weighted Boolean search engine based on a deep linguistic analysis of the query and the documents to be indexed. This system, designed to work on Arabic, Chinese, English, French, German and Spanish, is composed of a multilingual linguistic analyzer, a statistical analyzer, a reformulator, a comparator and a search engine. The multilingual linguistic analyzer includes a morphological analyzer, a part-of-speech tagger and a syntactic analyzer. In the case of Arabic, a clitic stemmer is added to the morphological analyzer to segment the input words into proclitics, simple forms and enclitics. The linguistic analyzer processes both documents to be indexed and queries to produce a set of normalized lemmas, a set of named entities and a set of nominal compounds with their morpho-syntactic tags. The statistical analyzer computes for documents to be indexed concept weights based on concept database frequencies. The comparator computes intersections between queries and documents and provides a relevance weight for each intersection. Before this comparison, the reformulator expands queries during the search. The expansion is used to infer from the original query words other words expressing the same concepts. The expansion can be in the same language or in different languages. The search engine retrieves the ranked, relevant documents from the indexes according to the corresponding reformulated query and then merges the results obtained for each language, taking into account the original words of the query and their weights in order to score the documents. 
Sentence alignment consists in estimating which sentence or sentences in the source language correspond with which sentence or sentences in a target language. We present in this paper a new approach to aligning sentences from a parallel corpus based on the LIC2M cross-language information retrieval system. This approach consists in building a database of sentences of the target text and considering each sentence of the source text as a "query" to that database. The aligned bilingual parallel corpora can be used as a translation memory in a computer-aided translation tool.
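The idea of treating each source sentence as a query against an index of target sentences can be sketched as follows. This is a simplified illustration under stated assumptions, not the LIC2M system: the tiny French-to-English lexicon stands in for cross-language query reformulation, plain whitespace tokenization replaces the linguistic analysis, and the names `build_index` and `align` are hypothetical.

```python
from collections import Counter

# Hypothetical toy lexicon standing in for cross-language query reformulation.
LEXICON = {"chat": "cat", "dort": "sleeps", "chien": "dog", "mange": "eats"}

def build_index(target_sentences):
    """Build an inverted index: word -> set of target sentence ids."""
    index = {}
    for sid, sentence in enumerate(target_sentences):
        for word in set(sentence.lower().split()):
            index.setdefault(word, set()).add(sid)
    return index

def align(source_sentences, target_sentences, lexicon=LEXICON):
    """For each source sentence, return the id of the best-matching target sentence."""
    index = build_index(target_sentences)
    alignment = []
    for src in source_sentences:
        hits = Counter()
        for word in src.lower().split():
            translated = lexicon.get(word)  # None when the word is not in the lexicon
            for sid in index.get(translated, ()):
                hits[sid] += 1  # one point per translated word found in the target
        best = hits.most_common(1)
        alignment.append(best[0][0] if best else None)
    return alignment
```

Each source sentence is scored independently here; a fuller aligner would also use sentence order and handle one-to-many correspondences, which the abstract mentions but this sketch does not attempt.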
Using Stemming in Morphological Analysis to Improve Arabic Information Retrieval
Nasredine Semmar
|
Meriama Laib
|
Christian Fluhr
Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs
Information retrieval (IR) consists in finding all relevant documents for a user query in a collection of documents. These documents are ordered by the probability of being relevant to the user’s query. The highest ranked document is considered to be the most likely relevant document. Natural Language Processing (NLP) for IR aims to transform the potentially ambiguous words of queries and documents into unambiguous internal representations on which matching and retrieval can take place. This transformation is generally achieved by several levels of linguistic analysis, morphological, syntactic and so forth. In this paper, we present the Arabic linguistic analyzer used in the LIC2M cross-lingual search engine. We focus on the morphological analyzer and particularly the clitic stemmer which segments the input words into proclitics, simple forms and enclitics. We demonstrate that stemming improves search engine recall and precision.
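The segmentation into proclitics, simple forms and enclitics can be illustrated with a small lexicon-driven sketch. Everything here is an assumption made for illustration: words are written in Buckwalter transliteration, the clitic lists are short samples, the lexicon holds two simple forms, and the function `segment` is hypothetical rather than the LIC2M clitic stemmer.

```python
# Sample clitics in Buckwalter transliteration (illustrative, far from complete).
PROCLITICS = ["", "w", "f", "b", "l", "Al", "wAl", "bAl"]
ENCLITICS = ["", "h", "hA", "hm", "k", "nA"]

# Toy lexicon of simple forms: ktAb (book), byt (house).
LEXICON = {"ktAb", "byt"}

def segment(word, lexicon=LEXICON):
    """Return every (proclitic, simple_form, enclitic) split whose middle part is in the lexicon."""
    results = []
    for pro in PROCLITICS:
        if not word.startswith(pro):
            continue
        rest = word[len(pro):]
        for enc in ENCLITICS:
            if enc and not rest.endswith(enc):
                continue
            stem = rest[:len(rest) - len(enc)]
            if stem in lexicon:
                # Keep the split only when the remaining form is a known word.
                results.append((pro, stem, enc))
    return results
```

For example, `segment("wktAbh")` yields the single split `("w", "ktAb", "h")`. Indexing the simple form rather than the full surface word is what lets a query for one form match documents containing its cliticized variants.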
A Deep Linguistic Analysis for Cross-language Information Retrieval
Nasredine Semmar
|
Meriama Laib
|
Christian Fluhr
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
Cross-language information retrieval consists in providing a query in one language and searching documents in one or different languages. These documents are ordered by the probability of being relevant to the user's request. The highest ranked document is considered to be the most likely relevant document. The LIC2M cross-language information retrieval system is a weighted Boolean search engine based on a deep linguistic analysis of the query and the documents. This system is composed of a linguistic analyzer, a statistical analyzer, a reformulator, a comparator and a search engine. The linguistic analysis processes both documents to be indexed and queries to extract concepts representing their content. This analysis includes morphological analysis, part-of-speech tagging and syntactic analysis. In this paper, we present the deep linguistic analysis used in the LIC2M cross-lingual search engine, focusing in particular on the impact of syntactic analysis on retrieval effectiveness.
Exploiting text for extracting image processing resources
Gregory Grefenstette
|
Fathi Debili
|
Christian Fluhr
|
Svitlana Zinger
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
Much everyday knowledge about physical aspects of objects does not exist as computer data, though such computer-based knowledge will be needed to communicate with next-generation voice-commanded personal robots as well as in other applications involving visual scene recognition. The largest attempt at manually creating common-sense knowledge, the CYC project, has not yet produced the information needed for these tasks. A new direction is needed, based on an automated approach to knowledge extraction. In this article we present our project to mine web text to find properties of objects that are not currently stored in computer-readable form.
2002
Building domain specific lexical hierarchies from corpora
Olivier Ferret
|
Christian Fluhr
|
Françoise Rousseau-Hans
|
Jean-Luc Simoni
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)
2000
Production of NLP-oriented Bilingual Language Resources from Human-oriented dictionaries
Vera Fluhr-Semenova
|
Christian Fluhr
|
Stéphanie Brisson
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)
1982
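Methodes D’apprentissage Pour L’analyse Automatique Morphosyntaxique Et Lexicale-Semantique De La Langue Espagnole (Learning methods for automatic morphosyntactic and lexical-semantic analysis of the Spanish language)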
A. Andreewsky
|
M. Desi
|
C. Fluhr
Coling 1982 Abstracts: Proceedings of the Ninth International Conference on Computational Linguistics Abstracts
1973
Experience De Constitution D’un Programme D’apprentissage Pour Le Traitement Automatique Du Langage (An experiment in building a learning program for automatic language processing)
A. Andreewsky
|
C. Fluhr
COLING 1973 Volume 2: Computational And Mathematical Linguistics: Proceedings of the International Conference on Computational Linguistics
Algorithmes De Generation Automatique Experience De Generation Des Phrases Simples Du Francais (Automatic generation algorithms: an experiment in generating simple French sentences)
A. Andreewsky
|
C. Fluhr
|
J. Rambousek
COLING 1973 Volume 2: Computational And Mathematical Linguistics: Proceedings of the International Conference on Computational Linguistics