2022
Spicy Salmon: Converting between 50+ Annotation Formats with Fintan, Pepper, Salt and Powla
Christian Fäth | Christian Chiarcos
Proceedings of the 8th Workshop on Linked Data in Linguistics within the 13th Language Resources and Evaluation Conference
Heterogeneity of formats, models and annotations has always been a primary hindrance for exploiting the ever-increasing amount of existing linguistic resources for real-world applications in and beyond NLP. Fintan, the Flexible INtegrated Transformation and Annotation eNgineering platform introduced in 2020, is designed to rapidly convert, combine and manipulate language resources both in and outside the Semantic Web by transforming them into segmented RDF representations, which can be processed in parallel in a multithreaded environment, and integrating them with ontologies and taxonomies. Fintan has recently been extended with a set of additional modules that increase the number of supported non-RDF formats and the interoperability with existing non-Java conversion tools; parts of this work are demonstrated in this paper. In particular, we focus on a novel recipe for resource transformation in which Fintan works in tandem with the Pepper toolset to allow computational linguists to transform their data between over 50 linguistic corpus formats with a graphical workflow manager.
Querying a Dozen Corpora and a Thousand Years with Fintan
Christian Chiarcos | Christian Fäth | Maxim Ionov
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Large-scale diachronic corpus studies covering longer time periods are difficult if more than one corpus is to be consulted and, as a result, different formats and annotation schemas need to be processed and queried in a uniform, comparable and replicable manner. We describe the application of the Flexible Integrated Transformation and Annotation eNgineering (Fintan) platform for studying word order in German using syntactically annotated corpora that represent its entire written history. Focusing on nominal dative and accusative arguments, this study hints at two major phases in the development of scrambling in modern German. Against more recent assumptions, it supports the traditional view that word order flexibility decreased over time, but it also indicates that this was a relatively sharp transition in Early New High German. The successful case study demonstrates the potential of Fintan and the underlying LLOD technology for historical linguistics, linguistic typology and corpus linguistics. The technological contribution of this paper is to demonstrate the applicability of Fintan for querying across heterogeneously annotated corpora; previously, it had only been applied to transformation tasks. With its focus on quantitative analysis, Fintan is a natural complement to existing multi-layer technologies that focus on query and exploration.
Unifying Morphology Resources with OntoLex-Morph. A Case Study in German
Christian Chiarcos | Christian Fäth | Maxim Ionov
Proceedings of the Thirteenth Language Resources and Evaluation Conference
The OntoLex vocabulary has become a widely used community standard for machine-readable lexical resources on the web. The primary motivation to use OntoLex rather than tool- or application-specific formalisms is to facilitate interoperability and information integration across different resources. One of its extensions currently under development is a module for representing morphology, OntoLex-Morph. In this paper, we show how OntoLex-Morph can be used for the encoding and integration of different types of morphological resources on a unified basis. With German as the example, we demonstrate it for (a) a full-form dictionary with inflection information (UniMorph), (b) a dictionary of base forms and their derivations (UDer), (c) a dictionary of compounds (from GermaNet), and (d) the lexicon and inflection rules of a finite-state parser/generator (SMOR/Morphisto). These data are converted to OntoLex-Morph, their linguistic information is consolidated, and corresponding lexical entries are linked with each other.
2020
Translation Inference by Concept Propagation
Christian Chiarcos | Niko Schenk | Christian Fäth
Proceedings of the 2020 Globalex Workshop on Linked Lexicography
This paper describes our contribution to the Third Shared Task on Translation Inference across Dictionaries (TIAD-2020). We describe an approach to translation inference based on symbolic methods, namely the propagation of concepts over a graph of interconnected dictionaries: Given a mapping from source language words to lexical concepts (e.g., synsets) as a seed, we use bilingual dictionaries to extrapolate a mapping of pivot and target language words to these lexical concepts. Translation inference is then performed by looking up the lexical concept(s) of a source language word and returning the target language word(s) for which these lexical concepts have the respective highest score. We present two instantiations of this system: one using WordNet synsets as concepts, and one using lexical entries (translations) as concepts. With a threshold of 0, the latter configuration ranks second among participating systems in terms of F1 score. We also describe additional evaluation experiments on Apertium data, a comparison with an earlier approach based on embedding projection, and an approach for constrained projection that outperforms the TIAD-2020 vanilla system by a large margin.
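The propagation idea summarized in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `propagate` and `infer`, the even splitting of a score among a word's translations, and the product used to score candidates are all assumptions made for illustration, and the toy dictionaries are invented.

```python
from collections import defaultdict


def propagate(seed, dictionary):
    """Carry concept scores across one bilingual dictionary.

    seed: {word: {concept: score}} for the dictionary's left-hand language.
    dictionary: iterable of (left_word, right_word) translation pairs.
    Assumption (not from the paper): a word's score is split evenly
    among its translations.
    """
    translations = defaultdict(set)
    for left, right in dictionary:
        translations[left].add(right)

    scores = defaultdict(lambda: defaultdict(float))
    for word, concepts in seed.items():
        targets = translations.get(word, ())
        for target in targets:
            for concept, score in concepts.items():
                scores[target][concept] += score / len(targets)
    return {word: dict(concepts) for word, concepts in scores.items()}


def infer(word, source_concepts, target_concepts, threshold=0.0):
    """Rank target words that share a concept with the source word.

    Assumption: a candidate's score is the best product of source and
    target concept scores; only scores above the threshold are kept.
    """
    ranked = {}
    for concept, source_score in source_concepts.get(word, {}).items():
        for target, concepts in target_concepts.items():
            score = source_score * concepts.get(concept, 0.0)
            if score > threshold:
                ranked[target] = max(ranked.get(target, 0.0), score)
    return sorted(ranked, key=lambda t: (-ranked[t], t))


# Toy data (hypothetical): English seed, Spanish pivot, French target.
seed = {"dog": {"c1": 1.0}, "bank": {"c2": 1.0, "c3": 1.0}}
en_es = [("dog", "perro"), ("bank", "banco"), ("bank", "orilla")]
es_fr = [("perro", "chien"), ("banco", "banque"), ("orilla", "rive")]

pivot = propagate(seed, en_es)
target = propagate(pivot, es_fr)
```

Calling `infer("dog", seed, target)` then looks up the concepts of the English word and ranks the French words that received the same concepts through the Spanish pivot.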
On the Linguistic Linked Open Data Infrastructure
Christian Chiarcos | Bettina Klimek | Christian Fäth | Thierry Declerck | John Philip McCrae
Proceedings of the 1st International Workshop on Language Technology Platforms
In this paper we describe the current state of development of the Linguistic Linked Open Data (LLOD) infrastructure, an LOD (sub-)cloud of linguistic resources, which covers various linguistic databases, lexicons, corpora, terminology and metadata repositories. We give a detailed overview of the contributions made by the European H2020 projects “Prêt-à-LLOD” (‘Ready-to-use Multilingual Linked Language Data for Knowledge Services across Sectors’) and “ELEXIS” (‘European Lexicographic Infrastructure’) to the further development of the LLOD.
Annohub – Annotation Metadata for Linked Data Applications
Frank Abromeit | Christian Fäth | Luis Glaser
Proceedings of the 7th Workshop on Linked Data in Linguistics (LDL-2020)
We introduce a new dataset for the Linguistic Linked Open Data (LLOD) cloud that will provide metadata about annotation and language information harvested from annotated language resources, such as corpora, that are freely available on the internet. To our knowledge, annotation metadata is so far not provided by any metadata provider, e.g., linghub, datahub or CLARIN. On the other hand, language metadata that is found on such portals is rarely provided in machine-readable form, especially as Linked Data. In this paper, we describe the harvesting process, content and structure of the new dataset and its application in the Lin|gu|is|tik portal, a research platform for linguists. In addition, we introduce tools for the conversion of XML-encoded language resources to the CoNLL format. The generated RDF data as well as the XML converter application are made public under an open license.
The ACoLi Dictionary Graph
Christian Chiarcos | Christian Fäth | Maxim Ionov
Proceedings of the Twelfth Language Resources and Evaluation Conference
In this paper, we report the release of the ACoLi Dictionary Graph, a large-scale collection of multilingual open source dictionaries available in two machine-readable formats, a graph representation in RDF, using the OntoLex-Lemon vocabulary, and a simple tabular data format to facilitate their use in NLP tasks, such as translation inference across dictionaries. We describe the mapping and harmonization of the underlying data structures into a unified representation, its serialization in RDF and TSV, and the release of a massive and coherent amount of lexical data under open licenses.
Recent Developments for the Linguistic Linked Open Data Infrastructure
Thierry Declerck | John Philip McCrae | Matthias Hartung | Jorge Gracia | Christian Chiarcos | Elena Montiel-Ponsoda | Philipp Cimiano | Artem Revenko | Roser Saurí | Deirdre Lee | Stefania Racioppa | Jamal Abdul Nasir | Matthias Orlikowski | Marta Lanau-Coronas | Christian Fäth | Mariano Rico | Mohammad Fazleh Elahi | Maria Khvalchik | Meritxell Gonzalez | Katharine Cooney
Proceedings of the Twelfth Language Resources and Evaluation Conference
In this paper we describe the contributions made by the European H2020 project “Prêt-à-LLOD” (‘Ready-to-use Multilingual Linked Language Data for Knowledge Services across Sectors’) to the further development of the Linguistic Linked Open Data (LLOD) infrastructure. Prêt-à-LLOD aims to develop a new methodology for building data value chains applicable to a wide range of sectors and applications and based around language resources and language technologies that can be integrated by means of semantic technologies. We describe the methods implemented for increasing the number of language data sets in the LLOD. We also present the approach for ensuring interoperability and for porting LLOD data sets and services to other infrastructures, as well as the contribution of the projects to existing standards.
Annotation Interoperability for the Post-ISOCat Era
Christian Chiarcos | Christian Fäth | Frank Abromeit
Proceedings of the Twelfth Language Resources and Evaluation Conference
With this paper, we provide an overview of ISOCat successor solutions and annotation standardization efforts since 2010, and we describe the low-cost harmonization of post-ISOCat vocabularies by means of modular, linked ontologies: The CLARIN Concept Registry, LexInfo, Universal Parts of Speech, Universal Dependencies and UniMorph are linked with the Ontologies of Linguistic Annotation and, through it, with ISOCat, the GOLD ontology, the Typological Database Systems ontology and a large number of annotation schemes.
Fintan - Flexible, Integrated Transformation and Annotation eNgineering
Christian Fäth | Christian Chiarcos | Björn Ebbrecht | Maxim Ionov
Proceedings of the Twelfth Language Resources and Evaluation Conference
We introduce the Flexible and Integrated Transformation and Annotation eNgineering (Fintan) platform for converting heterogeneous linguistic resources to RDF. With its modular architecture, workflow management and visualization features, Fintan facilitates the development of complex transformation pipelines by integrating generic RDF converters and augmenting them with extended graph processing capabilities: Existing converters can be easily deployed to the system by means of an ontological data structure which renders their properties and the dependencies between transformation steps. Development of subsequent graph transformation steps for resource transformation, annotation engineering or entity linking is further facilitated by a novel visual rendering of SPARQL queries. A graphical workflow manager allows users to easily manage the converter modules and combine them into new transformation pipelines. Employing the stream-based graph processing approach first implemented with CoNLL-RDF, we address common challenges and scalability issues when transforming resources and showcase the performance of Fintan by means of a purely graph-based transformation of the Universal Morphology data to RDF.
2018
Universal Morphologies for the Caucasus region
Christian Chiarcos | Kathrin Donandt | Maxim Ionov | Monika Rind-Pawlowski | Hasmik Sargsian | Jesse Wichers Schreur | Frank Abromeit | Christian Fäth
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
Analyzing Middle High German Syntax with RDF and SPARQL
Christian Chiarcos | Benjamin Kosmehl | Christian Fäth | Maria Sukhareva
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
Interoperability of Language-related Information: Mapping the BLL Thesaurus to Lexvo and Glottolog
Vanya Dimitrova | Christian Fäth | Christian Chiarcos | Heike Renner-Westermann | Frank Abromeit
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2016
Lin|gu|is|tik: Building the Linguist’s Pathway to Bibliographies, Libraries, Language Resources and Linked Open Data
Christian Chiarcos | Christian Fäth | Heike Renner-Westermann | Frank Abromeit | Vanya Dimitrova
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
This paper introduces a novel research tool for the field of linguistics: The Lin|gu|is|tik web portal provides a virtual library which offers scientific information on every linguistic subject. It comprises selected internet sources and databases as well as catalogues for linguistic literature, and addresses an interdisciplinary audience. The virtual library is the most recent outcome of the Special Subject Collection Linguistics of the German Research Foundation (DFG), and also integrates the knowledge accumulated in the Bibliography of Linguistic Literature. In addition to the portal, we describe long-term goals and prospects with a special focus on ongoing efforts regarding an extension towards integrating language resources and Linguistic Linked Open Data.