Stefan Fischer


2024

pdf bib
Multi-word Expressions in English Scientific Writing
Diego Alves | Stefan Fischer | Stefania Degaetano-Ortlieb | Elke Teich
Proceedings of the 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2024)

Multi-Word Expressions (MWEs) play a pivotal role in language use overall and in register formation more specifically, e.g. encoding field-specific terminology. Our study focuses on the identification and categorization of MWEs used in scientific writing, considering their formal characteristics as well as their developmental trajectory over time from the mid-17th century to the present. For this, we develop an approach combining three different types of methods to identify MWEs (Universal Dependency annotation, Partitioner and the Academic Formulas List) and selected measures to characterize MWE properties (e.g., dispersion by Kullback-Leibler Divergence and several association measures). This allows us to inspect MWEs types in a novel data-driven way regarding their functions and change over time in specialized discourse.

pdf bib
A Contemporary News Corpus of Ukrainian (CNC-UA): Compilation, Annotation, Publication
Stefan Fischer | Kateryna Haidarzhyi | Jörg Knappen | Olha Polishchuk | Yuliya Stodolinska | Elke Teich
Proceedings of the Third Ukrainian Natural Language Processing Workshop (UNLP) @ LREC-COLING 2024

We present a corpus of contemporary Ukrainian news articles published between 2019 and 2022 on the news website of the national public broadcaster of Ukraine, commonly known as SUSPILNE. The current release comprises 87 210 364 words in 292 955 texts. Texts are annotated with titles and their time of publication. In addition, the corpus has been linguistically annotated at the token level with a dependency parser. To provide further aspects for investigation, a topic model was trained on the corpus. The corpus is hosted (Fischer et al., 2023) at the Saarbrücken CLARIN center under a CC BY-NC-ND 4.0 license and available in two tab-separated formats: CoNLL-U (de Marneffe et al., 2021) and vertical text format (VRT) as used by the IMS Open Corpus Workbench (CWB; Evert and Hardie, 2011) and CQPweb (Hardie, 2012). We show examples of using the CQPweb interface, which allows to extract the quantitative data necessary for distributional and collocation analyses of the CNC-UA. As the CNC-UA contains news texts documenting recent events, it is highly relevant not only for linguistic analyses of the modern Ukrainian language but also for socio-cultural and political studies.

pdf bib
Modeling Diachronic Change in English Scientific Writing over 300+ Years with Transformer-based Language Model Surprisal
Julius Steuer | Marie-Pauline Krielke | Stefan Fischer | Stefania Degaetano-Ortlieb | Marius Mosbach | Dietrich Klakow
Proceedings of the 17th Workshop on Building and Using Comparable Corpora (BUCC) @ LREC-COLING 2024

2022

pdf bib
EPIC UdS - Creation and Applications of a Simultaneous Interpreting Corpus
Heike Przybyl | Ekaterina Lapshinova-Koltunski | Katrin Menzel | Stefan Fischer | Elke Teich
Proceedings of the Thirteenth Language Resources and Evaluation Conference

In this paper, we describe the creation and annotation of EPIC UdS, a multilingual corpus of simultaneous interpreting for English, German and Spanish. We give an overview of the comparable and parallel, aligned corpus variants and explore various applications of the corpus. What makes EPIC UdS relevant is that it is one of the rare interpreting corpora that includes transcripts suitable for research on more than one language pair and on interpreting with regard to German. It not only contains transcribed speeches, but also rich metadata and fine-grained linguistic annotations tailored for diverse applications across a broad range of linguistic subfields.

2020

pdf bib
The Royal Society Corpus 6.0: Providing 300+ Years of Scientific Writing for Humanistic Study
Stefan Fischer | Jörg Knappen | Katrin Menzel | Elke Teich
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present a new, extended version of the Royal Society Corpus (RSC), a diachronic corpus of scientific English now covering 300+ years of scientific writing (1665–1996). The corpus comprises 47 837 texts, primarily scientific articles, and is based on publications of the Royal Society of London, mainly its Philosophical Transactions and Proceedings. The corpus has been built on the basis of the FAIR principles and is freely available under a Creative Commons license, excluding copy-righted parts. We provide information on how the corpus can be found, the file formats available for download as well as accessibility via a web-based corpus query platform. We show a number of analytic tools that we have implemented for better usability and provide an example of use of the corpus for linguistic analysis as well as examples of subsequent, external uses of earlier releases. We place the RSC against the background of existing English diachronic/scientific corpora, elaborating on its value for linguistic and humanistic study.

2017

pdf bib
The Making of the Royal Society Corpus
Jörg Knappen | Stefan Fischer | Hannah Kermes | Elke Teich | Peter Fankhauser
Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language

2016

pdf bib
Compasses, Magnets, Water Microscopes: Annotation of Terminology in a Diachronic Corpus of Scientific Texts
Anne-Kathrin Schumann | Stefan Fischer
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The specialised lexicon belongs to the most prominent attributes of specialised writing: Terms function as semantically dense encodings of specialised concepts, which, in the absence of terms, would require lengthy explanations and descriptions. In this paper, we argue that terms are the result of diachronic processes on both the semantic and the morpho-syntactic level. Very little is known about these processes. We therefore present a corpus annotation project aiming at revealing how terms are coined and how they evolve to fit their function as semantically and morpho-syntactically dense encodings of specialised knowledge. The scope of this paper is two-fold: Firstly, we outline our methodology for annotating terminology in a diachronic corpus of scientific publications. Moreover, we provide a detailed analysis of our annotation results and suggest methods for improving the accuracy of annotations in a setting as difficult as ours. Secondly, we present results of a pilot study based on the annotated terms. The results suggest that terms in older texts are linguistically relatively simple units that are hard to distinguish from the lexicon of general language. We believe that this supports our hypothesis that terminology undergoes diachronic processes of densification and specialisation.

2015

pdf bib
Vector-space calculation of semantic surprisal for predicting word pronunciation duration
Asad Sayeed | Stefan Fischer | Vera Demberg
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)