2021
HeiCiC: A simultaneous interpreting corpus combining product and pre-process data
Kerstin Kunz | Christoph Stoll | Eva Klüber
Proceedings for the First Workshop on Modelling Translation: Translatology in the Digital Age
2020
Exploring Coreference Features in Heterogeneous Data
Ekaterina Lapshinova-Koltunski | Kerstin Kunz
Proceedings of the First Workshop on Computational Approaches to Discourse
The present paper focuses on variation phenomena in coreference chains. We address the hypothesis that the degree of structural variation between chain elements depends on language-specific constraints and preferences and, even more, on the communicative situation of language production. We define coreference features that also include reference to abstract entities and events. These features are inspired by several sources: cognitive parameters, pragmatic factors and typological status. We examine the distributions of these features in a dataset containing English and German texts of spoken and written discourse mode, which can be classified into seven different registers. We apply text classification and feature selection to find out how these variational dimensions (language, mode and register) affect coreference features. Knowledge of the variation under analysis is valuable for contrastive linguistics, translation studies and multilingual natural language processing (NLP), e.g. machine translation or cross-lingual coreference resolution.
2016
Beyond Identity Coreference: Contrasting Indicators of Textual Coherence in English and German
Kerstin Kunz | Ekaterina Lapshinova-Koltunski | José Manuel Martínez
Proceedings of the Workshop on Coreference Resolution Beyond OntoNotes (CORBON 2016)
From Interoperable Annotations towards Interoperable Resources: A Multilingual Approach to the Analysis of Discourse
Ekaterina Lapshinova-Koltunski | Kerstin Anna Kunz | Anna Nedoluzhko
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
In the present paper, we analyse variation of discourse phenomena in two typologically different languages, German and Czech. The novelty of our approach lies in the nature of the resources we are using. We take advantage of existing resources, which are, however, annotated on the basis of two different frameworks. We use an interoperable scheme that unifies discourse phenomena in both frameworks into more abstract categories and considers only those phenomena that have a direct match in German and Czech. The discourse properties we focus on are relations of identity, semantic similarity, ellipsis and discourse relations. Our study shows that the application of interoperable schemes allows an exploitation of discourse-related phenomena analysed in different projects and on the basis of different frameworks. As corpus compilation and annotation are time-consuming tasks, positive results of this experiment open up new paths for contrastive linguistics, translation studies and NLP, including machine translation.
2015
Across Languages and Genres: Creating a Universal Annotation Scheme for Textual Relations
Ekaterina Lapshinova-Koltunski | Anna Nedoluzhko | Kerstin Anna Kunz
Proceedings of the 9th Linguistic Annotation Workshop
2012
Coreference in Spoken vs. Written Texts: a Corpus-based Analysis
Marilisa Amoia | Kerstin Kunz | Ekaterina Lapshinova-Koltunski
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
This paper describes an empirical study of coreference in spoken vs. written text. We focus on the comparison of two particular text types, interviews and popular science texts, as instances of spoken and written texts, since they display quite different discourse structures. We believe, in fact, that the correlation between difficulties in coreference resolution and varying discourse structures requires a deeper analysis that accounts for the diversity of coreference strategies or their sub-phenomena as indicators of text type or genre. In this work, we therefore aim at defining specific parameters that classify differences in genres of spoken and written texts, such as the preferred segmentation strategy, the maximal allowed distance within coreference chains, the length and size of those chains, and the correlation of structural and syntactic features of coreferring expressions. We argue that a characterization of such genre-dependent parameters might improve the performance of current state-of-the-art coreference resolution technology.
2011
Discontinuous Constituents: a Problematic Case for Parallel Corpora Annotation and Querying
Marilisa Amoia | Kerstin Kunz | Ekaterina Lapshinova-Koltunski
Proceedings of the Second Workshop on Annotation and Exploitation of Parallel Corpora