2024
Developing a Rhetorical Structure Theory Treebank for Czech
Lucie Polakova | Jiří Mírovský | Šárka Zikánová | Eva Hajicova
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
We introduce the first version of the Czech RST Discourse Treebank, a collection of Czech journalistic texts manually annotated using the Rhetorical Structure Theory (RST), a global coherence model proposed by Mann and Thompson (1988). Each document in the corpus is represented as a single tree-like structure, where discourse units are interconnected through hierarchical rhetorical relations and their relative importance for the main purpose of a text is modeled by the nuclearity principle. The treebank is freely available in the LINDAT/CLARIAH-CZ repository under the Creative Commons license; for some documents, it includes two gold annotations representing divergent yet relevant interpretations. The paper outlines the annotation process, provides corpus statistics and evaluation, and discusses the issue of consistency associated with the global level of textual interpretation. In general, good agreement on the structure and labeling could be achieved on the lowest, local tree level and on the identification of the most central (nuclear) elementary discourse units. Disagreements mostly concerned segmentation and, in the structure, differences in the stepwise process of linking the largest text blocks. The project contributes to the advancement of RST research and its application to real-world text analysis challenges.
DiscoGeM 2.0: A Parallel Corpus of English, German, French and Czech Implicit Discourse Relations
Frances Yung | Merel Scholman | Sarka Zikanova | Vera Demberg
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
We present DiscoGeM 2.0, a crowdsourced, parallel corpus of 12,834 implicit discourse relations, with English, German, French and Czech data. We propose and validate a new single-step crowdsourcing annotation method and apply it to collect new annotations in German, French and Czech. The corpus was constructed by having crowdsourced annotators choose a suitable discourse connective for each relation from a set of unambiguous candidates. Every instance was annotated by 10 workers. Our corpus hence represents the first multi-lingual resource that contains distributions of discourse interpretations for implicit relations. The results show that the connective insertion method of discourse annotation can be reliably extended to other languages. The resulting multi-lingual annotations also reveal that implicit relations inferred in one language may differ from those inferred in the translation, meaning the annotations are not always directly transferable. DiscoGeM 2.0 promotes the investigation of cross-linguistic differences in discourse marking and could improve automatic discourse parsing applications. It is openly downloadable here: https://github.com/merelscholman/DiscoGeM.
Attitudes in Diplomatic Speeches: Introducing the CoDipA UNSC 1.0
Mariia Anisimova | Šárka Zikánová
Proceedings of the 20th Joint ACL - ISO Workshop on Interoperable Semantic Annotation @ LREC-COLING 2024
This paper presents CoDipA UNSC 1.0, a Corpus of Diplomatic Attitudes of the United Nations Security Council annotated with the attitude part of Appraisal theory. The speeches were manually selected according to topic-related and temporal criteria. The texts were then annotated according to a predefined annotation scenario. The distinguishing features of diplomatic texts require a modified approach to attitude evaluation, which is implemented and presented in this work. The corpus analysis shows diplomatic speeches to be consistently evaluative, offers an overview of the most prominent means of expressing subjectivity in the corpus, and provides the results of the inter-annotator agreement evaluation.
2013
Introducing the Prague Discourse Treebank 1.0
Lucie Poláková | Jiří Mírovský | Anna Nedoluzhko | Pavlína Jínová | Šárka Zikánová | Eva Hajičová
Proceedings of the Sixth International Joint Conference on Natural Language Processing
2010
Typical Cases of Annotators’ Disagreement in Discourse Annotations in Prague Dependency Treebank
Šárka Zikánová | Lucie Mladová | Jiří Mírovský | Pavlína Jínová
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
In this paper, we present the first results of the parallel Czech discourse annotation in the Prague Dependency Treebank 2.0. Having established an annotation scenario for capturing semantic relations crossing the sentence boundary in a discourse, and having annotated the first sections of the treebank according to these guidelines, we now report on the results of the first evaluation of these manual annotations. We give an overview of the annotation process itself, which we believe is to a large degree language-independent and therefore accessible to any discourse researcher. Next, we describe the inter-annotator agreement measurement, and, most importantly, we classify and analyze the most common types of annotators' disagreement and propose solutions for the next phase of the annotation. The annotation is carried out on dependency trees (on the tectogrammatical layer); this approach is quite novel, and it brings some advantages when interpreting the syntactic structure of the discourse units.
Connective-Based Measuring of the Inter-Annotator Agreement in the Annotation of Discourse in PDT
Jiří Mírovský | Lucie Mladová | Šárka Zikánová
Coling 2010: Posters
2008
From Sentence to Discourse: Building an Annotation Scheme for Discourse Based on Prague Dependency Treebank
Lucie Mladová | Šárka Zikánová | Eva Hajičová
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
The present paper reports on preparatory research for building a language corpus annotation scenario capturing the discourse relations in Czech. We primarily focus on the description of the syntactically motivated relations in discourse, basing our findings on the theoretical background of the Prague Dependency Treebank 2.0 and the Penn Discourse Treebank 2. Our aim is to revisit the present-day syntactico-semantic (tectogrammatical) annotation in the Prague Dependency Treebank, to extend it for the purposes of a sentence-boundary-crossing representation, and eventually to design a new, discourse level of annotation. In this paper, we propose a feasible process of such a transfer, comparing the possibilities the Praguian dependency-based approach offers with the Penn discourse annotation based primarily on the analysis and classification of discourse connectives.