This paper presents a corpus manually annotated with named entities for six Slavic languages — Bulgarian, Czech, Polish, Slovenian, Russian, and Ukrainian. This work is the result of a series of shared tasks, conducted in 2017–2023 as a part of the Workshops on Slavic Natural Language Processing. The corpus consists of 5,017 documents on seven topics. The documents are annotated with five classes of named entities. Each entity is described by a category, a lemma, and a unique cross-lingual identifier. We provide two train-tune dataset splits — single topic out and cross topics. For each split, we set benchmarks using a transformer-based neural network architecture with the pre-trained multilingual models — XLM-RoBERTa-large for named entity mention recognition and categorization, and mT5-large for named entity lemmatization and linking.
This paper describes Slav-NER: the 4th Multilingual Named Entity Challenge in Slavic languages. The tasks involve recognizing mentions of named entities in Web documents, normalization of the names, and cross-lingual linking. This version of the Challenge covers three languages and five entity types. It is organized as part of the 9th Slavic Natural Language Processing Workshop, co-located with the EACL 2023 Conference.Seven teams registered and three participated actively in the competition. Performance for the named entity recognition and normalization tasks reached 90% F1 measure, much higher than reported in the first edition of the Challenge, but similar to the results reported in the latest edition. Performance for the entity linking task for individual language reached the range of 72-80% F1 measure. Detailed evaluation information is available on the Shared Task web page.
In the paper, we deal with the problem of unsupervised text document clustering for the Polish language. Our goal is to compare the modern approaches based on language modeling (doc2vec and BERT) with the classical ones, i.e., TF-IDF and wordnet-based. The experiments are conducted on three datasets containing qualification descriptions. The experiments’ results showed that wordnet-based similarity measures could compete and even outperform modern embedding-based approaches.
This paper describes Slav-NER: the 3rd Multilingual Named Entity Challenge in Slavic languages. The tasks involve recognizing mentions of named entities in Web documents, normalization of the names, and cross-lingual linking. The Challenge covers six languages and five entity types, and is organized as part of the 8th Balto-Slavic Natural Language Processing Workshop, co-located with the EACL 2021 Conference. Ten teams participated in the competition. Performance for the named entity recognition task reached 90% F-measure, much higher than reported in the first edition of the Challenge. Seven teams covered all six languages, and five teams participated in the cross-lingual entity linking task. Detailed valuation information is available on the shared task web page.
In the paper, we focus on modeling spatial expressions in texts. We present the guidelines used to annotate the PST 2.0 (Corpus of Polish Spatial Texts) — a corpus designed for training and testing the tools for spatial expression recognition. The corpus contains a set of texts gathered from texts collected from travel blogs available under Creative Commons license. We have defined our guidelines based on three existing specifications for English (SpatialML, SpatialRole Labelling from SemEval-2013 Task 3 and ISO-Space1.4 from SpaceEval 2014). We briefly present the existing specifications and discuss what modifications have been made to adapt the guidelines to the characteristics of the Polish language. We also describe the process of data collection and manual annotation, including inter-annotator agreement calculation and corpus statistics. In the end, we present detailed statistics of the PST 2.0 corpus, which include the number of components, relations, expressions, and the most common values of spatial indicators, motion indicators, path indicators, distances, directions, and regions.
In the paper we present the latest changes introduce to Inforex — a web-based system for qualitative and collaborative text corpora annotation and analysis. One of the most important news is the release of source codes. Now the system is available on the GitHub repository (https://github.com/CLARIN-PL/Inforex) as an open source project. The system can be easily setup and run in a Docker container what simplifies the installation process. The major improvements include: semi-automatic text annotation, multilingual text preprocessing using CLARIN-PL web services, morphological tagging of XML documents, improved editor for annotation attribute, batch annotation attribute editor, morphological disambiguation, extended word sense annotation. This paper contains a brief description of the mentioned improvements. We also present two use cases in which various Inforex features were used and tested in real-life projects.
We describe the Second Multilingual Named Entity Challenge in Slavic languages. The task is recognizing mentions of named entities in Web documents, their normalization, and cross-lingual linking. The Challenge was organized as part of the 7th Balto-Slavic Natural Language Processing Workshop, co-located with the ACL-2019 conference. Eight teams participated in the competition, which covered four languages and five entity types. Performance for the named entity recognition task reached 90% F-measure, much higher than reported in the first edition of the Challenge. Seven teams covered all four languages, and five teams participated in the cross-lingual entity linking task. Detailed evaluation information is available on the shared task web page.
In this article we present the result of the recent research in the recognition and normalisation of Polish temporal expressions. The temporal information extracted from the text plays major role in many information extraction systems, like question answering, event recognition or discourse analysis. We proposed a new method for the temporal expressions normalisation, called Cascade of Partial Rules. Here we describe results achieved by updated version of Liner2 machine learning system.
We report a first major upgrade of Inforex — a web-based system for qualitative and collaborative text corpora annotation and analysis. Inforex is a part of Polish CLARIN infrastructure. It is integrated with a digital repository for storing and publishing language resources and allows to visualize, browse and annotate text corpora stored in the repository. As a result of a series of workshops for researches from humanities and social sciences fields we improved the graphical interface to make the system more friendly and readable for non-experienced users. We also implemented a new functionality for gold standard annotation which includes private annotations and annotation agreement by a super-annotator.
In the paper we present a tool for lemmatization of multi-word common noun phrases and named entities for Polish called LemmaPL. The tool is based on a set of manually crafted rules and heuristics utilizing a set of dictionaries (including morphological, named entities and inflection patterns). The accuracy of lemmatization obtained by the tool reached 97.99% on a dataset with multi-word common noun phrases and 86.17% for case-sensitive evaluation on a dataset with named entities.
In the paper we present an adaptation of Liner2 framework to solve the BSNLP 2017 shared task on multilingual named entity recognition. The tool is tuned to recognize and lemmatize named entities for Polish.
The aim of this paper is to present a system for semantic text annotation called Inforex. Inforex is a web-based system designed for managing and annotating text corpora on the semantic level including annotation of Named Entities (NE), anaphora, Word Sense Disambiguation (WSD) and relations between named entities. The system also supports manual text clean-up and automatic text pre-processing including text segmentation, morphosyntactic analysis and word selection for word sense annotation. Inforex can be accessed from any standard-compliant web browser supporting JavaScript. The user interface has a form of dynamic HTML pages using the AJAX technology. The server part of the system is written in PHP and the data is stored in MySQL database. The system make use of some external tools that are installed on the server or can be accessed via web services. The documents are stored in the database in the original format ― either plain text, XML or HTML. Tokenization and sentence segmentation is optional and is stored in a separate table. Tokens are stored as pairs of values representing indexes of first and last character of the tokens and sets of features representing the morpho-syntactic information.
This paper presents our efforts aimed at collecting and annotating a free Polish corpus. The corpus will serve for us as training and testing material for experiments with Machine Learning algorithms. As others may also benefit from the resource, we are going to release it under a Creative Commons licence, which is hoped to remove unnecessary usage restrictions, but also to facilitate reproduction of our experimental results. The corpus is being annotated with various types of linguistic entities: chunks and named entities, selected syntactic and semantic relations, word senses and anaphora. We report on the current state of the project as well as our ultimate goals.
A limited prototype of the CLARIN Language Technology Infrastructure (LTI) node is presented. The node prototype provides several types of web services for Polish. The functionality encompasses morpho-syntactic processing, shallow semantic processing of corpus on the basis of the SuperMatrix system and plWordNet browsing. We take the prototype as the starting point for the discussion on requirements that must be fulfilled by the LTI. Some possible solutions are proposed for less frequently discussed problems, e.g. streaming processing of language data on the remote processing node. We experimentally investigate how to tackle with several requirements from many discussed. Such aspects as processing large volumes of data, asynchronous mode of processing and scalability of the architecture to large number of users got especial attention in the constructed prototype of the Web Service for morpho-syntactic processing of Polish called TaKIPI-WS (http://plwordnet.pwr.wroc.pl/clarin/ws/takipi/). TaKIPI-WS is a distributed system with a three-layer architecture, an asynchronous model of request handling and multi-agent-based processing. TaKIPI-WS consists of three layers: WS Interface, Database and Daemons. The role of the Database is to store and exchange data between the Interface and the Daemons. The Daemons (i.e. taggers) are responsible for executing the requests queued in the database. Results of the performance tests are presented in the paper, too.
The paper deals with the task of definition extraction from a small and noisy corpus of instructive texts. Three approaches are presented: Partial Parsing, Machine Learning and a sequential combination of both. We show that applying ML methods with the support of a trivial grammar gives results better than a relatively complicated partial grammar, and much better than pure ML approach.