Martin Yalamov
2020
Categorisation of Bulgarian Legislative Documents
Nikola Obreshkov | Martin Yalamov | Svetla Koeva
Proceedings of the 4th International Conference on Computational Linguistics in Bulgaria (CLIB 2020)
The paper presents the categorisation of the Bulgarian MARCELL corpus into top-level EuroVoc domains. The Bulgarian MARCELL corpus is part of a recently developed multilingual corpus representing the national legislation of seven European countries. We performed several experiments: with JEX Indexer, with neural networks, and with a basic method that measures the domain-specific terms in documents annotated in advance with IATE terms and EuroVoc descriptors (combined with grouping a primary document with its satellites, term extraction, and parsing of the document titles). The evaluation shows a slight advantage for the basic method. This makes it an appropriate choice, since the categorisation is intended as a module of an NLP pipeline for Bulgarian that continuously feeds and annotates the Bulgarian MARCELL corpus with newly issued legislative documents.
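The basic method described in the abstract can be illustrated with a minimal sketch. It assumes each document already carries a list of EuroVoc descriptor IDs and that a lookup table maps each descriptor to a top-level domain. The table `DESCRIPTOR_TO_DOMAIN`, its IDs and labels, and the function names are hypothetical and are not taken from the paper or from EuroVoc; the grouping of satellites, term extraction and title parsing are omitted.

```python
from collections import Counter

# Hypothetical mapping from descriptor IDs to top-level domain labels.
# IDs and labels are invented for illustration only.
DESCRIPTOR_TO_DOMAIN = {
    "d1": "LAW",
    "d2": "LAW",
    "d3": "FINANCE",
    "d4": "TRANSPORT",
}


def score_domains(descriptor_ids):
    """Count how many annotated descriptors fall in each top-level domain."""
    counts = Counter()
    for descriptor_id in descriptor_ids:
        domain = DESCRIPTOR_TO_DOMAIN.get(descriptor_id)
        if domain is not None:  # ignore descriptors missing from the table
            counts[domain] += 1
    return counts


def categorise(descriptor_ids, top_n=1):
    """Return the top_n domains, ranked by count and then alphabetically."""
    counts = score_domains(descriptor_ids)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [domain for domain, _ in ranked[:top_n]]
```

Calling `categorise(["d1", "d2", "d3"])` on this toy table picks the domain with the most matching descriptors. A counting approach like this needs no training, which may suit a pipeline that processes newly issued documents continuously.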
Natural Language Processing Pipeline to Annotate Bulgarian Legislative Documents
Svetla Koeva | Nikola Obreshkov | Martin Yalamov
Proceedings of the Twelfth Language Resources and Evaluation Conference
The paper presents the Bulgarian MARCELL corpus, part of a recently developed multilingual corpus representing the national legislation of seven European countries, and the NLP pipeline that turns the web-crawled data into a structured, linguistically annotated dataset. The Bulgarian data is web crawled, extracted from the original HTML format, filtered by document type, tokenised, sentence split, tagged and lemmatised with a fine-grained version of the Bulgarian Language Processing Chain, dependency parsed with NLP-Cube, and annotated with named entities (persons, locations, organisations and others), noun phrases, IATE terms and EuroVoc descriptors. An orchestrator process has been developed to control the NLP pipeline, performing end-to-end data processing and annotation that starts with document identification and ends with the generation of statistical reports. The Bulgarian MARCELL corpus consists of 25,283 documents (as of the beginning of November 2019), which are classified into eleven types.
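The orchestrator idea, running each document through an ordered chain of stages and then producing a summary report, can be sketched as follows. This is an illustrative toy, not the authors' implementation: the stage functions `tokenise` and `count_tokens`, the dictionary-based document format, and the report fields are all assumptions, and whitespace splitting stands in for the real annotation tools.

```python
def tokenise(doc):
    """Toy stage: whitespace tokenisation standing in for a real tokeniser."""
    doc["tokens"] = doc["text"].split()
    return doc


def count_tokens(doc):
    """Toy stage: record the token count for later reporting."""
    doc["n_tokens"] = len(doc["tokens"])
    return doc


def run_pipeline(doc, stages):
    """Apply each stage in order, recording which stages have completed."""
    for stage in stages:
        doc = stage(doc)
        doc.setdefault("completed", []).append(stage.__name__)
    return doc


def report(docs):
    """Aggregate simple statistics over the processed documents."""
    return {
        "documents": len(docs),
        "tokens": sum(d["n_tokens"] for d in docs),
    }
```

Recording the completed stages per document is one simple way for an orchestrator to resume or audit processing; the paper's actual control mechanism may differ.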