The MARCELL Legislative Corpus
Tamás Váradi, Svetla Koeva, Martin Yamalov, Marko Tadić, Bálint Sass, Bartłomiej Nitoń, Maciej Ogrodniczuk, Piotr Pęzik, Verginica Barbu Mititelu, Radu Ion, Elena Irimia, Maria Mitrofan, Vasile Păiș, Dan Tufiș, Radovan Garabík, Simon Krek, Andraz Repar, Matjaž Rihtar, Janez Brank
Abstract
This article presents the current outcomes of the MARCELL CEF Telecom project aiming to collect and deeply annotate a large comparable corpus of legal documents. The MARCELL corpus includes 7 monolingual sub-corpora (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) containing the total body of respective national legislative documents. These sub-corpora are automatically sentence split, tokenized, lemmatized and morphologically and syntactically annotated. The monolingual sub-corpora are complemented by a thematically related parallel corpus (Croatian-English). The metadata and the annotations are uniformly provided for each language specific sub-corpus. Besides the standard morphosyntactic analysis plus named entity and dependency annotation, the corpus is enriched with the IATE and EUROVOC labels. The file format is CoNLL-U Plus Format, containing the ten columns specific to the CoNLL-U format and four extra columns specific to our corpora. The MARCELL corpora represents a rich and valuable source for further studies and developments in machine learning, cross-lingual terminological data extraction and classification.- Anthology ID:
- 2020.lrec-1.464
- Volume:
- Proceedings of the Twelfth Language Resources and Evaluation Conference
- Month:
- May
- Year:
- 2020
- Address:
- Marseille, France
- Editors:
- Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
- Venue:
- LREC
- SIG:
- Publisher:
- European Language Resources Association
- Note:
- Pages:
- 3761–3768
- Language:
- English
- URL:
- https://aclanthology.org/2020.lrec-1.464
- DOI:
- Bibkey:
- Cite (ACL):
- Tamás Váradi, Svetla Koeva, Martin Yamalov, Marko Tadić, Bálint Sass, Bartłomiej Nitoń, Maciej Ogrodniczuk, Piotr Pęzik, Verginica Barbu Mititelu, Radu Ion, Elena Irimia, Maria Mitrofan, Vasile Păiș, Dan Tufiș, Radovan Garabík, Simon Krek, Andraz Repar, Matjaž Rihtar, and Janez Brank. 2020. The MARCELL Legislative Corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 3761–3768, Marseille, France. European Language Resources Association.
- Cite (Informal):
- The MARCELL Legislative Corpus (Váradi et al., LREC 2020)
- Copy Citation:
- PDF:
- https://aclanthology.org/2020.lrec-1.464.pdf
Export citation
@inproceedings{varadi-etal-2020-marcell, title = "The {MARCELL} Legislative Corpus", author = "V{\'a}radi, Tam{\'a}s and Koeva, Svetla and Yamalov, Martin and Tadi{\'c}, Marko and Sass, B{\'a}lint and Nito{\'n}, Bart{\l}omiej and Ogrodniczuk, Maciej and P{\k{e}}zik, Piotr and Barbu Mititelu, Verginica and Ion, Radu and Irimia, Elena and Mitrofan, Maria and P{\u{a}}i{\textcommabelow{s}}, Vasile and Tufi{\textcommabelow{s}}, Dan and Garab{\'\i}k, Radovan and Krek, Simon and Repar, Andraz and Rihtar, Matja{\v{z}} and Brank, Janez", editor = "Calzolari, Nicoletta and B{\'e}chet, Fr{\'e}d{\'e}ric and Blache, Philippe and Choukri, Khalid and Cieri, Christopher and Declerck, Thierry and Goggi, Sara and Isahara, Hitoshi and Maegaard, Bente and Mariani, Joseph and Mazo, H{\'e}l{\`e}ne and Moreno, Asuncion and Odijk, Jan and Piperidis, Stelios", booktitle = "Proceedings of the Twelfth Language Resources and Evaluation Conference", month = may, year = "2020", address = "Marseille, France", publisher = "European Language Resources Association", url = "https://aclanthology.org/2020.lrec-1.464", pages = "3761--3768", abstract = "This article presents the current outcomes of the MARCELL CEF Telecom project aiming to collect and deeply annotate a large comparable corpus of legal documents. The MARCELL corpus includes 7 monolingual sub-corpora (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) containing the total body of respective national legislative documents. These sub-corpora are automatically sentence split, tokenized, lemmatized and morphologically and syntactically annotated. The monolingual sub-corpora are complemented by a thematically related parallel corpus (Croatian-English). The metadata and the annotations are uniformly provided for each language specific sub-corpus. Besides the standard morphosyntactic analysis plus named entity and dependency annotation, the corpus is enriched with the IATE and EUROVOC labels. The file format is CoNLL-U Plus Format, containing the ten columns specific to the CoNLL-U format and four extra columns specific to our corpora. The MARCELL corpora represents a rich and valuable source for further studies and developments in machine learning, cross-lingual terminological data extraction and classification.", language = "English", ISBN = "979-10-95546-34-4", }
<?xml version="1.0" encoding="UTF-8"?> <modsCollection xmlns="http://www.loc.gov/mods/v3"> <mods ID="varadi-etal-2020-marcell"> <titleInfo> <title>The MARCELL Legislative Corpus</title> </titleInfo> <name type="personal"> <namePart type="given">Tamás</namePart> <namePart type="family">Váradi</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Svetla</namePart> <namePart type="family">Koeva</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Martin</namePart> <namePart type="family">Yamalov</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Marko</namePart> <namePart type="family">Tadić</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Bálint</namePart> <namePart type="family">Sass</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Bartłomiej</namePart> <namePart type="family">Nitoń</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Maciej</namePart> <namePart type="family">Ogrodniczuk</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Piotr</namePart> <namePart type="family">Pęzik</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Verginica</namePart> <namePart type="family">Barbu Mititelu</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Radu</namePart> <namePart type="family">Ion</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Elena</namePart> <namePart type="family">Irimia</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Maria</namePart> <namePart type="family">Mitrofan</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Vasile</namePart> <namePart type="family">Păi\textcommabelows</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Dan</namePart> <namePart type="family">Tufi\textcommabelows</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Radovan</namePart> <namePart type="family">Garabík</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Simon</namePart> <namePart type="family">Krek</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Andraz</namePart> <namePart type="family">Repar</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Matjaž</namePart> <namePart type="family">Rihtar</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Janez</namePart> <namePart type="family">Brank</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <originInfo> <dateIssued>2020-05</dateIssued> </originInfo> <typeOfResource>text</typeOfResource> <language> <languageTerm type="text">English</languageTerm> <languageTerm type="code" authority="iso639-2b">eng</languageTerm> </language> <relatedItem type="host"> <titleInfo> <title>Proceedings of the Twelfth Language Resources and Evaluation Conference</title> </titleInfo> <name type="personal"> <namePart type="given">Nicoletta</namePart> <namePart type="family">Calzolari</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Frédéric</namePart> <namePart type="family">Béchet</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Philippe</namePart> <namePart type="family">Blache</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Khalid</namePart> <namePart type="family">Choukri</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Christopher</namePart> <namePart type="family">Cieri</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Thierry</namePart> <namePart type="family">Declerck</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Sara</namePart> <namePart type="family">Goggi</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Hitoshi</namePart> <namePart type="family">Isahara</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Bente</namePart> <namePart type="family">Maegaard</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Joseph</namePart> <namePart type="family">Mariani</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Hélène</namePart> <namePart type="family">Mazo</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Asuncion</namePart> <namePart type="family">Moreno</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Jan</namePart> <namePart type="family">Odijk</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Stelios</namePart> <namePart type="family">Piperidis</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <originInfo> <publisher>European Language Resources Association</publisher> <place> <placeTerm type="text">Marseille, France</placeTerm> </place> </originInfo> <genre authority="marcgt">conference publication</genre> <identifier type="isbn">979-10-95546-34-4</identifier> </relatedItem> <abstract>This article presents the current outcomes of the MARCELL CEF Telecom project aiming to collect and deeply annotate a large comparable corpus of legal documents. The MARCELL corpus includes 7 monolingual sub-corpora (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) containing the total body of respective national legislative documents. These sub-corpora are automatically sentence split, tokenized, lemmatized and morphologically and syntactically annotated. The monolingual sub-corpora are complemented by a thematically related parallel corpus (Croatian-English). The metadata and the annotations are uniformly provided for each language specific sub-corpus. Besides the standard morphosyntactic analysis plus named entity and dependency annotation, the corpus is enriched with the IATE and EUROVOC labels. The file format is CoNLL-U Plus Format, containing the ten columns specific to the CoNLL-U format and four extra columns specific to our corpora. The MARCELL corpora represents a rich and valuable source for further studies and developments in machine learning, cross-lingual terminological data extraction and classification.</abstract> <identifier type="citekey">varadi-etal-2020-marcell</identifier> <location> <url>https://aclanthology.org/2020.lrec-1.464</url> </location> <part> <date>2020-05</date> <extent unit="page"> <start>3761</start> <end>3768</end> </extent> </part> </mods> </modsCollection>
%0 Conference Proceedings %T The MARCELL Legislative Corpus %A Váradi, Tamás %A Koeva, Svetla %A Yamalov, Martin %A Tadić, Marko %A Sass, Bálint %A Nitoń, Bartłomiej %A Ogrodniczuk, Maciej %A Pęzik, Piotr %A Barbu Mititelu, Verginica %A Ion, Radu %A Irimia, Elena %A Mitrofan, Maria %A Păi\textcommabelows, Vasile %A Tufi\textcommabelows, Dan %A Garabík, Radovan %A Krek, Simon %A Repar, Andraz %A Rihtar, Matjaž %A Brank, Janez %Y Calzolari, Nicoletta %Y Béchet, Frédéric %Y Blache, Philippe %Y Choukri, Khalid %Y Cieri, Christopher %Y Declerck, Thierry %Y Goggi, Sara %Y Isahara, Hitoshi %Y Maegaard, Bente %Y Mariani, Joseph %Y Mazo, Hélène %Y Moreno, Asuncion %Y Odijk, Jan %Y Piperidis, Stelios %S Proceedings of the Twelfth Language Resources and Evaluation Conference %D 2020 %8 May %I European Language Resources Association %C Marseille, France %@ 979-10-95546-34-4 %G English %F varadi-etal-2020-marcell %X This article presents the current outcomes of the MARCELL CEF Telecom project aiming to collect and deeply annotate a large comparable corpus of legal documents. The MARCELL corpus includes 7 monolingual sub-corpora (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) containing the total body of respective national legislative documents. These sub-corpora are automatically sentence split, tokenized, lemmatized and morphologically and syntactically annotated. The monolingual sub-corpora are complemented by a thematically related parallel corpus (Croatian-English). The metadata and the annotations are uniformly provided for each language specific sub-corpus. Besides the standard morphosyntactic analysis plus named entity and dependency annotation, the corpus is enriched with the IATE and EUROVOC labels. The file format is CoNLL-U Plus Format, containing the ten columns specific to the CoNLL-U format and four extra columns specific to our corpora. The MARCELL corpora represents a rich and valuable source for further studies and developments in machine learning, cross-lingual terminological data extraction and classification. %U https://aclanthology.org/2020.lrec-1.464 %P 3761-3768
Markdown (Informal)
[The MARCELL Legislative Corpus](https://aclanthology.org/2020.lrec-1.464) (Váradi et al., LREC 2020)
- The MARCELL Legislative Corpus (Váradi et al., LREC 2020)
ACL
- Tamás Váradi, Svetla Koeva, Martin Yamalov, Marko Tadić, Bálint Sass, Bartłomiej Nitoń, Maciej Ogrodniczuk, Piotr Pęzik, Verginica Barbu Mititelu, Radu Ion, Elena Irimia, Maria Mitrofan, Vasile Păiș, Dan Tufiș, Radovan Garabík, Simon Krek, Andraz Repar, Matjaž Rihtar, and Janez Brank. 2020. The MARCELL Legislative Corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 3761–3768, Marseille, France. European Language Resources Association.