Ivan Lazichny
2022
ALToolbox: A Set of Tools for Active Learning Annotation of Natural Language Texts
Akim Tsvigun | Leonid Sanochkin | Daniil Larionov | Gleb Kuzmin | Artem Vazhentsev | Ivan Lazichny | Nikita Khromov | Danil Kireev | Aleksandr Rubashevskii | Olga Shahmatova | Dmitry V. Dylov | Igor Galitskiy | Artem Shelmanov
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
We present ALToolbox, an open-source framework for active learning (AL) annotation in natural language processing. Currently, the framework supports text classification, sequence tagging, and seq2seq tasks. Besides state-of-the-art query strategies, ALToolbox provides a set of tools that help reduce the computational overhead and duration of AL iterations and increase the reusability of annotated data. The framework aims to support data scientists and researchers by providing an easy-to-deploy GUI annotation tool directly in the Jupyter IDE and an extensible benchmark for novel AL methods. A small demonstration of ALToolbox capabilities is available online. The code of the framework is published under the MIT license.
Active Learning for Abstractive Text Summarization
Akim Tsvigun | Ivan Lysenko | Danila Sedashov | Ivan Lazichny | Eldar Damirov | Vladimir Karlov | Artemy Belousov | Leonid Sanochkin | Maxim Panov | Alexander Panchenko | Mikhail Burtsev | Artem Shelmanov
Findings of the Association for Computational Linguistics: EMNLP 2022
Constructing human-curated annotated datasets for abstractive text summarization (ATS) is very time-consuming and expensive because creating each instance requires a human annotator to read a long document and compose a shorter summary that preserves the key information conveyed by the original document. Active learning (AL) is a technique developed to reduce the amount of annotation required to achieve a certain level of machine learning model performance. In information extraction and text classification, AL can reduce the amount of labor severalfold. Despite its potential for aiding expensive annotation, as far as we know, there have been no effective AL query strategies for ATS. This stems from the fact that many AL strategies rely on uncertainty estimation, while, as we show in our work, uncertain instances are usually noisy, and selecting them can degrade model performance compared to passive annotation. We address this problem by proposing the first effective query strategy for AL in ATS based on diversity principles. We show that, given a certain annotation budget, using our strategy in AL annotation helps to improve model performance in terms of ROUGE and consistency scores. Additionally, we analyze the effect of self-learning and show that it can further increase the performance of the model.