Active Learning for Assisted Corpus Construction: A Case Study in Knowledge Discovery from Biomedical Text

Hian Cañizares-Díaz, Alejandro Piad-Morffis, Suilan Estevez-Velarde, Yoan Gutiérrez, Yudivián Almeida Cruz, Andres Montoyo, Rafael Muñoz-Guillena


Abstract
This paper presents an active learning approach that aims to reduce the human effort required to annotate natural language corpora composed of entities and semantic relations. Our approach assists human annotators by selecting the most informative sentences to annotate and then pre-annotating them with a few highly accurate entities and semantic relations. We define an uncertainty-based query strategy with a weighted density factor, using similarity metrics based on sentence embeddings. As a case study, we evaluate our approach via simulation on a biomedical corpus and estimate the potential reduction in total annotation time. Experimental results suggest that the query strategy reduces the number of sentences that must be manually annotated to train systems that reach a target F1 score by 35% to 40%, while the pre-annotation strategy produces an additional 24% reduction in total annotation time. Overall, our preliminary experiments suggest that as much as 60% of the annotation time could be saved while producing corpora that are equally useful for training machine learning algorithms. We also present an open-source tool that implements these strategies and publish it online for the research community.
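The abstract names an uncertainty-based query strategy with a weighted density factor but does not give its formula. The following is a minimal sketch of one common way to combine the two ideas, assuming an information-density style score (uncertainty multiplied by average cosine similarity raised to a weight `beta`). The function names, the `beta` parameter, and the exact combination rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between row vectors (sentence embeddings)."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T


def density_weighted_scores(uncertainty, embeddings, beta=1.0):
    """Assumed scoring rule: uncertainty * density ** beta.

    density is the mean cosine similarity of a sentence to every other
    sentence in the unlabeled pool; beta weights how much density matters
    (beta = 0 falls back to plain uncertainty sampling).
    """
    sim = cosine_similarity_matrix(embeddings)
    n = sim.shape[0]
    # Exclude each sentence's similarity with itself from the average.
    density = (sim.sum(axis=1) - np.diag(sim)) / max(n - 1, 1)
    density = np.clip(density, 0.0, None)  # keep the base non-negative
    return uncertainty * density ** beta


def select_batch(uncertainty, embeddings, k, beta=1.0):
    """Indices of the k highest-scoring sentences, best first."""
    scores = density_weighted_scores(uncertainty, embeddings, beta)
    return np.argsort(-scores)[:k]
```

In this sketch, `uncertainty` would come from the current model (for example, one minus its confidence on each sentence) and `embeddings` from any sentence encoder. The density term lowers the score of outlier sentences that are highly uncertain but unrepresentative of the pool.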
Anthology ID:
2021.ranlp-1.26
Volume:
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
Month:
September
Year:
2021
Address:
Held Online
Editors:
Ruslan Mitkov, Galia Angelova
Venue:
RANLP
Publisher:
INCOMA Ltd.
Pages:
216–225
URL:
https://aclanthology.org/2021.ranlp-1.26
Cite (ACL):
Hian Cañizares-Díaz, Alejandro Piad-Morffis, Suilan Estevez-Velarde, Yoan Gutiérrez, Yudivián Almeida Cruz, Andres Montoyo, and Rafael Muñoz-Guillena. 2021. Active Learning for Assisted Corpus Construction: A Case Study in Knowledge Discovery from Biomedical Text. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), pages 216–225, Held Online. INCOMA Ltd.
Cite (Informal):
Active Learning for Assisted Corpus Construction: A Case Study in Knowledge Discovery from Biomedical Text (Cañizares-Díaz et al., RANLP 2021)
PDF:
https://aclanthology.org/2021.ranlp-1.26.pdf