Generating Distributable Surrogate Corpus for Medical Multi-label Classification

Seiji Shimizu, Shuntaro Yada, Shoko Wakamiya, Eiji Aramaki


Abstract
In medical and social media domains, annotated corpora are often hard to distribute due to copyright and privacy issues. To overcome this problem, we propose a new method to generate a surrogate corpus for a downstream task by using a text generation model. We chose a medical multi-label classification task, MedWeb, in which patient-generated short messages express multiple symptoms. We first fine-tuned text generation models with different prompting designs on the original corpus to obtain synthetic versions of that corpus. To assess the viability of the generated corpora for the downstream task, we compared the performance of multi-label classification models trained on either the original or the surrogate corpora. The results and the error analysis showed the difficulty of generating a surrogate corpus in multi-label settings, suggesting that text generation under complex conditions is not trivial. On the other hand, our experiment demonstrates that a corpus generated with sentinel-based prompting is comparatively viable in a single-label (multiclass) classification setting.
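The comparison described in the abstract (one classifier trained on the original corpus, one on a surrogate corpus, both scored against the same gold labels) can be illustrated with a minimal sketch. The abstract does not name the evaluation metric, so micro-averaged F1 is an assumption here. The label names, the toy predictions, and the `micro_f1` helper are hypothetical and are not taken from the paper.

```python
from typing import List, Set


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1 over all (message, label) decisions.

    Each element is the set of labels assigned to one message.
    The metric choice is an assumption; the abstract does not specify it.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # labels correctly predicted
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # predicted but not in gold
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # in gold but missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical gold labels for three short messages (illustrative only).
gold = [{"cough", "fever"}, {"headache"}, set()]

# Hypothetical outputs of two classifiers: one trained on the original
# corpus, one trained on a generated surrogate corpus.
pred_original = [{"cough", "fever"}, {"headache"}, set()]
pred_surrogate = [{"cough"}, {"headache", "fever"}, set()]

score_original = micro_f1(gold, pred_original)
score_surrogate = micro_f1(gold, pred_surrogate)

# The gap between the two scores is one way to express how viable the
# surrogate corpus is as a replacement for the original.
gap = score_original - score_surrogate
print(f"original={score_original:.3f} surrogate={score_surrogate:.3f} gap={gap:.3f}")
```

Under this framing, a small gap would suggest the surrogate corpus preserves most of the signal the downstream task needs, while a large gap would point to the kind of difficulty the abstract reports for the multi-label setting.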
Anthology ID:
2024.cl4health-1.19
Volume:
Proceedings of the First Workshop on Patient-Oriented Language Processing (CL4Health) @ LREC-COLING 2024
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Dina Demner-Fushman, Sophia Ananiadou, Paul Thompson, Brian Ondov
Venues:
CL4Health | WS
Publisher:
ELRA and ICCL
Pages:
153–162
URL:
https://aclanthology.org/2024.cl4health-1.19
Cite (ACL):
Seiji Shimizu, Shuntaro Yada, Shoko Wakamiya, and Eiji Aramaki. 2024. Generating Distributable Surrogate Corpus for Medical Multi-label Classification. In Proceedings of the First Workshop on Patient-Oriented Language Processing (CL4Health) @ LREC-COLING 2024, pages 153–162, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Generating Distributable Surrogate Corpus for Medical Multi-label Classification (Shimizu et al., CL4Health-WS 2024)
PDF:
https://aclanthology.org/2024.cl4health-1.19.pdf