Enhancing Image-to-Text Generation in Radiology Reports through Cross-modal Multi-Task Learning

Nurbanu Aksoy, Nishant Ravikumar, Serge Sharoff


Abstract
Image-to-text generation involves automatically generating descriptive text from images and has applications in medical report generation. However, traditional approaches often exhibit a semantic gap between visual and textual information. In this paper, we propose a multi-task learning framework to leverage both visual and non-imaging data for generating radiology reports. Along with chest X-ray images, 10 additional features comprising numeric, binary, categorical, and text data were incorporated to create a unified representation. The model was trained to generate text, predict the degree of patient severity, and identify medical findings. Multi-task learning, especially with text generation prioritisation, improved performance over single-task baselines across language generation metrics. The framework also mitigated overfitting in auxiliary tasks compared to single-task models. Qualitative analysis showed logically coherent narratives and accurate identification of findings, though some repetition and disjointed phrasing remained. This work demonstrates the benefits of multi-modal, multi-task learning for image-to-text generation applications.
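The abstract describes a cross-modal fusion of a chest X-ray encoding with 10 non-imaging features, followed by three task heads whose losses are combined with text generation given the largest weight. Below is a minimal PyTorch-style sketch of that setup; every module choice, dimension, and loss weight here is an illustrative assumption, not the authors' implementation (the paper's actual encoders, decoder, and weighting scheme are described in the full text).

```python
import torch
import torch.nn as nn

class CrossModalMultiTaskModel(nn.Module):
    """Sketch: fuse image and non-imaging features into one representation,
    then branch into three task heads (report generation, severity, findings).
    All sizes and module choices are assumptions, not the paper's design."""

    def __init__(self, vocab_size=10000, num_findings=14, num_severity_levels=4,
                 img_dim=512, tab_dim=10, hidden_dim=512):
        super().__init__()
        # Placeholder image encoder (stands in for a CNN/ViT backbone)
        self.image_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(img_dim), nn.ReLU())
        # Encoder for the 10 numeric/binary/categorical/text-derived features
        self.tabular_encoder = nn.Sequential(nn.Linear(tab_dim, hidden_dim), nn.ReLU())
        # Cross-modal fusion into a unified representation
        self.fusion = nn.Linear(img_dim + hidden_dim, hidden_dim)
        # Task heads: a stand-in text decoder plus two classification heads
        self.report_decoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.vocab_proj = nn.Linear(hidden_dim, vocab_size)
        self.severity_head = nn.Linear(hidden_dim, num_severity_levels)
        self.findings_head = nn.Linear(hidden_dim, num_findings)

    def forward(self, image, tabular, report_len=64):
        img_feat = self.image_encoder(image)
        tab_feat = self.tabular_encoder(tabular)
        fused = torch.relu(self.fusion(torch.cat([img_feat, tab_feat], dim=-1)))
        # Decode a fixed-length report by conditioning each step on the fused vector
        dec_in = fused.unsqueeze(1).repeat(1, report_len, 1)
        dec_out, _ = self.report_decoder(dec_in)
        return {
            "report_logits": self.vocab_proj(dec_out),      # (B, T, vocab)
            "severity_logits": self.severity_head(fused),   # (B, severity levels)
            "findings_logits": self.findings_head(fused),   # (B, findings)
        }


def multitask_loss(outputs, targets, w_text=1.0, w_sev=0.3, w_find=0.3):
    """Weighted multi-task objective that prioritises text generation.
    The weights are illustrative, not the values used in the paper."""
    text_loss = nn.functional.cross_entropy(
        outputs["report_logits"].flatten(0, 1), targets["report_ids"].flatten())
    sev_loss = nn.functional.cross_entropy(outputs["severity_logits"], targets["severity"])
    find_loss = nn.functional.binary_cross_entropy_with_logits(
        outputs["findings_logits"], targets["findings"].float())
    return w_text * text_loss + w_sev * sev_loss + w_find * find_loss
```

Under this sketch, prioritising text generation simply means keeping w_text larger than the auxiliary weights, so gradients from the report-generation loss dominate the shared fusion layers while the severity and findings heads act as regularising auxiliary signals.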
Anthology ID:
2024.lrec-main.529
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
5977–5985
URL:
https://aclanthology.org/2024.lrec-main.529
Cite (ACL):
Nurbanu Aksoy, Nishant Ravikumar, and Serge Sharoff. 2024. Enhancing Image-to-Text Generation in Radiology Reports through Cross-modal Multi-Task Learning. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 5977–5985, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Enhancing Image-to-Text Generation in Radiology Reports through Cross-modal Multi-Task Learning (Aksoy et al., LREC-COLING 2024)
PDF:
https://aclanthology.org/2024.lrec-main.529.pdf