A Cross-document Coreference Dataset for Longitudinal Tracking across Radiology Reports

Surabhi Datta, Hio Cheng Lam, Atieh Pajouhi, Sunitha Mogalla, Kirk Roberts


Abstract
This paper proposes a new cross-document coreference resolution (CDCR) dataset for identifying co-referring radiological findings and medical devices across a patient’s radiology reports. Our annotated corpus contains 5872 mentions (findings and devices) spanning 638 MIMIC-III radiology reports across 60 patients, covering multiple imaging modalities and anatomies. There are a total of 2292 mention chains. We describe the annotation process in detail, highlighting the complexities involved in creating a sizable and realistic dataset for radiology CDCR. We apply two baseline methods, string matching and transformer language models (BERT), to identify cross-report coreferences. Our results indicate that further model development, targeting better understanding of domain language and context, is needed to address this challenging and unexplored task. This dataset can serve as a resource for developing more advanced natural language processing CDCR methods in the future. This is one of the first attempts focusing on CDCR in the clinical domain, and it has the potential to benefit physicians and clinical research through long-term tracking of radiology findings.
Anthology ID:
2022.lrec-1.393
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
3686–3695
URL:
https://aclanthology.org/2022.lrec-1.393
Cite (ACL):
Surabhi Datta, Hio Cheng Lam, Atieh Pajouhi, Sunitha Mogalla, and Kirk Roberts. 2022. A Cross-document Coreference Dataset for Longitudinal Tracking across Radiology Reports. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 3686–3695, Marseille, France. European Language Resources Association.
Cite (Informal):
A Cross-document Coreference Dataset for Longitudinal Tracking across Radiology Reports (Datta et al., LREC 2022)
PDF:
https://aclanthology.org/2022.lrec-1.393.pdf
Data:
MIMIC-III