A Data-centric Framework for Improving Domain-specific Machine Reading Comprehension Datasets

Iva Bojic, Josef Halim, Verena Suharman, Sreeja Tar, Qi Chwen Ong, Duy Phung, Mathieu Ravaut, Shafiq Joty, Josip Car


Abstract
Low-quality data can cause downstream problems in high-stakes applications. A data-centric approach emphasizes improving dataset quality to enhance model performance. High-quality datasets are needed for training general-purpose Large Language Models (LLMs) as well as domain-specific models. Domain-specific datasets are usually small because it is costly to engage a large number of domain experts to create them. Thus, it is vital to ensure high-quality domain-specific training data. In this paper, we propose a framework for enhancing the data quality of original datasets (code and dataset are available at https://github.com/IvaBojic/framework). We applied the proposed framework to four biomedical datasets and showed a relative improvement of up to 33% and 40% when fine-tuning retrieval and reader models, respectively, on the BioASQ dataset, using back translation to enhance the original dataset quality.
Anthology ID:
2023.insights-1.3
Volume:
Proceedings of the Fourth Workshop on Insights from Negative Results in NLP
Month:
May
Year:
2023
Address:
Dubrovnik, Croatia
Editors:
Shabnam Tafreshi, Arjun Akula, João Sedoc, Aleksandr Drozd, Anna Rogers, Anna Rumshisky
Venues:
insights | WS
Publisher:
Association for Computational Linguistics
Pages:
19–32
URL:
https://aclanthology.org/2023.insights-1.3
DOI:
10.18653/v1/2023.insights-1.3
Cite (ACL):
Iva Bojic, Josef Halim, Verena Suharman, Sreeja Tar, Qi Chwen Ong, Duy Phung, Mathieu Ravaut, Shafiq Joty, and Josip Car. 2023. A Data-centric Framework for Improving Domain-specific Machine Reading Comprehension Datasets. In Proceedings of the Fourth Workshop on Insights from Negative Results in NLP, pages 19–32, Dubrovnik, Croatia. Association for Computational Linguistics.
Cite (Informal):
A Data-centric Framework for Improving Domain-specific Machine Reading Comprehension Datasets (Bojic et al., insights-WS 2023)
PDF:
https://aclanthology.org/2023.insights-1.3.pdf
Video:
https://aclanthology.org/2023.insights-1.3.mp4