Revisiting Sample Size Determination in Natural Language Understanding

Ernie Chang, Muhammad Hassan Rashid, Pin-Jie Lin, Changsheng Zhao, Vera Demberg, Yangyang Shi, Vikas Chandra


Abstract
Knowing exactly how many data points need to be labeled to achieve a certain model performance is a hugely beneficial step towards reducing overall annotation budgets. It pertains to both active learning and traditional data annotation, and is particularly beneficial for low-resource scenarios. Nevertheless, it remains a largely under-explored area of research in NLP. We therefore explored various techniques for estimating the training sample size necessary to achieve a targeted performance value. We derived a simple yet effective approach to predict the maximum achievable model performance based on a small amount of training samples, which serves as an early indicator during data annotation for data quality and sample size determination. We performed ablation studies on four language understanding tasks, and showed that the proposed approach allows us to forecast model performance within a small margin of mean absolute error (~0.9%) with only 10% of the data.
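The abstract does not spell out the estimation procedure, so the following is only an illustrative sketch of the general idea of learning-curve extrapolation, not the authors' method. It assumes an inverse power-law curve, score(n) ≈ a − b·n^(−c), where a plays the role of the maximum achievable performance. The function names `fit_learning_curve` and `samples_needed`, the exponent grid, and the example numbers are all hypothetical and chosen for illustration.

```python
import math


def fit_learning_curve(sizes, scores, exponents=None):
    """Fit score(n) ~ a - b * n**(-c).

    Assumed functional form (inverse power law). For each candidate
    exponent c, a and b follow from ordinary least squares on
    x = n**(-c); the c with the smallest squared error is kept.
    """
    if len(sizes) != len(scores) or len(sizes) < 3:
        raise ValueError("need at least three (size, score) pairs")
    if exponents is None:
        # Hypothetical search grid: c = 0.05, 0.10, ..., 2.00
        exponents = [k / 100 for k in range(5, 201, 5)]

    m = len(sizes)
    mean_y = sum(scores) / m
    best = None
    for c in exponents:
        xs = [n ** (-c) for n in sizes]
        mean_x = sum(xs) / m
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
        slope = sxy / sxx
        a = mean_y - slope * mean_x  # estimated performance ceiling
        b = -slope
        sse = sum((a - b * x - y) ** 2 for x, y in zip(xs, scores))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    _, a, b, c = best
    return a, b, c


def samples_needed(a, b, c, target):
    """Smallest n with a - b * n**(-c) >= target, or None if unreachable."""
    if target >= a:
        return None  # target is at or above the estimated ceiling
    return math.ceil((b / (a - target)) ** (1 / c))


# Synthetic pilot measurements (made up for illustration only).
pilot_sizes = [100, 200, 400, 800, 1600]
pilot_scores = [0.9 - 2.0 * n ** (-0.5) for n in pilot_sizes]

a, b, c = fit_learning_curve(pilot_sizes, pilot_scores)
print("estimated ceiling:", round(a, 3))
print("samples for 0.88:", samples_needed(a, b, c, 0.88))
```

In practice the pilot scores would come from models trained on increasing subsets of the labeled data, and a noisy curve would call for more points or a proper nonlinear fit.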
Anthology ID:
2023.findings-acl.419
Volume:
Findings of the Association for Computational Linguistics: ACL 2023
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
6716–6724
URL:
https://aclanthology.org/2023.findings-acl.419
DOI:
10.18653/v1/2023.findings-acl.419
Cite (ACL):
Ernie Chang, Muhammad Hassan Rashid, Pin-Jie Lin, Changsheng Zhao, Vera Demberg, Yangyang Shi, and Vikas Chandra. 2023. Revisiting Sample Size Determination in Natural Language Understanding. In Findings of the Association for Computational Linguistics: ACL 2023, pages 6716–6724, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Revisiting Sample Size Determination in Natural Language Understanding (Chang et al., Findings 2023)
PDF:
https://aclanthology.org/2023.findings-acl.419.pdf