Student’s t-Distribution: On Measuring the Inter-Rater Reliability When the Observations are Scarce

Serge Gladkoff, Lifeng Han, Goran Nenadic


Abstract
In natural language processing (NLP), we routinely rely on human judgement as the gold standard for quality evaluation. However, there is an ongoing debate about how best to measure inter-rater reliability (IRR) for certain evaluation tasks, such as translation quality evaluation (TQE), especially when data samples (observations) are very scarce. In this work, we first examine how to estimate the confidence interval of a measurement when only one data (evaluation) point is available. This leads to our example with two human-generated observational scores, for which we introduce the “Student’s t-Distribution” method and explain how to use it to measure the IRR score from only these two data points, together with the confidence intervals (CIs) of the quality evaluation. We give a quantitative analysis of how evaluation confidence can be greatly improved by introducing more observations, even just one extra observation. We encourage researchers to report their IRR scores by all possible means, e.g. using the Student’s t-Distribution method whenever possible, thus making NLP evaluation more meaningful, transparent, and trustworthy. The t-Distribution method can also be used outside NLP to measure IRR levels for trustworthy evaluation of experimental investigations whenever observational data are scarce.
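To make the idea in the abstract concrete, the following is a minimal sketch of a textbook t-based confidence interval for the mean of two or three rater scores. It is not necessarily the authors' exact procedure: the function names `t_critical` and `t_confidence_interval` and the example scores are hypothetical, and the critical values use the closed-form quantiles of Student's t that exist for 1 and 2 degrees of freedom, so only the standard library is needed.

```python
import math
import statistics


def t_critical(df, confidence=0.95):
    """Two-sided critical value of Student's t, closed form for df = 1 or 2."""
    p = 1 - (1 - confidence) / 2  # upper quantile level, e.g. 0.975 for 95%
    if df == 1:
        # With 1 degree of freedom, Student's t is the standard Cauchy distribution.
        return math.tan(math.pi * (p - 0.5))
    if df == 2:
        # Closed-form quantile of Student's t with 2 degrees of freedom.
        return (2 * p - 1) / math.sqrt(2 * p * (1 - p))
    raise ValueError("this sketch only covers df = 1 or df = 2")


def t_confidence_interval(scores, confidence=0.95):
    """Return (lower, upper) bounds of the t-based CI for the mean of `scores`."""
    n = len(scores)
    mean = statistics.mean(scores)
    # Standard error of the mean, using the sample standard deviation (n - 1).
    sem = statistics.stdev(scores) / math.sqrt(n)
    margin = t_critical(n - 1, confidence) * sem
    return mean - margin, mean + margin


# Hypothetical quality scores from two raters, then with a third rater added.
two_raters = [80.0, 90.0]
three_raters = [80.0, 90.0, 85.0]

for scores in (two_raters, three_raters):
    low, high = t_confidence_interval(scores)
    print(f"n={len(scores)}: mean={statistics.mean(scores):.1f}, "
          f"95% CI=({low:.2f}, {high:.2f})")
```

With these made-up scores, the critical value falls from about 12.71 at two observations to about 4.30 at three, so the interval narrows sharply, which is the kind of gain from a single extra observation that the abstract describes.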
Anthology ID:
2023.ranlp-1.47
Volume:
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing
Month:
September
Year:
2023
Address:
Varna, Bulgaria
Editors:
Ruslan Mitkov, Galia Angelova
Venue:
RANLP
Publisher:
INCOMA Ltd., Shoumen, Bulgaria
Pages:
419–428
URL:
https://aclanthology.org/2023.ranlp-1.47
Cite (ACL):
Serge Gladkoff, Lifeng Han, and Goran Nenadic. 2023. Student’s t-Distribution: On Measuring the Inter-Rater Reliability When the Observations are Scarce. In Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing, pages 419–428, Varna, Bulgaria. INCOMA Ltd., Shoumen, Bulgaria.
Cite (Informal):
Student’s t-Distribution: On Measuring the Inter-Rater Reliability When the Observations are Scarce (Gladkoff et al., RANLP 2023)
PDF:
https://aclanthology.org/2023.ranlp-1.47.pdf