Contrastive Entity Coreference and Disambiguation for Historical Texts

Abhishek Arora, Emily Silcock, Melissa Dell, Leander Heldring


Abstract
Massive-scale historical document collections are crucial for social science research. Despite increasing digitization, these documents typically lack unique cross-document identifiers for the individuals mentioned in them, as well as identifiers linking those individuals to external knowledge bases such as Wikipedia and Wikidata. Existing entity disambiguation methods often fall short in accuracy on historical documents, which are replete with individuals not remembered in contemporary knowledge bases. This study makes three key contributions to improve cross-document coreference resolution and disambiguation in historical texts: a massive-scale training dataset rich in hard negatives, which sources over 190 million entity pairs from Wikipedia contexts and disambiguation pages; high-quality evaluation data from hand-labeled historical newswire articles; and trained models evaluated on this historical benchmark. We contrastively train bi-encoder models for coreferencing and disambiguating individuals in historical texts, achieving accurate, scalable performance that also identifies individuals who are absent from the knowledge base. Our approach significantly surpasses other entity disambiguation models on our historical newswire benchmark. Our models also demonstrate competitive performance on modern entity disambiguation benchmarks, particularly on certain news disambiguation datasets.
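The abstract describes contrastive training of a bi-encoder and disambiguation that can flag out-of-knowledge-base individuals. The following is only a minimal sketch of that general idea, not the authors' implementation: it assumes mention and entity embeddings are already available as NumPy arrays, uses a standard in-batch contrastive (InfoNCE-style) loss, and treats a mention as out-of-knowledge-base when its best cosine similarity falls below a cutoff. The function names, the temperature of 0.05, and the threshold of 0.8 are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def info_nce_loss(mentions, positives, temperature=0.05):
    """In-batch contrastive loss (assumed form, not taken from the paper).

    Row i of `positives` is the matching context for row i of `mentions`;
    every other row in the batch serves as a negative.
    """
    m = l2_normalize(mentions)
    p = l2_normalize(positives)
    logits = m @ p.T / temperature
    # Subtract the row maximum for numerical stability before the softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


def disambiguate(mention_vec, kb_vecs, kb_ids, threshold=0.8):
    """Return (entity id, similarity) for the nearest knowledge-base entry.

    Returns (None, similarity) when the best match is below `threshold`,
    which this sketch interprets as an out-of-knowledge-base individual.
    """
    sims = l2_normalize(kb_vecs) @ l2_normalize(mention_vec)
    best = int(np.argmax(sims))
    if sims[best] < threshold:
        return None, float(sims[best])
    return kb_ids[best], float(sims[best])


# Toy usage with made-up three-dimensional embeddings and placeholder ids.
kb_vecs = np.eye(3)
kb_ids = ["Q1", "Q2", "Q3"]
linked, linked_sim = disambiguate(np.array([0.9, 0.1, 0.0]), kb_vecs, kb_ids)
unlinked, unlinked_sim = disambiguate(np.array([1.0, 1.0, 1.0]), kb_vecs, kb_ids)
aligned_loss = info_nce_loss(kb_vecs, kb_vecs)
shuffled_loss = info_nce_loss(kb_vecs, np.roll(kb_vecs, 1, axis=0))
```

In this toy setup the first mention lies close to a single entry and is linked, while the second is equally far from all entries and is left unlinked. In practice the threshold would be tuned on labeled data rather than fixed in advance.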
Anthology ID:
2024.emnlp-main.355
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
6174–6186
URL:
https://aclanthology.org/2024.emnlp-main.355/
DOI:
10.18653/v1/2024.emnlp-main.355
Cite (ACL):
Abhishek Arora, Emily Silcock, Melissa Dell, and Leander Heldring. 2024. Contrastive Entity Coreference and Disambiguation for Historical Texts. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 6174–6186, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Contrastive Entity Coreference and Disambiguation for Historical Texts (Arora et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.355.pdf
Software:
 2024.emnlp-main.355.software.zip
Data:
 2024.emnlp-main.355.data.zip