2024
pdf
bib
abs
Data Contamination Report from the 2024 CONDA Shared Task
Oscar Sainz
|
Iker García-Ferrero
|
Alon Jacovi
|
Jon Ander Campos
|
Yanai Elazar
|
Eneko Agirre
|
Yoav Goldberg
|
Wei-Lin Chen
|
Jenny Chim
|
Leshem Choshen
|
Luca D’Amico-Wong
|
Melissa Dell
|
Run-Ze Fan
|
Shahriar Golchin
|
Yucheng Li
|
Pengfei Liu
|
Bhavish Pahwa
|
Ameya Prabhu
|
Suryansh Sharma
|
Emily Silcock
|
Kateryna Solonko
|
David Stap
|
Mihai Surdeanu
|
Yu-Min Tseng
|
Vishaal Udandarao
|
Zengzhi Wang
|
Ruijie Xu
|
Jinglin Yang
Proceedings of the 1st Workshop on Data Contamination (CONDA)
The 1st Workshop on Data Contamination (CONDA 2024) focuses on all relevant aspects of data contamination in natural language processing, where data contamination is understood as situations where evaluation data is included in pre-training corpora used to train large scale models, compromising evaluation results. The workshop fostered a shared task to collect evidence on data contamination in current available datasets and models. The goal of the shared task and associated database is to assist the community in understanding the extent of the problem and to assist researchers in avoiding reporting evaluation results on known contaminated resources. The shared task provides a structured, centralized public database for the collection of contamination evidence, open to contributions from the community via GitHub pool requests. This first compilation paper is based on 566 reported entries over 91 contaminated sources from a total of 23 contributors. The details of the individual contamination events are available in the platform. The platform continues to be online, open to contributions from the community.
pdf
bib
abs
News Deja Vu: Connecting Past and Present with Semantic Search
Brevin Franklin
|
Emily Silcock
|
Abhishek Arora
|
Tom Bryan
|
Melissa Dell
Proceedings of the Sixth Workshop on Natural Language Processing and Computational Social Science (NLP+CSS 2024)
Social scientists and the general public often analyze contemporary events by drawing parallels with the past, a process complicated by the vast, noisy, and unstructured nature of historical texts. For example, hundreds of millions of page scans from historical newspapers have been noisily transcribed. Traditional sparse methods for searching for relevant material in these vast corpora, e.g., with keywords, can be brittle given complex vocabularies and OCR noise. This study introduces News Deja Vu, a novel semantic search tool that leverages transformer large language models and a bi-encoder approach to identify historical news articles that are most similar to modern news queries. News Deja Vu first recognizes and masks entities, in order to focus on broader parallels rather than the specific named entities being discussed. Then, a contrastively trained, lightweight bi-encoder retrieves historical articles that are most similar semantically to a modern query, illustrating how phenomena that might seem unique to the present have varied historical precedents. Aimed at social scientists, the user-friendly News Deja Vu package is designed to be accessible for those who lack extensive familiarity with deep learning. It works with large text datasets, and we show how it can be deployed to a massive scale corpus of historical, open-source news articles. While human expertise remains important for drawing deeper insights, News Deja Vu provides a powerful tool for exploring parallels in how people have perceived past and present.
pdf
bib
abs
Efficient OCR for Building a Diverse Digital History
Jacob Carlson
|
Tom Bryan
|
Melissa Dell
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Many users consult digital archives daily, but the information they can access is unrepresentative of the diversity of documentary history. The sequence-to-sequence architecture typically used for optical character recognition (OCR) – which jointly learns a vision and language model – is poorly extensible to low-resource document collections, as learning a language-vision model requires extensive labeled sequences and compute. This study models OCR as a character level image retrieval problem, using a contrastively trained vision encoder. Because the model only learns characters’ visual features, it is more sample efficient and extensible than existing architectures, enabling accurate OCR in settings where existing solutions fail. Crucially, it opens new avenues for community engagement in making digital history more representative of documentary history.
pdf
bib
abs
LinkTransformer: A Unified Package for Record Linkage with Transformer Language Models
Abhishek Arora
|
Melissa Dell
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Many computational analyses require linking information across noisy text datasets. While large language models (LLMs) offer significant promise, approximate string matching packages in popular statistical softwares such as R and Stata remain predominant in academic applications. These packages have simple interfaces and can be easily extended to a diversity of languages and settings, and for academic applications, ease-of-use and extensibility are essential. In contrast, packages for record linkage with LLMs require significant familiarity with deep learning frameworks and often focus on specialized applications of commercial value in English. The open-source package LinkTransformer aims to bridge this gap by providing an end-to-end software for performing record linkage and other data cleaning tasks with transformer LLMs, treating linkage as a text retrieval problem. At its core is an off-the-shelf toolkit for applying transformer models to record linkage. LinkTransformer contains a rich repository of pre-trained models for multiple languages and supports easy integration of any transformer language model from Hugging Face or OpenAI, providing the extensibility required for many scholarly applications. Its APIs also perform common data processing tasks, e.g., aggregation, noisy de-duplication, and translation-free cross-lingual linkage. LinkTransformer contains comprehensive tools for efficient model tuning, allowing for highly customized applications, and users can easily contribute their custom-trained models to its model hub to ensure reproducibility. Using a novel benchmark dataset geared towards academic applications, we show that LinkTransformer - with both custom models and Hugging Face or OpenAI models off-the-shelf - outperforms string matching by a wide margin. By combining transformer LMs with intuitive APIs, LinkTransformer aims to democratize these performance gains for those who lack familiarity with deep learning frameworks.
2023
pdf
bib
abs
Quantifying Character Similarity with Vision Transformers
Xinmei Yang
|
Abhishek Arora
|
Shao-Yu Jheng
|
Melissa Dell
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Record linkage is a bedrock of quantitative social science, as analyses often require linking data from multiple, noisy sources. Off-the-shelf string matching methods are widely used, as they are straightforward and cheap to implement and scale. Not all character substitutions are equally probable, and for some settings there are widely used handcrafted lists denoting which string substitutions are more likely, that improve the accuracy of string matching. However, such lists do not exist for many settings, skewing research with linked datasets towards a few high-resource contexts that are not representative of the diversity of human societies. This study develops an extensible way to measure character substitution costs for OCR’ed documents, by employing large-scale self-supervised training of vision transformers (ViT) with augmented digital fonts. For each language written with the CJK script, we contrastively learn a metric space where different augmentations of the same character are represented nearby. In this space, homoglyphic characters - those with similar appearance such as “O” and “0” - have similar vector representations. Using the cosine distance between characters’ representations as the substitution cost in an edit distance matching algorithm significantly improves record linkage compared to other widely used string matching methods, as OCR errors tend to be homoglyphic in nature. Homoglyphs can plausibly capture character visual similarity across any script, including low-resource settings. We illustrate this by creating homoglyph sets for 3,000 year old ancient Chinese characters, which are highly pictorial. Fascinatingly, a ViT is able to capture relationships in how different abstract concepts were conceptualized by ancient societies, that have been noted in the archaeological literature.
pdf
bib
abs
EfficientOCR: An Extensible, Open-Source Package for Efficiently Digitizing World Knowledge
Tom Bryan
|
Jacob Carlson
|
Abhishek Arora
|
Melissa Dell
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Billions of public domain documents remain trapped in hard copy or lack an accurate digitization. Modern natural language processing methods cannot be used to index, retrieve, and summarize their texts; conduct computational textual analyses; or extract information for statistical analyses, and these texts cannot be incorporated into language model training. Given the diversity and sheer quantity of public domain texts, liberating them at scale requires optical character recognition (OCR) that is accurate, extremely cheap to deploy, and sample-efficient to customize to novel collections, languages, and character sets. Existing OCR engines, largely designed for small-scale commercial applications in high resource languages, often fall short of these requirements. EffOCR (EfficientOCR), a novel open-source OCR package, meets both the computational and sample efficiency requirements for liberating texts at scale by abandoning the sequence-to-sequence architecture typically used for OCR, which takes representations from a learned vision model as inputs to a learned language model. Instead, EffOCR models OCR as a character or word-level image retrieval problem. EffOCR is cheap and sample efficient to train, as the model only needs to learn characters’ visual appearance and not how they are used in sequence to form language. Models in the EffOCR model zoo can be deployed off-the-shelf with only a few lines of code and include lightweight models designed for mobile phones that are extremely cheap to deploy. Importantly, EffOCR also allows for easy, sample efficient customization with a simple model training interface and minimal labeling requirements due to its sample efficiency. We illustrate the utility of EffOCR by cheaply and accurately digitizing 20 million historical U.S. newspaper scans, evaluating zero-shot performance on randomly selected documents from the U.S. National Archives, and accurately digitizing a Japanese document collection for which all other OCR solutions failed.
2022
pdf
bib
abs
OLALA: Object-Level Active Learning for Efficient Document Layout Annotation
Zejiang Shen
|
Weining Li
|
Jian Zhao
|
Yaoliang Yu
|
Melissa Dell
Proceedings of the Fifth Workshop on Natural Language Processing and Computational Social Science (NLP+CSS)
Layout detection is an essential step for accurately extracting structured contents from historical documents. The intricate and varied layouts present in these document images make it expensive to label the numerous layout regions that can be densely arranged on each page. Current active learning methods typically rank and label samples at the image level, where the annotation budget is not optimally spent due to the overexposure of common objects per image. Inspired by recent progress in semi-supervised learning and self-training, we propose OLALA, an Object-Level Active Learning framework for efficient document layout Annotation. OLALA aims to optimize the annotation process by selectively annotating only the most ambiguous regions within an image, while using automatically generated labels for the rest. Central to OLALA is a perturbation-based scoring function that determines which objects require manual annotation. Extensive experiments show that OLALA can significantly boost model performance and improve annotation efficiency, facilitating the extraction of masses of structured text for downstream NLP applications.