2024
Clustering-based Sampling for Few-Shot Cross-Domain Keyphrase Extraction
Prakamya Mishra, Lincy Pattanaik, Arunima Sundar, Nishant Yadav, Mayank Kulkarni
Findings of the Association for Computational Linguistics: EACL 2024
Keyphrase extraction is the task of identifying a set of keyphrases present in a document that captures its most salient topics. Scientific domain-specific pretraining has led to achieving state-of-the-art keyphrase extraction performance, with a majority of benchmarks being within the domain. In this work, we explore how to effectively enable the cross-domain generalization capabilities of such models without requiring the same scale of data. We primarily focus on the few-shot setting in non-scientific domain datasets such as OpenKP from the Web domain and StackEx from the StackExchange forum. We propose to leverage topic information intrinsically available in the data to build a novel clustering-based sampling approach that selects a few samples to label from the target domain, facilitating the building of robust and performant models. This approach leads to large gains in performance of up to 26.35 points in F1 when compared to selecting few-shot samples uniformly at random. We also explore the setting where we have access to labeled data from the model's pretraining domain corpora and perform gradual training, which involves slowly folding in target domain data to the source domain data. Here we demonstrate further improvements in model performance of up to 12.76 F1 points.
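The sampling idea in the abstract above is easy to prototype: cluster the unlabeled target-domain documents by topic embeddings, then spend the few-shot labeling budget on the document nearest each cluster centroid, so the labeled set covers the domain's topics. A minimal sketch, assuming random vectors stand in for document embeddings and plain k-means stands in for the paper's (unspecified here) clustering choice:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means: returns (centroids, cluster labels)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest centroid, then recompute means
        d = np.linalg.norm(X[:, None] - centroids[None], axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

def select_few_shot(X, k):
    """Pick, per cluster, the document closest to the centroid."""
    centroids, labels = kmeans(X, k)
    picks = []
    for j in range(k):
        idx = np.where(labels == j)[0]
        if len(idx) == 0:
            continue
        d = np.linalg.norm(X[idx] - centroids[j], axis=1)
        picks.append(int(idx[d.argmin()]))
    return sorted(picks)

# toy usage: 100 "document embeddings", labeling budget of 5
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 16))
samples = select_few_shot(X, k=5)
```

The centroid-nearest pick is one natural instantiation; the key point is that the budget is spread across topics rather than drawn uniformly at random.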
2023
Efficient k-NN Search with Cross-Encoders using Adaptive Multi-Round CUR Decomposition
Nishant Yadav, Nicholas Monath, Manzil Zaheer, Andrew McCallum
Findings of the Association for Computational Linguistics: EMNLP 2023
Cross-encoder models, which jointly encode and score a query-item pair, are prohibitively expensive for direct k-nearest neighbor (k-NN) search. Consequently, k-NN search typically employs a fast approximate retrieval (e.g., using BM25 or dual-encoder vectors), followed by reranking with a cross-encoder; however, the retrieval approximation often has detrimental recall regret. This problem is tackled by ANNCUR (Yadav et al., 2022), a recent work that employs a cross-encoder only, making search efficient using a relatively small number of anchor items and a CUR matrix factorization. While ANNCUR's one-time selection of anchors tends to approximate the cross-encoder distances on average, doing so forfeits the capacity to accurately estimate distances to items near the query, leading to regret in the crucial end-task: recall of top-k items. In this paper, we propose ADACUR, a method that adaptively, iteratively, and efficiently minimizes the approximation error for the practically important top-k neighbors. It does so by iteratively performing k-NN search using the anchors available so far, then adding these retrieved nearest neighbors to the anchor set for the next round. Empirically, on multiple datasets, in comparison to previous traditional and state-of-the-art methods such as ANNCUR and dual-encoder-based retrieve-and-rerank, our proposed approach ADACUR consistently reduces recall error (by up to 70% on the important k = 1 setting) while using no more compute than its competitors.
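The adaptive loop described above can be sketched in a few lines. In this illustration the expensive cross-encoder is a stand-in callable, the offline anchor-query score matrix plays the role of CUR's rows, and each round fits a least-squares approximation on the anchors scored so far, exactly scores the approximate top-k, and folds those items into the anchor set; the names and the plain least-squares fit are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def adacur_topk(score_fn, anchor_q_scores, init_anchors, k=1, rounds=3):
    """Adaptive anchor refinement, sketched.

    score_fn(i)     -> exact (expensive) score of the test query for item i.
    anchor_q_scores -> offline matrix: anchor queries x all items.
    """
    anchors = list(init_anchors)
    exact = {i: score_fn(i) for i in anchors}
    for _ in range(rounds):
        # fit weights w so that w @ anchor_q_scores matches the exact
        # scores on the current anchor items (a CUR-style regression)
        A = anchor_q_scores[:, anchors]
        y = np.array([exact[i] for i in anchors])
        w, *_ = np.linalg.lstsq(A.T, y, rcond=None)
        approx = w @ anchor_q_scores           # approximate scores, all items
        for i in np.argsort(-approx)[:k]:      # refine around the top-k
            i = int(i)
            if i not in exact:
                exact[i] = score_fn(i)
                anchors.append(i)
    return sorted(exact, key=exact.get, reverse=True)[:k]

# toy low-rank "cross-encoder": scores come from hidden 4-d embeddings
rng = np.random.default_rng(0)
item_emb = rng.normal(size=(4, 300))
anchor_q_scores = rng.normal(size=(16, 4)) @ item_emb   # 16 anchor queries
true_scores = rng.normal(size=4) @ item_emb             # the test query
found = adacur_topk(lambda i: float(true_scores[i]), anchor_q_scores,
                    init_anchors=range(8), k=1, rounds=2)
```

Because the retrieved neighbors become anchors, later rounds approximate distances near the query's true top-k increasingly well, which is exactly the error the one-time anchor selection fails to control.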
2022
Efficient Nearest Neighbor Search for Cross-Encoder Models using Matrix Factorization
Nishant Yadav, Nicholas Monath, Rico Angell, Manzil Zaheer, Andrew McCallum
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Efficient k-nearest neighbor search is a fundamental task, foundational for many problems in NLP. When the similarity is measured by dot-product between dual-encoder vectors or by L2 distance, many scalable and efficient search methods already exist. But not so when similarity is measured by more accurate and expensive black-box neural similarity models, such as cross-encoders, which jointly encode the query and candidate neighbor. The cross-encoders' high computational cost typically limits their use to reranking candidates retrieved by a cheaper model, such as a dual encoder or TF-IDF. However, the accuracy of such a two-stage approach is upper-bounded by the recall of the initial candidate set, and it potentially requires additional training to align the auxiliary retrieval model with the cross-encoder model. In this paper, we present an approach that avoids the use of a dual-encoder for retrieval, relying solely on the cross-encoder. Retrieval is made efficient with CUR decomposition, a matrix decomposition approach that approximates all pairwise cross-encoder distances from a small subset of rows and columns of the distance matrix. Indexing items using our approach is computationally cheaper than training an auxiliary dual-encoder model through distillation. Empirically, for k > 10, our approach provides test-time recall-vs-computational-cost trade-offs superior to the current widely used methods that rerank items retrieved using a dual-encoder or TF-IDF.
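The CUR idea can be illustrated on a toy score matrix. Below, an exactly low-rank matrix stands in for the cross-encoder's query-item scores (real score matrices are only approximately low-rank, so real recovery is approximate): a new query is scored against a handful of anchor items, and its scores for all items are reconstructed from the offline anchor-query rows and the pseudo-inverse of the intersection block. The anchor counts and the use of plain `pinv` are illustrative choices:

```python
import numpy as np

# Toy stand-in: a low-rank "cross-encoder" score matrix (queries x items).
rng = np.random.default_rng(0)
n_q, n_i, rank = 40, 200, 8
M = rng.normal(size=(n_q, rank)) @ rng.normal(size=(rank, n_i))

anchor_q = rng.choice(n_q, size=16, replace=False)  # anchor queries (rows)
anchor_i = rng.choice(n_i, size=16, replace=False)  # anchor items (columns)

R = M[anchor_q, :]                  # offline: anchor queries vs. all items
W = M[np.ix_(anchor_q, anchor_i)]   # offline: anchor queries vs. anchor items

# Test time: only 16 exact cross-encoder calls for the incoming query...
q_on_anchors = M[0, anchor_i]

# ...then the CUR reconstruction yields scores for ALL 200 items, and
# top-k retrieval runs over the reconstruction instead of the full model.
approx_row = q_on_anchors @ np.linalg.pinv(W) @ R
top_k = np.argsort(-approx_row)[:10]
```

Indexing here only requires the offline rows `R` and `W`, which is the sense in which it is cheaper than distilling an auxiliary dual-encoder.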
2021
Clustering-based Inference for Biomedical Entity Linking
Rico Angell, Nicholas Monath, Sunil Mohan, Nishant Yadav, Andrew McCallum
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Due to the large number of entities in biomedical knowledge bases, only a small fraction of entities have corresponding labelled training data. This necessitates entity linking models that are able to link mentions of unseen entities using learned representations of entities. Previous approaches link each mention independently, ignoring the relationships between entity mentions within and across documents. These relations can be very useful for linking mentions in biomedical text, where linking decisions are often difficult due to mentions having a generic or highly specialized form. In this paper, we introduce a model in which linking decisions can be made not merely by linking to a knowledge base entity but also by grouping multiple mentions together via clustering and jointly making linking predictions. In experiments on the largest publicly available biomedical dataset, we improve the best independent prediction for entity linking by 3.0 points of accuracy, and our clustering-based inference model further improves entity linking by 2.3 points.
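The benefit of joint, clustering-style inference over independent linking can be seen in a tiny sketch: each mention keeps its single strongest edge, which may point at a knowledge-base entity or at a sibling mention, and deferred mentions inherit the entity reached through the mention-mention edges. This greedy graph version is a loose illustration of the idea, not the paper's exact inference procedure, and all names are hypothetical:

```python
import numpy as np

def cluster_link(mention_entity, mention_mention):
    """Link mentions jointly: each mention keeps its strongest edge,
    to an entity or to another (easier-to-link) mention."""
    n_m = mention_entity.shape[0]
    link, parent = {}, {}
    for i in range(n_m):
        best_e = int(np.argmax(mention_entity[i]))
        mm = mention_mention[i].astype(float).copy()
        mm[i] = -np.inf                       # no self-edge
        best_m = int(np.argmax(mm))
        if mention_entity[i, best_e] >= mm[best_m]:
            link[i] = best_e                  # confident direct link
        else:
            parent[i] = best_m                # defer to a sibling mention

    def resolve(i, seen=()):
        if i in link:
            return link[i]
        if i in seen:                         # cycle: fall back to direct link
            return int(np.argmax(mention_entity[i]))
        link[i] = resolve(parent[i], seen + (i,))
        return link[i]

    return [resolve(i) for i in range(n_m)]

# toy: mention 1 is ambiguous on its own, but strongly coreferent with
# mention 0, so it inherits mention 0's confident entity link
me = np.array([[0.9, 0.1],   # mention 0 clearly refers to entity 0
               [0.2, 0.3]])  # mention 1: weak, slightly prefers entity 1
mm = np.array([[0.0, 0.8],
               [0.8, 0.0]])
links = cluster_link(me, mm)
```

Independent prediction would send mention 1 to entity 1; the coreference edge corrects it, which is the intuition behind the joint model's gains.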
Event and Entity Coreference using Trees to Encode Uncertainty in Joint Decisions
Nishant Yadav, Nicholas Monath, Rico Angell, Andrew McCallum
Proceedings of the Fourth Workshop on Computational Models of Reference, Anaphora and Coreference
Coreference decisions among event mentions and among co-occurring entity mentions are highly interdependent, thus motivating joint inference. Capturing the uncertainty over each variable can be crucial for inference among multiple dependent variables. Previous work on joint coreference employs heuristic approaches that lack well-defined objectives and do not model uncertainty on each side of the joint problem. We present a new approach to joint coreference, including (1) a formal cost function inspired by Dasgupta's cost for hierarchical clustering, and (2) a representation for the uncertainty of the clustering of event and entity mentions, again based on a hierarchical structure. We describe an alternating optimization method for inference that, when clustering event mentions, considers the uncertainty of the clustering of entity mentions, and vice versa. We show that our proposed joint model provides empirical advantages over state-of-the-art independent and joint models.
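Dasgupta's cost, which the abstract's objective draws on, scores a hierarchy by charging each similar pair the size of the subtree at whose root the pair is first split, so good trees merge similar mentions low. A small self-contained sketch of the cost itself (nested tuples stand in for the hierarchy; the representation is an illustrative assumption, not the paper's):

```python
import itertools

def dasgupta_cost(tree, sims):
    """Dasgupta's cost: sum over pairs (i, j) of
    sim(i, j) * |leaves under the pair's lowest common ancestor|.
    tree is a nested 2-tuple hierarchy; leaves are item ids."""
    def leaves(t):
        return [t] if not isinstance(t, tuple) else leaves(t[0]) + leaves(t[1])

    def cost(t):
        if not isinstance(t, tuple):
            return 0.0
        left, right = leaves(t[0]), leaves(t[1])
        n = len(left) + len(right)
        # exactly the pairs split at this node have it as their LCA
        here = sum(sims.get((min(i, j), max(i, j)), 0.0) * n
                   for i, j in itertools.product(left, right))
        return here + cost(t[0]) + cost(t[1])

    return cost(tree)

# similar pairs: (0, 1) and (2, 3); weak cross-pair similarities
sims = {(0, 1): 1.0, (2, 3): 1.0, (0, 2): 0.1, (1, 3): 0.1}
good = dasgupta_cost(((0, 1), (2, 3)), sims)  # merges similar pairs early
bad = dasgupta_cost(((0, 2), (1, 3)), sims)   # splits them at the root
```

The tree that delays splitting similar pairs gets the lower cost, which is why the cost doubles as both a clustering objective and a way to read uncertainty off the hierarchy.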
SUBSUME: A Dataset for Subjective Summary Extraction from Wikipedia Documents
Nishant Yadav, Matteo Brucato, Anna Fariha, Oscar Youngquist, Julian Killingback, Alexandra Meliou, Peter Haas
Proceedings of the Third Workshop on New Frontiers in Summarization
Many applications require the generation of summaries tailored to the user's information needs, i.e., their intent. Methods that express intent via explicit user queries fall short when query interpretation is subjective. Several datasets exist for summarization with objective intents where, for each document and intent (e.g., "weather"), a single summary suffices for all users. No datasets exist, however, for subjective intents (e.g., "interesting places"), where different users will provide different summaries. We present SUBSUME, the first dataset for evaluation of SUBjective SUMmary Extraction systems. SUBSUME contains 2,200 (document, intent, summary) triplets over 48 Wikipedia pages, with ten intents of varying subjectivity, provided by 103 individuals via Mechanical Turk. We demonstrate statistically that the intents in SUBSUME vary systematically in subjectivity. To indicate SUBSUME's usefulness, we explore a collection of baseline algorithms for subjective extractive summarization and show that (i) as expected, example-based approaches better capture subjective intents than query-based ones, and (ii) there is ample scope for improving upon the baseline algorithms, thereby motivating further research on this challenging problem.