Eugene Agichtein


2024

Combining Multiple Metrics for Evaluating Retrieval-Augmented Conversations
Jason Ingyu Choi | Marcus Collins | Eugene Agichtein | Oleg Rokhlenko | Shervin Malmasi
Proceedings of the Third Workshop on Bridging Human--Computer Interaction and Natural Language Processing

Conversational AI is a subtype of Human Computer Interaction that has gained wide adoption. These systems are typically powered by Large Language Models (LLMs) that use Retrieval Augmented Generation (RAG) to infuse external knowledge, which is effective against issues like hallucination. However, automatically evaluating retrieval augmented conversations with minimal human effort remains challenging, particularly in online settings. We address this challenge by proposing a lexical metric, and a novel method for combining it with other metrics, including semantic models. Our approach involves: (1) Conversational Information Utility (CIU), a new automated metric inspired by prior user studies on web search evaluation, to compute information overlap between conversation context and grounded information in an unsupervised, purely lexical way; and (2) a generalized reward model through Mixture-of-Experts (MoE-CIU) that dynamically ensembles CIU with other metrics, including learned ones, into a single reward. Evaluation against human ratings on two public datasets (Topical Chat and Persona Chat) shows that CIU improves correlation against human judgments by 2.0% and 0.9% respectively compared to the second best metric. When MoE is applied to combine lexical and learned semantic metrics, correlations further improve by 9.9% and 5.0%, suggesting that unified reward models are a promising approach.
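The abstract above does not define CIU, so the following is only a minimal sketch of the general idea of unsupervised, purely lexical overlap between a response and its grounding text. The function names `tokenize` and `lexical_overlap`, the tokenization rule, and the normalization by grounding vocabulary are all illustrative assumptions, not the paper's metric.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lexical_overlap(response, grounding):
    """Fraction of distinct grounding tokens that also appear in the response.

    Hypothetical stand-in for a lexical information-overlap score;
    this is NOT the CIU formula from the paper.
    """
    grounding_tokens = set(tokenize(grounding))
    if not grounding_tokens:
        return 0.0
    response_tokens = set(tokenize(response))
    return len(grounding_tokens & response_tokens) / len(grounding_tokens)


# Example: 4 of the 7 distinct grounding tokens occur in the response.
score = lexical_overlap(
    "The Eiffel Tower is in Paris.",
    "Eiffel Tower: a landmark in Paris, France.",
)
```

A score like this needs no training data, which is what makes it usable in an unsupervised setting; a practical metric would likely also handle stop words and term weighting.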

DUQGen: Effective Unsupervised Domain Adaptation of Neural Rankers by Diversifying Synthetic Query Generation
Ramraj Chandradevan | Kaustubh Dhole | Eugene Agichtein
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

State-of-the-art neural rankers pre-trained on large task-specific training data, such as MS-MARCO, have been shown to exhibit strong performance on various ranking tasks without domain adaptation, a setting also called zero-shot. However, zero-shot neural ranking may be sub-optimal, as it does not take advantage of the target domain information. Unfortunately, acquiring sufficiently large and high-quality target training data to improve a modern neural ranker can be costly and time-consuming. To address this problem, we propose a new approach to unsupervised domain adaptation for ranking, DUQGen, which addresses a critical gap in prior literature, namely how to automatically generate both effective and diverse synthetic training data to fine-tune a modern neural ranker for a new domain. Specifically, DUQGen produces a more effective representation of the target domain by identifying clusters of similar documents, and generates a more diverse training dataset by probabilistic sampling over the resulting document clusters. Our extensive experiments, over the standard BEIR collection, demonstrate that DUQGen consistently outperforms all zero-shot baselines and substantially outperforms the SOTA baselines on 16 out of 18 datasets, for an average of 4% relative improvement across all datasets. We complement our results with a thorough analysis for more in-depth understanding of the proposed method’s performance and to identify promising areas for further improvements.
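The idea of probabilistic sampling over document clusters can be sketched as follows. This is an illustration under stated assumptions, not DUQGen's actual procedure: cluster assignments are taken as given, and the function name `sample_from_clusters` and the `alpha` exponent that flattens the cluster-size weights are hypothetical choices made for this example.

```python
import random
from collections import defaultdict


def sample_from_clusters(doc_clusters, n_samples, alpha=0.5, seed=0):
    """Sample documents without replacement, one cluster draw at a time.

    doc_clusters maps doc_id -> cluster_id (clustering is assumed done).
    A cluster is drawn with probability proportional to
    (remaining size) ** alpha, then a document is drawn uniformly from it.
    alpha=1 matches uniform sampling over documents; alpha=0 treats all
    non-empty clusters equally. Both the weighting and alpha are
    illustrative assumptions.
    """
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for doc_id, cluster_id in doc_clusters.items():
        clusters[cluster_id].append(doc_id)

    n_samples = min(n_samples, len(doc_clusters))
    sampled = []
    while len(sampled) < n_samples:
        # Only clusters that still have unsampled documents are eligible.
        ids = sorted(c for c, docs in clusters.items() if docs)
        weights = [len(clusters[c]) ** alpha for c in ids]
        chosen = rng.choices(ids, weights=weights, k=1)[0]
        docs = clusters[chosen]
        sampled.append(docs.pop(rng.randrange(len(docs))))
    return sampled


# Example: nine documents spread over three clusters, four sampled.
docs = {f"d{i}": i % 3 for i in range(9)}
picked = sample_from_clusters(docs, 4)
```

With `alpha` below 1, small clusters are drawn more often than their share of the collection, which is one simple way to push the sample toward diversity.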

QueryExplorer: An Interactive Query Generation Assistant for Search and Exploration
Kaustubh Dhole | Shivam Bajaj | Ramraj Chandradevan | Eugene Agichtein
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: System Demonstrations)

Formulating effective search queries remains a challenging task, particularly when users lack expertise in a specific domain or are not proficient in the language of the content. Providing example documents of interest might be easier for a user. However, such query-by-example scenarios are prone to concept drift, and the retrieval effectiveness is highly sensitive to the query generation method, without a clear way to incorporate user feedback. To enable exploration and to support Human-In-The-Loop experiments, we propose QueryExplorer, an interactive query generation, reformulation, and retrieval interface with support for HuggingFace generation models and PyTerrier’s retrieval pipelines and datasets, and extensive logging of human feedback. To allow users to create and modify effective queries, our demo supports complementary approaches of using LLMs interactively, assisting the user with edits and feedback at multiple stages of the query formulation process. With support for recording fine-grained interactions and user annotations, QueryExplorer can serve as a valuable experimental and research platform for annotation, qualitative evaluation, and conducting Human-in-the-Loop (HITL) experiments for complex search tasks where users struggle to formulate queries.

Leveraging Interesting Facts to Enhance User Engagement with Conversational Interfaces
Nikhita Vedula | Giuseppe Castellucci | Eugene Agichtein | Oleg Rokhlenko | Shervin Malmasi
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track)

Conversational Task Assistants (CTAs) guide users in performing a multitude of activities, such as making recipes. However, ensuring that interactions remain engaging, interesting, and enjoyable for CTA users is not trivial, especially for time-consuming or challenging tasks. Grounded in psychological theories of human interest, we propose to engage users with contextual and interesting statements or facts during interactions with a multi-modal CTA, to reduce fatigue and task abandonment before a task is complete. To operationalize this idea, we train a high-performing classifier (82% F1-score) to automatically identify relevant and interesting facts for users. We use it to create an annotated dataset of task-specific interesting facts for the domain of cooking. Finally, we design and validate a dialogue policy to incorporate the identified relevant and interesting facts into a conversation, to improve user engagement and task completion. Live testing on a leading multi-modal voice assistant shows that 66% of the presented facts were received positively, leading to a 40% gain in the user satisfaction rating, and a 37% increase in conversation length. These findings emphasize that strategically incorporating interesting facts into the CTA experience can promote real-world user participation for guided task interactions.

Proceedings of the Seventh Workshop on e-Commerce and NLP @ LREC-COLING 2024
Shervin Malmasi | Besnik Fetahu | Nicola Ueffing | Oleg Rokhlenko | Eugene Agichtein | Ido Guy
Proceedings of the Seventh Workshop on e-Commerce and NLP @ LREC-COLING 2024

Collecting High-quality Multi-modal Conversational Search Data for E-Commerce
Marcus Collins | Oleg Rokhlenko | Eugene Agichtein | Shervin Malmasi
Proceedings of the 3rd Workshop on Knowledge Augmented Methods for NLP

Continued improvement of conversational assistants in knowledge-rich domains like E-Commerce requires large volumes of realistic high-quality conversation data to power increasingly sophisticated large language model chatbots, dialogue managers, response rankers, and recommenders. The problem is exacerbated for multi-modal interactions in realistic conversational product search and recommendation. Here, an artificial sales agent must interact intelligently with a customer using both textual and visual information and incorporate results from external search systems, such as a product catalog. Yet, it remains an open question how to best crowd-source large-scale, naturalistic multi-modal dialogue and action data, required to train such an artificial agent. We describe our crowd-sourced task where one worker (the Buyer) plays the role of the customer, and another (the Seller) plays the role of the sales agent. We identify subtle interactions between one worker’s environment and their partner’s behavior mediated by workers’ word choice. We find that limiting information presented to the Buyer, both in their backstory and by the Seller, improves conversation quality. We also show how conversations are improved through minimal automated Seller “coaching”. While typed and spoken messages are slightly different, the differences are not as large as frequently assumed. We plan to release our platform code and the resulting dialogues to advance research on conversational search agents.

EM_Mixers at MEDIQA-CORR 2024: Knowledge-Enhanced Few-Shot In-Context Learning for Medical Error Detection and Correction
Swati Rajwal | Eugene Agichtein | Abeed Sarker
Proceedings of the 6th Clinical Natural Language Processing Workshop

This paper describes our submission to the MEDIQA-CORR 2024 shared task for automatic identification and correction of medical errors in a given clinical text. We report results from two approaches: the first uses few-shot in-context learning (ICL) with a Large Language Model (LLM), and the second extends the idea with knowledge-enhanced few-shot ICL. We used the Azure OpenAI GPT-4 API as the LLM and Wikipedia as the external knowledge source. We report evaluation metrics (accuracy, ROUGE, BERTScore, BLEURT) across both approaches for validation and test datasets. Of the two approaches implemented, our experimental results show that the knowledge-enhanced few-shot ICL approach with GPT-4 performed better on error flag detection (subtask A) and error sentence detection (subtask B), with accuracies of 68% and 64%, respectively, on the test dataset. These results positioned us fourth in subtask A and second in subtask B in the shared task.

2022

Proceedings of the Fifth Workshop on e-Commerce and NLP (ECNLP 5)
Shervin Malmasi | Oleg Rokhlenko | Nicola Ueffing | Ido Guy | Eugene Agichtein | Surya Kallumadi
Proceedings of the Fifth Workshop on e-Commerce and NLP (ECNLP 5)

Wizard of Tasks: A Novel Conversational Dataset for Solving Real-World Tasks in Conversational Settings
Jason Ingyu Choi | Saar Kuzi | Nikhita Vedula | Jie Zhao | Giuseppe Castellucci | Marcus Collins | Shervin Malmasi | Oleg Rokhlenko | Eugene Agichtein
Proceedings of the 29th International Conference on Computational Linguistics

Conversational Task Assistants (CTAs) are conversational agents whose goal is to help humans perform real-world tasks. CTAs can help in exploring available tasks, answering task-specific questions and guiding users through step-by-step instructions. In this work, we present Wizard of Tasks, the first corpus of such conversations in two domains: Cooking and Home Improvement. We crowd-sourced a total of 549 conversations (18,077 utterances) with an asynchronous Wizard-of-Oz setup, relying on recipes from WholeFoods Market for the cooking domain, and WikiHow articles for the home improvement domain. We present a detailed data analysis and show that the collected data can be a valuable and challenging resource for CTAs in two tasks: Intent Classification (IC) and Abstractive Question Answering (AQA). While on IC we acquired a high performing model (>85% F1), on AQA the performance is far from being satisfactory (~27% BertScore-F1), suggesting that more work is needed to solve the task of low-resource AQA.

2021

Identifying Helpful Sentences in Product Reviews
Iftah Gamzu | Hila Gonen | Gilad Kutiel | Ran Levy | Eugene Agichtein
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

In recent years, online shopping has gained momentum and become an important venue for customers wishing to save time and simplify their shopping process. A key advantage of shopping online is the ability to read what other customers are saying about products of interest. In this work, we aim to maintain this advantage in situations where extreme brevity is needed, for example, when shopping by voice. We suggest a novel task of extracting a single representative helpful sentence from a set of reviews for a given product. The selected sentence should meet two conditions: first, it should be helpful for a purchase decision, and second, the opinion it expresses should be supported by multiple reviewers. This task is closely related to the task of Multi Document Summarization in the product reviews domain but differs in its objective and its level of conciseness. We collect a dataset in English of sentence helpfulness scores via crowd-sourcing and demonstrate its reliability despite the inherent subjectivity involved. Next, we describe a complete model that extracts representative helpful sentences with positive and negative sentiment towards the product and demonstrate that it outperforms several baselines.

You Sound Like Someone Who Watches Drama Movies: Towards Predicting Movie Preferences from Conversational Interactions
Sergey Volokhin | Joyce Ho | Oleg Rokhlenko | Eugene Agichtein
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

The increasing popularity of voice-based personal assistants provides new opportunities for conversational recommendation. One particularly interesting area is movie recommendation, which can benefit from an open-ended interaction with the user, through a natural conversation. We explore one promising direction for conversational recommendation: mapping a conversational user, for whom there is limited or no data available, to most similar external reviewers, whose preferences are known, by representing the conversation as a user’s interest vector, and adapting collaborative filtering techniques to estimate the current user’s preferences for new movies. We call our proposed method ConvExtr (Conversational Collaborative Filtering using External Data), which 1) infers a user’s sentiment towards an entity from the conversation context, and 2) transforms the ratings of “similar” external reviewers to predict the current user’s preferences. We implement these steps by adapting contextual sentiment prediction techniques, and domain adaptation, respectively. To evaluate our method, we develop and make available a finely annotated dataset of movie recommendation conversations, which we call MovieSent. Our results demonstrate that ConvExtr can improve the accuracy of predicting users’ ratings for new movies by exploiting conversation content and external data.

VoiSeR: A New Benchmark for Voice-Based Search Refinement
Simone Filice | Giuseppe Castellucci | Marcus Collins | Eugene Agichtein | Oleg Rokhlenko
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Voice assistants, e.g., Alexa or Google Assistant, have dramatically improved in recent years. Supporting voice-based search, exploration, and refinement are fundamental tasks for voice assistants, and remain an open challenge. For example, when using voice to search an online shopping site, a user often needs to refine their search by some aspect or facet. This common user intent is usually available through a “filter-by” interface on online shopping websites, but is challenging to support naturally via voice, as the intent of refinements must be interpreted in the context of the original search, the initial results, and the available product catalogue facets. To our knowledge, no benchmark dataset exists for training or validating such contextual search understanding models. To bridge this gap, we introduce the first large-scale dataset of voice-based search refinements, VoiSeR, consisting of about 10,000 search refinement utterances, collected using a novel crowdsourcing task. These utterances are intended to refine a previous search, with respect to a search facet or attribute (e.g., brand, color, review rating, etc.), and are manually annotated with the specific intent. This paper reports qualitative and empirical insights into the most common and challenging types of refinements that a voice-based conversational search system must support. As we show, VoiSeR can support research in conversational query understanding, contextual user intent prediction, and other conversational search topics to facilitate the development of conversational search systems.

Proceedings of the 4th Workshop on e-Commerce and NLP
Shervin Malmasi | Surya Kallumadi | Nicola Ueffing | Oleg Rokhlenko | Eugene Agichtein | Ido Guy
Proceedings of the 4th Workshop on e-Commerce and NLP

2020

Proceedings of the 3rd Workshop on e-Commerce and NLP
Shervin Malmasi | Surya Kallumadi | Nicola Ueffing | Oleg Rokhlenko | Eugene Agichtein | Ido Guy
Proceedings of the 3rd Workshop on e-Commerce and NLP

2017

EviNets: Neural Networks for Combining Evidence Signals for Factoid Question Answering
Denis Savenkov | Eugene Agichtein
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

A critical task for question answering is the final answer selection stage, which has to combine multiple signals available about each answer candidate. This paper proposes EviNets: a novel neural network architecture for factoid question answering. EviNets scores candidate answer entities by combining the available supporting evidence, e.g., structured knowledge bases and unstructured text documents. EviNets represents each piece of evidence with a dense embedding vector, scores its relevance to the question, and aggregates the support for each candidate to predict its final score. Each of the components is generic and allows plugging in a variety of models for semantic similarity scoring and information aggregation. We demonstrate the effectiveness of EviNets in experiments on the existing TREC QA and WikiMovies benchmarks, and on the new Yahoo! Answers dataset introduced in this paper. EviNets can be extended to other information types and could facilitate future work on combining evidence signals for joint reasoning in question answering.

2016

Crowdsourcing for (almost) Real-time Question Answering
Denis Savenkov | Scott Weitzner | Eugene Agichtein
Proceedings of the Workshop on Human-Computer Question Answering

2015

Relation Extraction from Community Generated Question-Answer Pairs
Denis Savenkov | Wei-Lwun Lu | Jeff Dalton | Eugene Agichtein
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop

2014

Towards Tracking Political Sentiment through Microblog Data
Yu Wang | Tom Clark | Jeffrey Staton | Eugene Agichtein
Proceedings of the Joint Workshop on Social Dynamics and Personal Attributes in Social Media

2013

The Answer is at your Fingertips: Improving Passage Retrieval for Web Question Answering with Search Behavior Data
Mikhail Ageev | Dmitry Lagun | Eugene Agichtein
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

2010

The “Nays” Have It: Exploring Effects of Sentiment in Collaborative Knowledge Sharing
Ablimit Aji | Eugene Agichtein
Proceedings of the NAACL HLT 2010 Workshop on Computational Linguistics in a World of Social Media

Towards Automatic Question Answering over Social Media by Learning Question Equivalence Patterns
Tianyong Hao | Wenyin Liu | Eugene Agichtein
Proceedings of the NAACL HLT 2010 Workshop on Computational Linguistics in a World of Social Media

Query Ambiguity Revisited: Clickthrough Measures for Distinguishing Informational and Ambiguous Queries
Yu Wang | Eugene Agichtein
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

2008

CoCQA: Co-Training over Questions and Answers with an Application to Predicting Question Subjectivity Orientation
Baoli Li | Yandong Liu | Eugene Agichtein
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

You’ve Got Answers: Towards Personalized Models for Predicting Success in Community Question Answering
Yandong Liu | Eugene Agichtein
Proceedings of ACL-08: HLT, Short Papers

Tutorial Abstracts of ACL-08: HLT
Ani Nenkova | Marilyn Walker | Eugene Agichtein
Tutorial Abstracts of ACL-08: HLT

1998

NYU: Description of the MENE Named Entity System as Used in MUC-7
Andrew Borthwick | John Sterling | Eugene Agichtein | Ralph Grishman
Seventh Message Understanding Conference (MUC-7): Proceedings of a Conference Held in Fairfax, Virginia, April 29 - May 1, 1998

Exploiting Diverse Knowledge Sources via Maximum Entropy in Named Entity Recognition
Andrew Borthwick | John Sterling | Eugene Agichtein | Ralph Grishman
Sixth Workshop on Very Large Corpora