RECANTFormer: Referring Expression Comprehension with Varying Numbers of Targets

Bhathiya Hemanthage, Hakan Bilen, Phil Bartie, Christian Dondrup, Oliver Lemon


Abstract
The Generalized Referring Expression Comprehension (GREC) task extends classic REC by generating image bounding boxes for objects referred to in natural language expressions, which may indicate zero, one, or multiple targets. This generalization enhances the practicality of REC models for diverse real-world applications. However, the presence of varying numbers of targets in samples makes GREC a more complex task, in terms of both training supervision and the strategy for selecting final predictions. To address these challenges, we introduce RECANTFormer, a one-stage method for GREC that combines a decoder-free (encoder-only) transformer architecture with DETR-like Hungarian matching. Our approach consistently outperforms baselines by significant margins on three GREC datasets.
Anthology ID:
2024.emnlp-main.1214
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
21784–21798
URL:
https://aclanthology.org/2024.emnlp-main.1214/
DOI:
10.18653/v1/2024.emnlp-main.1214
Cite (ACL):
Bhathiya Hemanthage, Hakan Bilen, Phil Bartie, Christian Dondrup, and Oliver Lemon. 2024. RECANTFormer: Referring Expression Comprehension with Varying Numbers of Targets. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 21784–21798, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
RECANTFormer: Referring Expression Comprehension with Varying Numbers of Targets (Hemanthage et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.1214.pdf