High-Order Semantic Alignment for Unsupervised Fine-Grained Image-Text Retrieval

Rui Gao, Miaomiao Cheng, Xu Han, Wei Song


Abstract
Cross-modal retrieval is an important yet challenging task due to the semantic discrepancy between visual content and language. To measure the correlation between images and text, most existing research focuses on learning either global or local correspondence, failing to explore fine-grained local-global alignment. To infer more accurate similarity scores, we introduce a novel High-Order Semantic Alignment (HOSA) model that provides complementary and comprehensive semantic clues. Specifically, to jointly learn global and local alignment and emphasize local-global interaction, we employ the tensor product (t-product) operation to reconstruct one modality's representation from another modality's information in a common semantic space. Such a cross-modal reconstruction strategy significantly enhances inter-modal correlation learning in a fine-grained manner. Extensive experiments on two benchmark datasets validate that our model significantly outperforms several state-of-the-art baselines, especially in retrieving the most relevant results.
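The central operation named in the abstract is the tensor t-product used for cross-modal reconstruction. Below is a minimal NumPy sketch of the standard t-product (frontal-slice products in the Fourier domain along the third mode) plus a hypothetical reconstruction step; the tensor shapes, the alignment tensor W, and how HOSA actually wires the t-product into its model are assumptions for illustration, not the authors' released code.

```python
import numpy as np

def t_product(A, B):
    """Tensor t-product of A (n1 x n2 x n3) with B (n2 x n4 x n3).

    Computed via FFT along the third (tube) mode, slice-wise matrix
    products in the Fourier domain, then an inverse FFT.
    """
    assert A.shape[1] == B.shape[0] and A.shape[2] == B.shape[2]
    Af = np.fft.fft(A, axis=2)
    Bf = np.fft.fft(B, axis=2)
    # C[:, :, k] = A_hat[:, :, k] @ B_hat[:, :, k] for every frontal slice k
    Cf = np.einsum('ijk,jlk->ilk', Af, Bf)
    return np.real(np.fft.ifft(Cf, axis=2))

# Hypothetical cross-modal reconstruction: rebuild image-region features
# from text-token features through an alignment tensor W (illustrative only).
n_regions, n_tokens, d = 36, 20, 8
text_feats = np.random.randn(n_tokens, 1, d)      # text tensor in a shared space
W = np.random.randn(n_regions, n_tokens, d)       # stand-in for a learned tensor
reconstructed_regions = t_product(W, text_feats)  # shape (n_regions, 1, d)
```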
Anthology ID:
2024.lrec-main.714
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
8155–8165
URL:
https://aclanthology.org/2024.lrec-main.714
Cite (ACL):
Rui Gao, Miaomiao Cheng, Xu Han, and Wei Song. 2024. High-Order Semantic Alignment for Unsupervised Fine-Grained Image-Text Retrieval. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 8155–8165, Torino, Italia. ELRA and ICCL.
Cite (Informal):
High-Order Semantic Alignment for Unsupervised Fine-Grained Image-Text Retrieval (Gao et al., LREC-COLING 2024)
PDF:
https://aclanthology.org/2024.lrec-main.714.pdf