CRAPES:Cross-modal Annotation Projection for Visual Semantic Role Labeling

Abhidip Bhattacharyya, Martha Palmer, Christoffer Heckman


Abstract
Automatic image comprehension is an important yet challenging task that includes identifying actions in an image and the corresponding action participants. Most current approaches to this task, now termed Grounded Situation Recognition (GSR), start by predicting a verb that describes the action and then predict the nouns that can participate in the action as arguments to the verb. This problem formulation limits each image to a single action, even though several actions could be depicted. In contrast, text-based Semantic Role Labeling (SRL) aims to label all actions in a sentence, typically resulting in at least two or three predicate-argument structures per sentence. We hypothesize that expanding GSR to follow the more liberal SRL text-based approach to action and participant identification could improve image comprehension results. To test this hypothesis and to preserve generalization capabilities, we use general-purpose vision and language components as a front end. This paper presents our results, a substantial 28.6-point jump in performance on the SWiG dataset, which confirms our hypothesis. We also discuss the benefits of loosely coupled, broad-coverage, off-the-shelf components that generalize well to out-of-domain images and can decrease the need for manual image semantic role annotation.
Anthology ID:
2023.starsem-1.7
Volume:
Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023)
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Alexis Palmer, Jose Camacho-Collados
Venue:
*SEM
SIG:
SIGLEX
Publisher:
Association for Computational Linguistics
Pages:
61–70
URL:
https://aclanthology.org/2023.starsem-1.7
DOI:
10.18653/v1/2023.starsem-1.7
Cite (ACL):
Abhidip Bhattacharyya, Martha Palmer, and Christoffer Heckman. 2023. CRAPES:Cross-modal Annotation Projection for Visual Semantic Role Labeling. In Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023), pages 61–70, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
CRAPES:Cross-modal Annotation Projection for Visual Semantic Role Labeling (Bhattacharyya et al., *SEM 2023)
PDF:
https://aclanthology.org/2023.starsem-1.7.pdf