Are Visual-Linguistic Models Commonsense Knowledge Bases?

Hsiu-Yu Yang, Carina Silberer


Abstract
Despite the recent success of pretrained language models as on-the-fly knowledge sources for various downstream tasks, they have been shown to inadequately represent trivial common facts that vision typically captures. This limits their applicability to natural language understanding tasks that require commonsense knowledge. We seek to determine the capability of pretrained visual-linguistic models to serve as on-demand knowledge sources. To this end, we systematically compare language-only and visual-linguistic models on a zero-shot commonsense question answering inference task. We find that visual-linguistic models hold great promise for text-only tasks involving certain types of commonsense knowledge associated with the visual world. Surprisingly, this knowledge can be activated even when no visual input is given at inference time, suggesting effective multimodal fusion during pretraining. However, we also reveal substantial room for improvement towards better cross-modal reasoning abilities and pretraining strategies for event understanding.
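To illustrate the probing setup in spirit, the sketch below shows zero-shot multiple-choice QA scoring with a pretrained language model. GPT-2 (via HuggingFace Transformers) stands in for a language-only model; the model choice, the "Question: ... Answer: ..." prompt template, and the length-normalized log-likelihood scoring rule are illustrative assumptions, not the authors' exact protocol (see the linked code repository for that).

```python
# Minimal sketch: zero-shot multiple-choice QA via LM log-likelihood scoring.
# Assumptions (not the paper's exact setup): GPT-2 as the language-only model,
# a simple prompt template, and per-token mean log-likelihood as the score.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def answer_score(question: str, answer: str) -> float:
    """Score a candidate answer by the LM's (length-normalized) log-likelihood."""
    text = f"Question: {question} Answer: {answer}"
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing `labels` makes the model return the mean cross-entropy
        # over tokens, i.e. the negative per-token log-likelihood.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return -loss.item()  # higher = more plausible to the LM

question = "What color is a ripe banana?"
candidates = ["yellow", "blue", "loud"]
best = max(candidates, key=lambda a: answer_score(question, a))
print(best)  # the LM's zero-shot pick, with no task-specific training
```

The same scoring scheme can be applied to a visual-linguistic model (with or without an image input) to compare how much visually grounded commonsense knowledge each model type activates, which is the kind of contrast the paper investigates.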
Anthology ID:
2022.coling-1.491
Volume:
Proceedings of the 29th International Conference on Computational Linguistics
Month:
October
Year:
2022
Address:
Gyeongju, Republic of Korea
Editors:
Nicoletta Calzolari, Chu-Ren Huang, Hansaem Kim, James Pustejovsky, Leo Wanner, Key-Sun Choi, Pum-Mo Ryu, Hsin-Hsi Chen, Lucia Donatelli, Heng Ji, Sadao Kurohashi, Patrizia Paggio, Nianwen Xue, Seokhwan Kim, Younggyun Hahm, Zhong He, Tony Kyungil Lee, Enrico Santus, Francis Bond, Seung-Hoon Na
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
5542–5559
URL:
https://aclanthology.org/2022.coling-1.491
Cite (ACL):
Hsiu-Yu Yang and Carina Silberer. 2022. Are Visual-Linguistic Models Commonsense Knowledge Bases?. In Proceedings of the 29th International Conference on Computational Linguistics, pages 5542–5559, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Cite (Informal):
Are Visual-Linguistic Models Commonsense Knowledge Bases? (Yang & Silberer, COLING 2022)
PDF:
https://aclanthology.org/2022.coling-1.491.pdf
Code
mallory24/cs_probing
Data
Conceptual Captions
Visual Genome