Multimodal Language Models Show Evidence of Embodied Simulation

Cameron R. Jones, Sean Trott


Abstract
Multimodal large language models (MLLMs) are gaining popularity as partial solutions to the “symbol grounding problem” faced by language models trained on text alone. However, little is known about whether and how these multiple modalities are integrated. We draw inspiration from analogous work in human psycholinguistics on embodied simulation, i.e., the hypothesis that language comprehension is grounded in sensorimotor representations. We show that MLLMs are sensitive to implicit visual features like object shape (e.g., “The egg was in the skillet” implies a frying egg rather than one in a shell). This suggests that MLLMs activate implicit information about object shape when it is implied by a verbal description of an event. We find mixed results for color and orientation, and rule out the possibility that this is due to models’ insensitivity to those features in our dataset overall. We suggest that both human psycholinguistics and computational models of language could benefit from cross-pollination, e.g., with the potential to establish whether grounded representations play a functional role in language processing.
Anthology ID:
2024.lrec-main.1041
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
SIG:
Publisher:
ELRA and ICCL
Note:
Pages:
11928–11933
Language:
URL:
https://aclanthology.org/2024.lrec-main.1041
DOI:
Bibkey:
Cite (ACL):
Cameron R. Jones and Sean Trott. 2024. Multimodal Language Models Show Evidence of Embodied Simulation. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 11928–11933, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Multimodal Language Models Show Evidence of Embodied Simulation (Jones & Trott, LREC-COLING 2024)
Copy Citation:
PDF:
https://aclanthology.org/2024.lrec-main.1041.pdf