Kazuki Hayashi
2025
IRR: Image Review Ranking Framework for Evaluating Vision-Language Models
Kazuki Hayashi | Kazuma Onishi | Toma Suzuki | Yusuke Ide | Seiji Gobara | Shigeki Saito | Yusuke Sakai | Hidetaka Kamigaito | Katsuhiko Hayashi | Taro Watanabe
Proceedings of the 31st International Conference on Computational Linguistics
Large-scale Vision-Language Models (LVLMs) process both images and text, excelling in multimodal tasks such as image captioning and description generation. However, while these models are strong at generating factual content, their ability to generate and evaluate texts that reflect differing perspectives on the same image, depending on the context, has not been sufficiently explored. To address this, we propose IRR: Image Review Ranking, a novel evaluation framework designed to assess critic review texts from multiple perspectives. IRR evaluates LVLMs by measuring how closely their judgments align with human interpretations. We validate it using a dataset of images from 15 categories, each paired with five critic review texts and annotated rankings in both English and Japanese, totaling over 2,000 data instances. Our results indicate that, although LVLMs exhibited consistent performance across languages, their correlation with human annotations was insufficient, underscoring the need for further advancements. These findings highlight the limitations of current evaluation methods and the need for approaches that better capture human reasoning in Vision & Language tasks.
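As an informal illustration of the ranking-based comparison described above, the sketch below scores how closely a model-produced ranking of the five critic reviews for one image agrees with the human-annotated ranking. The use of Spearman's rank correlation and the helper name `ranking_agreement` are assumptions for illustration, not necessarily the paper's exact protocol.

```python
# Minimal sketch of comparing a model ranking against a human ranking.
# Assumption: agreement is measured with Spearman's rho; the paper's exact
# correlation measure and aggregation over images/categories may differ.
from scipy.stats import spearmanr

def ranking_agreement(model_ranking, human_ranking):
    """Spearman's rho between two rank lists (1 = best) over the same
    five critic reviews of a single image."""
    rho, _ = spearmanr(model_ranking, human_ranking)
    return rho

# Hypothetical ranks for reviews r1..r5 of one image.
human_ranks = [1, 2, 3, 4, 5]
model_ranks = [2, 1, 3, 5, 4]
print(ranking_agreement(model_ranks, human_ranks))  # 0.8
```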
2024
Towards Artwork Explanation in Large-scale Vision Language Models
Kazuki Hayashi | Yusuke Sakai | Hidetaka Kamigaito | Katsuhiko Hayashi | Taro Watanabe
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Large-scale Vision-Language Models (LVLMs) output text from images and instructions, demonstrating advanced capabilities in text generation and comprehension. However, it has not been clarified to what extent LVLMs understand the knowledge necessary for explaining images, the complex relationships between various pieces of knowledge, and how they integrate these understandings into their explanations. To address this issue, we propose a new task: the artwork explanation generation task, along with its evaluation dataset and metric for quantitatively assessing the understanding and utilization of knowledge about artworks. This task is apt for image description based on the premise that LVLMs are expected to have pre-existing knowledge of artworks, which are often subjects of wide recognition and documented information. It consists of two parts: generating explanations from both images and titles of artworks, and generating explanations using only images, thus evaluating the LVLMs’ language-based and vision-based knowledge. Alongside, we release a training dataset for LVLMs to learn explanations that incorporate knowledge about artworks. Our findings indicate that LVLMs not only struggle with integrating language and visual information but also exhibit a more pronounced limitation in acquiring knowledge from images alone. The datasets (ExpArt = Explain Artworks) are available at https://huggingface.co/datasets/naist-nlp/ExpArt
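Since the abstract names the dataset repository explicitly, the hedged sketch below shows how the two evaluation settings (explanation from image plus title versus image only) might be set up. The `load_dataset` call targets the repository from the URL above, but the split and field names (`train`, `title`) are hypothetical placeholders, not confirmed by the source.

```python
# Hedged sketch of the two task settings: (1) image + title, (2) image only.
# The repository name comes from the abstract; split and field names are
# assumptions for illustration and may not match the actual dataset schema.
from datasets import load_dataset

ds = load_dataset("naist-nlp/ExpArt", split="train")  # split name assumed

def build_instruction(example, use_title=True):
    """Compose an LVLM instruction under one of the two settings."""
    if use_title:
        # Language-based setting: the title gives textual access to knowledge.
        return f"Explain the artwork titled '{example['title']}'."
    # Vision-based setting: knowledge must be recovered from the image alone.
    return "Explain the artwork shown in the image."

example = ds[0]
prompt_with_title = build_instruction(example, use_title=True)
prompt_image_only = build_instruction(example, use_title=False)
```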