Video Discourse Parsing and Its Application to Multimodal Summarization: A Dataset and Baseline Approaches

Tsutomu Hirao, Naoki Kobayashi, Hidetaka Kamigaito, Manabu Okumura, Akisato Kimura


Abstract
This paper tackles a new task: discourse parsing for videos, inspired by text discourse parsing based on Rhetorical Structure Theory (RST). The task aims to construct an RST tree for a video to represent its storyline and illustrate the event relationships. We first construct a benchmark dataset by identifying events with their time spans, providing corresponding captions, and constructing RST trees with events as leaves. We then evaluate baseline approaches to video RST parsing: the ‘parsing after captioning’ framework and parsing via visual features. The results show that a parser using gold captions performed the best, while parsers relying on generated captions performed the worst; a parser using visual features provided intermediate performance. However, we observed that parsing via visual features could be improved by pre-training it with video captioning designed to produce a coherent video story. Furthermore, we demonstrated that RST trees obtained from videos contribute to multimodal summarization consisting of keyframes with texts.
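As a rough illustration of the structure the abstract describes (events with time spans and captions as leaves of an RST tree), the following minimal Python sketch may help. It is not the authors' data format: the class names, field names, the "Sequence" relation label, and the example captions are all hypothetical, and the nuclearity codes only follow the general RST convention of marking each child as nucleus (N) or satellite (S).

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """Leaf of the tree: one video event with its time span and caption."""
    start_sec: float
    end_sec: float
    caption: str


@dataclass
class RSTNode:
    """Internal node joining adjacent spans under a rhetorical relation."""
    relation: str    # hypothetical label; the paper's label set is not given here
    nuclearity: str  # "NS", "SN", or "NN", following general RST convention
    children: list = field(default_factory=list)


def leaves(node):
    """Return the events under a node in left-to-right (temporal) order."""
    if isinstance(node, Event):
        return [node]
    collected = []
    for child in node.children:
        collected.extend(leaves(child))
    return collected


# Invented example data, for illustration only.
e1 = Event(0.0, 4.5, "A man picks up a guitar.")
e2 = Event(4.5, 12.0, "He tunes the strings.")
e3 = Event(12.0, 30.0, "He plays a song.")
tree = RSTNode("Sequence", "NN", [RSTNode("Sequence", "NN", [e1, e2]), e3])

captions = [event.caption for event in leaves(tree)]
```

Reading the leaves back in order recovers the event captions, which is the kind of input a 'parsing after captioning' pipeline would pass to a text RST parser under this assumed representation.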
Anthology ID:
2024.findings-emnlp.581
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
9943–9958
URL:
https://aclanthology.org/2024.findings-emnlp.581/
DOI:
10.18653/v1/2024.findings-emnlp.581
Cite (ACL):
Tsutomu Hirao, Naoki Kobayashi, Hidetaka Kamigaito, Manabu Okumura, and Akisato Kimura. 2024. Video Discourse Parsing and Its Application to Multimodal Summarization: A Dataset and Baseline Approaches. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 9943–9958, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Video Discourse Parsing and Its Application to Multimodal Summarization: A Dataset and Baseline Approaches (Hirao et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-emnlp.581.pdf