Original title: MSG-BART: Multi-granularity Scene Graph-Enhanced Encoder-Decoder Language Model for Video-grounded Dialogue Generation
Authors: Hongcheng Liu, Zhe Chen, Hui Li, Pingjie Wang, Yanfeng Wang, Yu Wang
The article addresses the challenge of generating dialogue responses grounded in videos, emphasizing the limitations of existing visual-language models in understanding video scenes. The proposed method, MSG-BART, enhances video understanding by integrating a multi-granularity spatio-temporal scene graph into an encoder-decoder pre-trained language model: the global scene graph is incorporated into the encoder and the local scene graph into the decoder, improving overall perception and target reasoning, respectively. To strengthen information selection, a multi-pointer network facilitates choosing between text and video content. Extensive experiments on three video-grounded dialogue benchmarks show that MSG-BART significantly outperforms a range of state-of-the-art approaches, demonstrating that structured visual context can be leveraged effectively for video-grounded dialogue generation.
Original article: https://arxiv.org/abs/2311.12820
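The multi-pointer network can be pictured as a pointer-generator-style output head that mixes the standard vocabulary distribution with copy distributions over the dialogue text and the scene-graph node labels. The PyTorch sketch below is illustrative only, not the authors' implementation: the class name `MultiPointerHead`, the three-way gate, the dot-product copy attention, and all tensor shapes are assumptions made for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiPointerHead(nn.Module):
    """Hypothetical pointer-generator-style head mixing three distributions:
    (1) the ordinary vocabulary softmax, (2) a copy distribution over
    source-text tokens, and (3) a copy distribution over scene-graph
    node labels. All names and shapes here are assumptions."""

    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.vocab_proj = nn.Linear(d_model, vocab_size)
        # Small gate emitting three mixing weights: generate, copy-from-text,
        # copy-from-scene-graph.
        self.gate = nn.Linear(d_model, 3)

    def forward(
        self,
        dec_state: torch.Tensor,    # (batch, d_model) current decoder state
        text_states: torch.Tensor,  # (batch, T_txt, d_model) encoded dialogue text
        text_ids: torch.Tensor,     # (batch, T_txt) vocab ids of text tokens
        node_states: torch.Tensor,  # (batch, T_sg, d_model) scene-graph node embeddings
        node_ids: torch.Tensor,     # (batch, T_sg) vocab ids of node labels
    ) -> torch.Tensor:
        # 1) Ordinary generation distribution over the vocabulary.
        p_vocab = F.softmax(self.vocab_proj(dec_state), dim=-1)

        # 2) Copy attention over source-text tokens, scattered back to
        #    vocabulary ids so all distributions share one support.
        txt_attn = F.softmax(
            torch.einsum("bd,btd->bt", dec_state, text_states), dim=-1)
        p_copy_txt = torch.zeros_like(p_vocab).scatter_add_(-1, text_ids, txt_attn)

        # 3) Copy attention over scene-graph node labels.
        sg_attn = F.softmax(
            torch.einsum("bd,bsd->bs", dec_state, node_states), dim=-1)
        p_copy_sg = torch.zeros_like(p_vocab).scatter_add_(-1, node_ids, sg_attn)

        # 4) Learned soft selection among the three sources.
        w = F.softmax(self.gate(dec_state), dim=-1)  # (batch, 3)
        return (w[:, 0:1] * p_vocab
                + w[:, 1:2] * p_copy_txt
                + w[:, 2:3] * p_copy_sg)
```

At each decoding step, the gate lets the model softly decide whether to generate a word from the vocabulary, copy it from the dialogue history, or copy it from the video's scene graph, which is one plausible reading of "selection between text and video" described in the abstract.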