Can PG-Video-LLaVA Improve Video-Language Models with Pixel Grounding?

Original title: PG-Video-LLaVA: Pixel Grounding Large Video-Language Models

Authors: Shehan Munasinghe, Rusiru Thushara, Muhammad Maaz, Hanoona Abdul Rasheed, Salman Khan, Mubarak Shah, Fahad Khan

Building Large Multimodal Models (LMMs) for video is challenging because of the added complexity of the video domain. Existing video-based LMMs lack robust grounding and make little use of audio cues. To fill these gaps, the paper introduces PG-Video-LLaVA, presented as the first video-based LMM with pixel-level grounding capabilities. The framework transcribes audio into text and folds it into the prompt, enriching video-context understanding, and combines an off-the-shelf tracker with a novel grounding module to spatially localize the objects referred to in user instructions. Evaluations across diverse tasks, together with newly introduced benchmarks, show strong object-grounding performance. The authors also propose Vicuna in place of GPT-3.5 for video-based conversation benchmarking, keeping results reproducible despite the proprietary nature of GPT-3.5. Built on top of the image-based LLaVA model, PG-Video-LLaVA extends its strengths to the video domain and shows notable gains on video-based conversation and grounding tasks. Project details are available via the link below.
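
To make the described pipeline concrete, below is a minimal data-flow sketch of the two ideas in the summary: folding transcribed audio into the prompt and grounding the model's answer with a tracker. The only real dependency is the `openai-whisper` package, used here as a stand-in ASR model; `VideoLMM`, `Grounder`, `Tracker`, and `answer_and_ground` are hypothetical placeholders for the paper's actual components, so treat this as an illustration of the data flow rather than the authors' implementation.

```python
# A minimal sketch, not the authors' implementation.
# Real dependency: openai-whisper (pip install openai-whisper).
# Hypothetical pieces (assumptions): the VideoLMM, Grounder, and Tracker
# protocols below stand in for PG-Video-LLaVA's actual components.
from typing import Protocol

import whisper


class VideoLMM(Protocol):
    def generate(self, video_path: str, prompt: str) -> str: ...


class Grounder(Protocol):
    def ground(self, video_path: str, phrases: list[str]) -> list[dict]: ...


class Tracker(Protocol):
    def link(self, detections: list[dict]) -> list[dict]: ...


def answer_and_ground(
    video_path: str,
    user_prompt: str,
    lmm: VideoLMM,
    grounder: Grounder,
    tracker: Tracker,
) -> tuple[str, list[dict]]:
    # 1) Transcribe the audio track and fold it into the text prompt,
    #    mirroring the idea of enriching video context with audio cues.
    transcript = whisper.load_model("base").transcribe(video_path)["text"]
    prompt = f"Audio transcript: {transcript}\nUser: {user_prompt}"

    # 2) Ask the video LMM for a textual answer about the video.
    answer = lmm.generate(video_path, prompt)

    # 3) Ground phrases from the answer: per-frame detections from the
    #    grounding module, linked over time by an off-the-shelf tracker.
    phrases = [answer]  # in practice, key noun phrases would be extracted here
    detections = grounder.ground(video_path, phrases)
    tracks = tracker.link(detections)
    return answer, tracks
```

The dependency-injected protocols keep the sketch self-contained: any detector, tracker, or video LMM implementing these minimal interfaces could be plugged in to reproduce the overall flow the summary describes.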

Original article: https://arxiv.org/abs/2311.13435