Can large language models spell out object locations accurately at any granularity?

Original title: Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models

Authors: Yufei Zhan, Yousong Zhu, Zhiyang Chen, Fan Yang, Ming Tang, Jinqiao Wang

In this article, the researchers address the challenge of building models that locate objects in images from free-form text descriptions. Most current localization models can identify only a single, pre-existing referred object per instruction and depend on task-specific designs, which constrains the architecture and often requires extra specialized models or modules. The researchers instead explore the untapped localization potential of large vision language models (LVLMs) and their ability to accurately identify and locate objects of interest. They build a language-prompted localization dataset and use it to train Griffon, an end-to-end LVLM that unifies object perception and location awareness without special tokens or additional modules. Through comprehensive experiments, they demonstrate that Griffon achieves strong performance across diverse localization scenarios and approaches the capabilities of expert models in object detection. This research showcases the potential of LVLMs for advancing object perception tasks.
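To make the "no special tokens" idea concrete, the sketch below shows how object locations emitted as ordinary text might be consumed downstream. This is a hypothetical illustration only: the summary does not specify Griffon's exact output convention, so the example response string and the `label-[x1, y1, x2, y2]` format (normalized coordinates) are assumptions.

```python
import re

# Hypothetical LVLM response that "spells out" object locations as plain
# text, with no special location tokens. The label-[x1, y1, x2, y2]
# format is an assumption for illustration, not Griffon's documented output.
response = "person-[0.12, 0.30, 0.45, 0.98] person-[0.50, 0.28, 0.81, 0.97]"

# Match a textual label followed by four bracketed floats.
BOX_PATTERN = re.compile(
    r"(?P<label>[\w ]+)-\[(?P<coords>[\d.]+(?:,\s*[\d.]+){3})\]"
)

def parse_boxes(text):
    """Extract (label, [x1, y1, x2, y2]) pairs from free-form model text."""
    boxes = []
    for match in BOX_PATTERN.finditer(text):
        coords = [float(v) for v in match.group("coords").split(",")]
        boxes.append((match.group("label").strip(), coords))
    return boxes

for label, box in parse_boxes(response):
    print(label, box)
# person [0.12, 0.3, 0.45, 0.98]
# person [0.5, 0.28, 0.81, 0.97]
```

The appeal of this design, as the summary describes it, is that localization rides on the model's existing text interface: no vocabulary extensions or bolt-on detection heads are needed, and the same output channel can carry one box or many at whatever granularity the prompt requests.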

Original article: https://arxiv.org/abs/2311.14552