How does Visual In-Context Prompting work?

Original title: Visual In-Context Prompting

Authors: Feng Li, Qing Jiang, Hao Zhang, Tianhe Ren, Shilong Liu, Xueyan Zou, Huaizhe Xu, Hongyang Li, Chunyuan Li, Jianwei Yang, Lei Zhang, Jianfeng Gao

This article summarizes a method called “Visual In-Context Prompting,” which brings the idea of in-context prompting, popularized by large language models (LLMs), to the vision domain. While existing visual prompting methods focus on referring segmentation, i.e., segmenting the single most relevant object, this framework targets a broader range of generic vision tasks such as open-set segmentation and detection. It builds on an encoder-decoder architecture with a versatile prompt encoder that supports prompts such as strokes, boxes, and points, and it can additionally take an arbitrary number of reference image segments as context. Experiments show strong referring and generic segmentation capabilities, with competitive performance on in-domain datasets and promising results on open-set segmentation tasks. For instance, by jointly training on the COCO and SA-1B datasets, the model achieves a PQ of $57.7$ on COCO and $23.2$ on ADE20K. The authors state that code will be made publicly available.
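To make the architecture more concrete, below is a minimal, self-contained sketch (in PyTorch) of how an in-context prompt encoder and decoder could be wired together: reference masks are mask-pooled over reference-image features to form prompt queries, which cross-attend to the target image's features and are then matched against per-pixel embeddings to produce mask logits. This is a simplified illustration of the general idea, not the authors' implementation; all class and parameter names (`PromptEncoder`, `InContextSegmenter`, `embed_dim`, etc.) are assumptions made for this example.

```python
# Minimal sketch of visual in-context prompting, NOT the paper's actual code.
import torch
import torch.nn as nn


class PromptEncoder(nn.Module):
    """Turns reference-image features plus reference masks into prompt query
    embeddings by average-pooling the features inside each reference segment."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, ref_feats: torch.Tensor, ref_masks: torch.Tensor) -> torch.Tensor:
        # ref_feats: (B, C, H, W) features of the reference image
        # ref_masks: (B, K, H, W) binary masks, one per in-context example
        feats = ref_feats.flatten(2)                              # (B, C, H*W)
        masks = ref_masks.flatten(2).float()                      # (B, K, H*W)
        masks = masks / masks.sum(-1, keepdim=True).clamp(min=1.0)
        prompts = torch.einsum("bkn,bcn->bkc", masks, feats)      # (B, K, C)
        return self.proj(prompts)                                 # prompt queries


class InContextSegmenter(nn.Module):
    """Toy encoder-decoder: prompt queries cross-attend to target-image features,
    then each refined query is dotted with per-pixel embeddings to yield masks."""

    def __init__(self, embed_dim: int = 256, num_heads: int = 8, num_layers: int = 3):
        super().__init__()
        self.prompt_encoder = PromptEncoder(embed_dim)
        layer = nn.TransformerDecoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.pixel_proj = nn.Conv2d(embed_dim, embed_dim, kernel_size=1)

    def forward(self, tgt_feats, ref_feats, ref_masks):
        # tgt_feats: (B, C, H, W) features of the image to segment
        queries = self.prompt_encoder(ref_feats, ref_masks)       # (B, K, C)
        memory = tgt_feats.flatten(2).transpose(1, 2)             # (B, H*W, C)
        queries = self.decoder(queries, memory)                   # (B, K, C)
        pixel_emb = self.pixel_proj(tgt_feats)                    # (B, C, H, W)
        # Dot each query with every pixel embedding -> per-prompt mask logits
        return torch.einsum("bkc,bchw->bkhw", queries, pixel_emb)


if __name__ == "__main__":
    B, C, H, W, K = 1, 256, 32, 32, 2
    model = InContextSegmenter(embed_dim=C)
    tgt = torch.randn(B, C, H, W)          # target-image features
    ref = torch.randn(B, C, H, W)          # reference-image features
    masks = torch.rand(B, K, H, W) > 0.5   # K in-context reference segments
    print(model(tgt, ref, masks).shape)    # torch.Size([1, 2, 32, 32])
```

In this sketch the in-context examples enter only through the pooled prompt queries, which is what lets the same decoder handle an arbitrary number of reference segments; strokes, boxes, or points could be encoded the same way by rasterizing them into masks before pooling.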

Original article: https://arxiv.org/abs/2311.13601