How can multi-modal large language models (MLLMs) augment visual-language representation learning?

Original title: MLLMs-Augmented Visual-Language Representation Learning

Authors: Yanqing Liu, Kai Wang, Wenqi Shao, Ping Luo, Yu Qiao, Mike Zheng Shou, Kaipeng Zhang, Yang You

In this article, the authors note that visual-language pre-training (VLP) has achieved strong results across multi-modal tasks, a success they attribute largely to the availability of large-scale image-text datasets. However, they argue that the quality of this data can be further improved by leveraging multi-modal large language models (MLLMs).
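For context, large-scale VLP models of this kind are typically trained with a contrastive objective over paired image and text embeddings. The sketch below shows a generic CLIP-style symmetric InfoNCE loss; it is an illustrative assumption rather than the exact objective used in the paper, and the names contrastive_loss, image_emb, and text_emb are hypothetical.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch, dim) tensors from the image and text encoders.
    """
    # Normalize so that dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; diagonal entries are the true pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```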

The authors propose a simple approach to enhance visual-language representation learning with MLLMs. They use MLLMs to generate multiple extended captions for each image, and, to avoid bias from the MLLMs' hallucinations and intrinsic caption styles, they introduce "text shearing", which keeps the extended captions at the same length as the original captions.
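The summary does not spell out how text shearing is implemented; below is a minimal sketch assuming a simple word-level truncation of each MLLM-generated caption to the original caption's length. The function name text_shearing and the whitespace tokenization are illustrative assumptions; the paper may shear at the tokenizer level instead.

```python
from typing import List

def text_shearing(original_caption: str, extended_captions: List[str]) -> List[str]:
    """Truncate each MLLM-generated caption to the word length of the original caption."""
    max_len = len(original_caption.split())
    sheared = []
    for caption in extended_captions:
        tokens = caption.split()
        sheared.append(" ".join(tokens[:max_len]))
    return sheared

# Example: MLLM rewrites sheared to the original caption's length.
original = "a dog catching a frisbee in the park"
rewrites = [
    "a brown dog leaps into the air to catch a red frisbee on a sunny afternoon",
    "an energetic dog playing fetch outdoors with its owner near some trees",
]
print(text_shearing(original, rewrites))
```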

The authors evaluated their method on image-text retrieval and found that it consistently outperformed previous approaches, with R@1 improvements of 5.6% to 35.0% in the fine-tuning setting and 16.8% to 46.1% in the zero-shot setting. These results suggest that the approach generalizes well across tasks.
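As a reference for the metric being reported, the sketch below computes Recall@K (R@1 when k = 1) for text-to-image retrieval from paired embeddings. It is a generic illustration, not the paper's evaluation code, and the names recall_at_k, image_emb, and text_emb are assumptions.

```python
import torch
import torch.nn.functional as F

def recall_at_k(image_emb: torch.Tensor, text_emb: torch.Tensor, k: int = 1) -> float:
    """Text-to-image Recall@K for paired embeddings (row i of each tensor is a pair).

    Returns the fraction of text queries whose matching image appears among the
    k most similar images.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    sims = text_emb @ image_emb.t()                       # (num_texts, num_images)
    topk = sims.topk(k, dim=-1).indices                   # indices of k nearest images
    targets = torch.arange(sims.size(0), device=sims.device).unsqueeze(-1)
    hits = (topk == targets).any(dim=-1).float()          # 1 if the true pair is in the top k
    return hits.mean().item()
```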

Original article: https://arxiv.org/abs/2311.18765