Original title: TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification
Authors: Qinying Liu, Kecheng Zheng, Wu Wei, Zhan Tong, Yu Liu, Wei Chen, Zilei Wang, Yujun Shen
In this article, the authors discuss the challenge of aligning visual and linguistic information to improve the learning of vision-language models. They highlight the problem of coarse alignment, where the vision encoder struggles to accurately localize specific objects mentioned in text descriptions. To address this issue, the authors propose a simple approach that relies only on image-text pairs, without requiring any additional data formats. Their method parses objects and attributes from the text description, on the assumption that these are likely to appear in the corresponding image. Importantly, this parsing process is fully automatic, making it scalable. The parsed semantics then serve as supervision signals that complement the image-text contrastive loss with a multi-tag classification loss. Through extensive experiments on a range of semantic segmentation datasets, the authors show that their framework outperforms existing alternatives by an average of 3.65%. Visualization results further indicate that attribute supervision helps vision-language models accurately localize attribute-specified objects.
Original article: https://arxiv.org/abs/2312.14149
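To make the training objective concrete, here is a minimal PyTorch sketch (not the authors' code) of pairing a CLIP-style image-text contrastive loss with a multi-tag classification loss. It assumes the tag vocabulary embeddings and the per-caption binary tag labels have already been produced by the automatic parsing step; all tensor shapes and names are illustrative.

```python
# Illustrative sketch: contrastive loss + multi-tag classification loss.
# Assumes tags (objects/attributes) were already parsed from captions.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Standard symmetric InfoNCE loss over a batch of image-text pairs."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature            # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def multi_tag_loss(image_emb, tag_emb, tag_labels, temperature=0.07):
    """Multi-label classification over a tag vocabulary.

    tag_emb:    (T, D) embeddings of the tag vocabulary (objects/attributes).
    tag_labels: (B, T) binary matrix, 1 if the tag was parsed from the caption.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    tag_emb = F.normalize(tag_emb, dim=-1)
    tag_logits = image_emb @ tag_emb.t() / temperature         # (B, T)
    return F.binary_cross_entropy_with_logits(tag_logits, tag_labels.float())

# Toy usage with random features: batch of 8 pairs, 512-d embeddings, 100 tags.
B, D, T = 8, 512, 100
image_emb = torch.randn(B, D)
text_emb = torch.randn(B, D)
tag_emb = torch.randn(T, D)
tag_labels = (torch.rand(B, T) < 0.05).float()

total = contrastive_loss(image_emb, text_emb) + multi_tag_loss(image_emb, tag_emb, tag_labels)
print(total.item())
```

In this sketch the two losses are simply summed; the actual weighting, temperature values, and how tag embeddings are obtained follow the paper's implementation and are not reproduced here.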