Original title: Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model
Authors: Kai Yang, Jian Tao, Jiafei Lyu, Chunjiang Ge, Jiaxin Chen, Qimai Li, Weihan Shen, Xiaolong Zhu, Xiu Li
The article examines how to fine-tune diffusion models with human feedback without relying on any reward model. Conventional approaches first fit a reward model to preference data, which demands large amounts of annotated data and careful manual tuning before fine-tuning can begin. The proposed method, Direct Preference for Denoising Diffusion Policy Optimization (D3PO), sidesteps this step: it treats the denoising process as a multi-step decision problem and updates the diffusion model directly from human preference comparisons between generated images. Despite omitting an explicit reward model, D3PO achieves performance comparable to methods trained against ground-truth rewards. It also reduces image distortion rates and improves image safety, objectives that are hard to address when no robust reward model is available. By using human feedback directly to guide learning, the approach is both cost-effective and computationally efficient.
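To make the idea concrete, the following is a minimal PyTorch-style sketch of the kind of per-step preference loss D3PO optimizes: a DPO-like objective applied to denoising transitions, comparing the model being fine-tuned against a frozen reference model on human-preferred versus rejected generations. The function name, argument names, and the beta value are illustrative assumptions, not taken from the authors' implementation.

import torch
import torch.nn.functional as F

def d3po_preference_loss(logp_theta_win, logp_ref_win,
                         logp_theta_lose, logp_ref_lose,
                         beta=0.1):
    # DPO-style objective for one denoising step. Each argument holds the
    # log-probability of the sampled transition (latent_t -> latent_{t-1})
    # under either the model being fine-tuned (theta) or the frozen
    # reference model, for the human-preferred ("win") and rejected
    # ("lose") generations of the same prompt.
    log_ratio_win = logp_theta_win - logp_ref_win
    log_ratio_lose = logp_theta_lose - logp_ref_lose
    # Raise the likelihood of transitions that produced the preferred image
    # relative to those that produced the rejected one, with no explicit
    # reward model anywhere in the loop.
    return -F.logsigmoid(beta * (log_ratio_win - log_ratio_lose)).mean()

# Example with dummy log-probabilities for a batch of 8 preference pairs.
lp_theta_win = torch.randn(8, requires_grad=True)
lp_theta_lose = torch.randn(8, requires_grad=True)
lp_ref_win, lp_ref_lose = torch.randn(8), torch.randn(8)
loss = d3po_preference_loss(lp_theta_win, lp_ref_win, lp_theta_lose, lp_ref_lose)
loss.backward()  # gradients flow only through the fine-tuned model's log-probs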
Original article: https://arxiv.org/abs/2311.13231