Edit photos from a single visual example — no text required, say authors

March 27, 2026 · arXiv: 2603.25441v1

The paper presents Visual Diffusion Conditioning (VDC), a way to edit images using a visual example instead of text instructions. The authors start from a finding they describe as surprising: state-of-the-art text-guided diffusion models often fail at simple, everyday edits such as adding rain or blur. They trace this failure to weak and inconsistent language supervision during training, which weakens the alignment between words and visual changes.

Rather than relying on words, VDC learns a conditioning signal directly from a pair of images: one with the target effect and one without it. That visual condition captures the transformation the user wants. The method then steers a pre-trained diffusion model — a kind of generative model that creates images by gradually removing noise — so that the model applies the same change to new photos. The authors also add an inversion-correction step to reduce reconstruction errors that happen when mapping real images into the model’s internal representation (a common step often called DDIM inversion).
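To make the inversion problem concrete, here is a minimal sketch of plain deterministic DDIM inversion followed by sampling. It is not the paper's method and does not implement VDC's correction step. The "image" is a single number, the noise schedule is made up, and the two noise predictors (`constant_eps`, `state_dependent_eps`) are hypothetical stand-ins for a trained network.

```python
import math

T = 50  # number of steps in this toy schedule (an assumption, not from the paper)

def alpha_bar(t):
    """Toy cumulative signal level: 1.0 at t=0 (clean), about 0.02 at t=T (mostly noise)."""
    return 1.0 - 0.98 * t / T

def ddim_step(x, t_from, t_to, eps_fn):
    """One deterministic DDIM update that moves x from timestep t_from to t_to."""
    a_from, a_to = alpha_bar(t_from), alpha_bar(t_to)
    eps = eps_fn(x, t_from)  # noise predicted at the current state
    # Estimate the clean signal implied by the current state and predicted noise.
    x0_pred = (x - math.sqrt(1.0 - a_from) * eps) / math.sqrt(a_from)
    # Re-noise that estimate to the target timestep using the same predicted noise.
    return math.sqrt(a_to) * x0_pred + math.sqrt(1.0 - a_to) * eps

def invert(x0, eps_fn):
    """DDIM inversion: walk a clean input up to the noisy end of the schedule."""
    x = x0
    for t in range(T):
        x = ddim_step(x, t, t + 1, eps_fn)
    return x

def sample(x_T, eps_fn):
    """DDIM sampling: walk a noisy latent back down to t=0."""
    x = x_T
    for t in range(T, 0, -1):
        x = ddim_step(x, t, t - 1, eps_fn)
    return x

def roundtrip_error(x0, eps_fn):
    """Absolute difference between the input and its invert-then-sample reconstruction."""
    return abs(sample(invert(x0, eps_fn), eps_fn) - x0)

def constant_eps(x, t):
    """Hypothetical predictor that ignores the state; the round trip is then exact."""
    return 0.5

def state_dependent_eps(x, t):
    """Hypothetical predictor that depends on the state, as a real network does."""
    return math.tanh(x)

if __name__ == "__main__":
    print("constant predictor error:       ", roundtrip_error(1.0, constant_eps))
    print("state-dependent predictor error:", roundtrip_error(1.0, state_dependent_eps))
```

With a predictor that ignores its input, each sampling step exactly undoes the matching inversion step. Once the predictor depends on the state, the inversion evaluates it at a different point than the sampler does, so the reconstruction is only approximate. That gap is the kind of reconstruction error an inversion-correction step is meant to shrink.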

VDC is training-free, which means it does not require extra fine-tuning of the underlying diffusion model or large additional datasets. According to the paper, this makes the approach more cost-efficient than methods that need heavy re-training or stronger text conditioning. The authors report that VDC outperforms both other training-free methods and some fully fine-tuned text-based editors across a range of editing tasks. They also provide code and models as open source.

There are important limits to keep in mind. VDC needs a paired example (the same image before and after the effect), so it cannot be used where no such pair exists. The paper also addresses reconstruction errors from DDIM inversion; the proposed correction reduces those errors, but the authors do not claim it eliminates them. Finally, the abstract reports that VDC outperforms other methods without giving numbers, so readers should check the full paper for detailed evaluations.