From pixels to world models: a five-level roadmap for intelligent visual generation
This paper argues that image generators should move beyond producing realistic pictures toward producing visuals that respect structure, evolve coherently over time, and obey cause and effect. The authors note that current models are very good at photorealism, rendering legible text, following simple instructions, and basic interactive editing, but that they still fail at tasks requiring spatial reasoning, a persistent scene state, long-term consistency, and causal understanding.
To frame this shift, the paper lays out a five-level taxonomy of visual generation. The levels begin with Atomic Generation, basic image rendering, then progress through Conditional Generation and In-Context Generation to Agentic Generation and World-Modeling Generation. The later levels describe systems that are interactive, behave like agents, and maintain a richer, world-aware internal model of the scenes they generate.
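To make the ordering concrete, here is a minimal Python sketch that encodes the five levels as an ordered enum. The values, the one-line glosses on each level, and the `requires_persistent_state` helper are illustrative assumptions based on my reading of the level names, not an artifact from the paper; the helper only captures the idea that the later levels demand a lasting scene state.

```python
from enum import IntEnum

class GenerationLevel(IntEnum):
    """The paper's five-level taxonomy, ordered by increasing capability."""
    ATOMIC = 1          # basic image rendering
    CONDITIONAL = 2     # generation steered by explicit conditions
    IN_CONTEXT = 3      # generation guided by examples supplied in context
    AGENTIC = 4         # interactive, agent-like generation
    WORLD_MODELING = 5  # maintains a world-aware internal scene model

def requires_persistent_state(level: GenerationLevel) -> bool:
    # Hypothetical cutoff: the paper describes the agentic and
    # world-modeling levels as the ones needing a lasting scene state.
    return level >= GenerationLevel.AGENTIC
```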
The authors analyze the technical trends that could enable this shift. On the training side they discuss flow matching, unified models that combine understanding and generation, stronger visual representations, and post-training approaches. They also highlight complementary tools: reward modeling (learning from feedback), careful data curation, synthetic data distillation, and faster sampling methods. Each is presented as a lever that can help models learn structure, dynamics, and higher-level knowledge instead of only surface appearance.
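Of these levers, flow matching is the most concrete to illustrate. Below is a minimal, self-contained PyTorch sketch of the standard conditional flow-matching objective, not any specific recipe from the paper: a small network is trained to predict the straight-line velocity that transports noise toward a data sample. `VelocityNet` and the toy dimensions are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Tiny velocity field v_theta(x_t, t) over flattened samples."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, 256), nn.SiLU(),
            nn.Linear(256, 256), nn.SiLU(),
            nn.Linear(256, dim),
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Concatenate the scalar time onto each sample.
        return self.net(torch.cat([x_t, t], dim=-1))

def flow_matching_loss(model: VelocityNet, x1: torch.Tensor) -> torch.Tensor:
    x0 = torch.randn_like(x1)        # noise endpoint
    t = torch.rand(x1.size(0), 1)    # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1      # linear interpolation path
    target = x1 - x0                 # constant velocity along that path
    return ((model(x_t, t) - target) ** 2).mean()

# Usage: one gradient step on a toy batch of flattened "images".
model = VelocityNet(dim=64)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
opt.zero_grad()
loss = flow_matching_loss(model, torch.randn(32, 64))
loss.backward()
opt.step()
```

At sampling time the learned velocity field is integrated from noise to data, which is why flow matching pairs naturally with the faster, few-step sampling methods the authors also highlight.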
A key point is that current evaluations often overestimate progress: many benchmarks reward how realistic an image looks while missing failures in structure, temporal dynamics, and cause and effect. To address this, the authors combine a review of existing benchmarks with stress tests on real examples and expert-guided case studies, presenting the mix as a capability-centered way to measure and steer future work.
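As a hedged illustration of what capability-centered scoring could look like, the sketch below reports a pass rate per capability axis instead of one realism number, so a structural or causal failure cannot hide behind a high average. The `Scorecard` class, the axis names, and the placeholder probes are all hypothetical; the paper's actual protocol is the benchmark review, stress tests, and case studies described above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Probe = Callable[[object], bool]  # returns True if the sample passes

@dataclass
class Scorecard:
    probes: Dict[str, List[Probe]]  # capability axis -> pass/fail probes

    def evaluate(self, sample: object) -> Dict[str, float]:
        # Per-axis pass rates keep failure modes visible instead of
        # averaging them into a single aggregate score.
        return {
            axis: sum(p(sample) for p in ps) / len(ps)
            for axis, ps in self.probes.items()
        }

# Placeholder probes standing in for real checks such as object-count
# consistency, identity persistence across frames, or effects that
# correctly follow interventions.
card = Scorecard(probes={
    "structure": [lambda s: True],
    "temporal":  [lambda s: True],
    "causal":    [lambda s: False],
})
print(card.evaluate(sample=None))
# {'structure': 1.0, 'temporal': 1.0, 'causal': 0.0}
```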