Making video generators into practical world simulators by cutting cost and waste
This paper reviews how recent video-generation models can act as “world models” that simulate physical dynamics and cause-and-effect over time, and why their high computational cost must be reduced before they can see real use. The authors argue that video models already show emergent understanding of physics and can imagine future scenarios in compressed internal representations, but that without efficiency improvements these models remain too slow and expensive for real-time tasks such as driving, robotics, and interactive games.
The work is a systematic review rather than a new experiment. The authors introduce a three-part taxonomy of efficiency work: efficient modeling paradigms, efficient network architectures, and efficient inference algorithms. They survey the current families of models: diffusion-based methods (including latent diffusion, which operates in the compact latent space of a pre-trained variational autoencoder), flow-matching methods that cast generation as solving a continuous-time ordinary differential equation, and auto-regressive models that generate frames step by step. They also cover hybrid approaches and techniques such as diffusion distillation, which aims to speed up sampling.
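To make the flow-matching idea concrete, here is a minimal, self-contained sketch in PyTorch. It is an illustration under assumed choices (toy 2-D data, a tiny MLP velocity network, straight-line interpolation in the style of rectified flow), not any model from the survey: training regresses the velocity that carries noise to data, and sampling Euler-integrates the learned ODE.

```python
import torch
import torch.nn as nn

# Tiny velocity network v(x, t): input is a 2-D point plus time, output a 2-D velocity.
v = nn.Sequential(nn.Linear(3, 64), nn.SiLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(v.parameters(), lr=1e-3)

for step in range(2000):                       # training: regress the straight-line velocity
    x1 = torch.randn(256, 2) * 0.1 + 2.0       # toy "data" samples (assumed, for illustration)
    x0 = torch.randn(256, 2)                   # Gaussian noise samples
    t = torch.rand(256, 1)                     # uniform time in [0, 1]
    xt = (1 - t) * x0 + t * x1                 # linear interpolant x_t
    target = x1 - x0                           # conditional velocity dx_t/dt
    loss = ((v(torch.cat([xt, t], dim=1)) - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

@torch.no_grad()
def sample(n=16, steps=20):
    """Euler-integrate dx/dt = v(x, t) from noise at t=0 toward data at t=1."""
    x = torch.randn(n, 2)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((n, 1), i * dt)
        x = x + dt * v(torch.cat([x, t], dim=1))
    return x

print(sample().mean(dim=0))                    # should land near the toy data mean (~2, ~2)
```

The efficiency lever is visible in `sample`: generation cost scales directly with the number of integration steps, which is exactly what fast samplers and distillation try to shrink.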
At a high level, the paper explains how these approaches differ and why some are more efficient. Diffusion models generate by repeatedly denoising a noisy sample; they are powerful but slow unless the number of sampling steps is reduced or distilled away. Flow-matching methods learn a vector field that transports simple noise to data in continuous time. Auto-regressive models produce sequences conditionally and must manage a key-value (KV) cache to handle long videos, which otherwise grows without bound and exhausts memory. To reduce cost, efficient architectures reuse compressed representations (such as VAE latents), add hierarchical structure, store long context in memory modules, and apply cheaper attention mechanisms or position encodings.
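The KV-cache pressure is easy to quantify. The back-of-the-envelope sketch below assumes a hypothetical transformer shape and shows how cache size grows linearly with frame count, and how a sliding window (one common mitigation, in the spirit of the memory modules mentioned above) caps it; all sizes and the eviction policy are illustrative, not taken from any surveyed model.

```python
# Assumed transformer shape for a hypothetical autoregressive video model.
layers, heads, d_head = 24, 16, 64
tokens_per_frame = 256                  # assumed tokens per generated frame
bytes_per_el = 2                        # fp16 storage

# Each frame caches one K and one V tensor per layer and head.
kv_bytes_per_frame = 2 * layers * heads * tokens_per_frame * d_head * bytes_per_el

def cache_bytes(num_frames, window=None):
    """KV-cache size after num_frames, optionally capped by a sliding window."""
    kept = num_frames if window is None else min(num_frames, window)
    return kept * kv_bytes_per_frame

gib = 2**30
print(f"per frame: {kv_bytes_per_frame / 2**20:.0f} MiB")
print(f"unbounded, 1000 frames: {cache_bytes(1000) / gib:.1f} GiB")      # grows linearly
print(f"window=64, 1000 frames: {cache_bytes(1000, 64) / gib:.2f} GiB")  # capped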