Survey argues making video generators efficient is key to turning them into practical world simulators
This paper reviews recent work that treats video generation models as “world models” and focuses on one central problem: these models can in principle simulate physics and long-term cause-and-effect, but they are currently too slow and costly for practical use. The authors organize the field around efficiency. They propose a three-part taxonomy that groups methods by efficient modeling paradigms, efficient network architectures, and efficient inference algorithms. They argue that closing the efficiency gap is essential for real-time, interactive uses such as autonomous driving, embodied robots, and game simulation.
The authors conduct a systematic literature survey rather than new experiments: they collect and summarize techniques across the main video-generation approaches. These include diffusion-based methods (which produce video by iteratively “denoising” random input), flow-matching methods (which view generation as moving samples along a continuous path), and auto-regressive models (which predict frames step by step). On the architecture side they review designs such as variational autoencoders (VAE, a way to compress images into a smaller internal representation), hierarchical models, memory mechanisms for long context, and attention methods. For inference they cover practical techniques such as distillation (training a fast model to imitate a slow one), parallelism, caching, pruning, and quantization (making models use less memory and compute).
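The cost of iterative denoising can be sketched with a toy example. Everything here is invented for illustration (the `toy_denoiser`, the fixed `TARGET`, and the update rule are stand-ins, not the surveyed models' actual components); the point is only that sample quality improves with more steps, so latency grows linearly with step count, which is what distillation and fast samplers try to cut.

```python
import numpy as np

# Hypothetical "clean" sample the sampler should recover.
TARGET = np.array([1.0, -1.0])

def toy_denoiser(x, t):
    # Stand-in for a learned noise predictor: here it simply estimates
    # the noise as the gap between the current sample and the target.
    return x - TARGET

def sample(num_steps, step_size=0.1, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(2)          # start from pure Gaussian noise
    for t in range(num_steps, 0, -1):   # many small denoising steps
        x = x - step_size * toy_denoiser(x, t)
    return x

# Each extra step shrinks the error but adds a full "model" evaluation,
# which is why hundreds of steps make diffusion sampling slow.
coarse = sample(num_steps=10)
fine = sample(num_steps=200)
```

In this toy setup the error shrinks geometrically with the number of steps, while compute grows linearly; distillation-style methods aim to reach the quality of `fine` with the step budget of `coarse`.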
At a high level, the efficiency ideas reduce the work the model must do. Working in a compact latent space instead of full pixels cuts the data size. Memory mechanisms and long-context strategies let a model keep important history without storing every frame. Efficient attention and other architectural choices reduce the number of expensive operations. For sampling-heavy methods like diffusion models, distillation and improved samplers aim to cut the many iterative steps that cause latency. For stepwise auto-regressive models, careful handling of key-value caches helps avoid explosive memory use as sequences grow.
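The key-value cache issue can be made concrete with a minimal sketch. The class names and the sliding-window policy below are illustrative assumptions, not the survey's specific method: a naive cache keeps every past frame's keys and values, so memory grows linearly with sequence length, while a bounded cache evicts old entries to cap memory.

```python
from collections import deque

class NaiveKVCache:
    """Keeps one (key, value) entry per generated frame -- memory grows
    without bound as the sequence gets longer."""
    def __init__(self):
        self.entries = []

    def append(self, kv):
        self.entries.append(kv)

    def __len__(self):
        return len(self.entries)

class SlidingWindowKVCache:
    """Keeps only the most recent `window` entries; older keys/values
    are evicted automatically, so memory stays constant."""
    def __init__(self, window):
        self.entries = deque(maxlen=window)

    def append(self, kv):
        self.entries.append(kv)

    def __len__(self):
        return len(self.entries)

naive, windowed = NaiveKVCache(), SlidingWindowKVCache(window=16)
for step in range(1000):                  # generate 1000 frames
    kv = (f"k{step}", f"v{step}")         # placeholder keys/values
    naive.append(kv)
    windowed.append(kv)
# naive now holds 1000 entries; windowed stays at 16
```

Real systems combine eviction with smarter policies (keeping summary tokens or important frames), but the memory trade-off is the same: bounded context in exchange for discarding distant history.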