Lyra 2.0 makes large, explorable 3D scenes from one image by fixing two common video-generation failures
Lyra 2.0 is a method that starts from a single photo and lets a user explore a large, synthetic 3D world. The system first generates a camera-controlled video that simulates walking through the scene. It then lifts those generated frames into a 3D model ready for real-time rendering and simulation. The paper focuses on making these generated scenes stay consistent over long walks and large viewpoint changes.
The authors identify two main failure modes that break long explorations. The first is spatial forgetting: as the camera moves, regions seen earlier fall outside the short temporal window the video model can attend to, so when those places are revisited the model hallucinates different structures. The second is temporal drifting: because frames are generated one after another, small errors accumulate, and the scene's colors and shapes slowly drift away from what was generated at the start. Left unchecked, both problems break the illusion of a persistent 3D world.
To fix spatial forgetting, Lyra 2.0 keeps a lightweight 3D proxy for every generated frame. These proxies are used only to find and fetch relevant past frames and to build dense pixel correspondences to the current view; they never dictate the final appearance. In plain terms, geometry serves as an index for looking up the right past images, while the actual image synthesis still comes from a video diffusion model (a type of generative video network trained on large image and video collections), so the model can draw on its learned sense of how things should look.
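The paper does not publish this lookup step as code, but the idea of using per-frame geometry to retrieve high-overlap past frames can be sketched with a toy pinhole-camera model. Everything below is an illustrative assumption, not the authors' implementation: each remembered frame carries a small point cloud as its proxy, and we score frames by the fraction of their proxy points that project inside the current view, then retrieve the top-scoring ones.

```python
import numpy as np

def project(points, pose_w2c, K):
    """Project world-space points into an image with a pinhole camera.

    points:   (N, 3) world coordinates
    pose_w2c: (4, 4) world-to-camera transform
    K:        (3, 3) intrinsics
    Returns pixel coordinates and a mask of points in front of the camera.
    """
    pts_c = (pose_w2c[:3, :3] @ points.T + pose_w2c[:3, 3:]).T
    in_front = pts_c[:, 2] > 1e-6
    uv = (K @ pts_c.T).T
    uv = uv[:, :2] / uv[:, 2:3]  # perspective divide
    return uv, in_front

def overlap_score(proxy_points, pose_w2c, K, hw):
    """Fraction of a past frame's proxy points visible from the current camera."""
    uv, in_front = project(proxy_points, pose_w2c, K)
    h, w = hw
    visible = in_front & (uv[:, 0] >= 0) & (uv[:, 0] < w) \
                       & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    return visible.mean()

def retrieve(memory, pose_w2c, K, hw, k=2):
    """Return the ids of the k past frames whose proxies overlap the current view most."""
    scores = [overlap_score(m["points"], pose_w2c, K, hw) for m in memory]
    order = np.argsort(scores)[::-1][:k]
    return [memory[i]["frame_id"] for i in order], scores

# Toy usage: frame "A" lies in front of the current camera, frame "B" behind it.
K = np.array([[100., 0., 64.], [0., 100., 64.], [0., 0., 1.]])
pose = np.eye(4)  # current camera at the origin, looking down +z
memory = [
    {"frame_id": "A", "points": np.array([[0.0, 0.0, 5.0], [0.5, 0.2, 4.0]])},
    {"frame_id": "B", "points": np.array([[0.0, 0.0, -5.0], [0.1, 0.0, -4.0]])},
]
ids, scores = retrieve(memory, pose, K, (128, 128), k=1)
```

Here the retrieval returns frame "A", since all of its points project into the current image while frame "B" sits behind the camera. In the real system the retrieved frames would then be fed to the diffusion model as conditioning, with the projected proxy points supplying dense pixel correspondences.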
To reduce temporal drifting, the team trains the video model with a self-augmentation trick. During training they sometimes condition the model on its own recent denoised predictions instead of perfect ground-truth frames. This exposes the network to the kinds of small errors it will actually make at test time and teaches it to correct them instead of passing them on. Combined with retrieving high-overlap historical frames, this helps the model keep appearance and geometry stable over much longer camera paths.
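The conditioning swap at the heart of this trick resembles scheduled sampling: with some probability, each conditioning frame is replaced by the model's own earlier (imperfect) output rather than the ground-truth frame. The sketch below is a hedged illustration of that mechanism only; the function name, the per-frame Bernoulli choice, and the noisy "predictions" are all assumptions for demonstration, not the paper's training code.

```python
import numpy as np

def build_context(gt_frames, self_frames, p_self, rng):
    """Assemble the conditioning context for one training step.

    For each past frame, use the model's own earlier prediction with
    probability p_self, otherwise the ground-truth frame. Exposing the
    network to its own imperfect outputs teaches it to correct small
    errors instead of compounding them at test time.
    """
    ctx = [pred if rng.random() < p_self else gt
           for gt, pred in zip(gt_frames, self_frames)]
    return np.stack(ctx)

rng = np.random.default_rng(0)
# Toy "video": four 2x2 ground-truth frames.
gt = [np.full((2, 2), float(i)) for i in range(4)]
# Pretend model predictions carry the small errors it would make at test time.
pred = [f + rng.normal(0.0, 0.1, f.shape) for f in gt]

ctx_teacher = build_context(gt, pred, p_self=0.0, rng=rng)  # pure teacher forcing
ctx_self = build_context(gt, pred, p_self=1.0, rng=rng)     # fully self-conditioned
```

In an actual training run, `p_self` would typically be annealed or sampled per step, so the model sees a mix of clean and self-generated histories; the denoising loss is still computed against the ground-truth target frame either way.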