Lyra 2.0 makes long, explorable 3D worlds from a single photo by fixing two common failure modes
This paper describes Lyra 2.0, a system that starts from a single image and generates long, camera-controlled videos that can be lifted into explorable 3D scenes. The main idea is to combine powerful video synthesis models with a feed-forward 3D reconstruction pipeline, so the output is not just video but real-time-ready 3D Gaussians and surface meshes. The goal is persistent, navigable virtual environments rather than short, locally consistent clips.
The authors point out two key problems that break long walks through generated scenes. First, spatial forgetting: as the camera moves, earlier parts of the scene fall outside the model’s short attention window. When those places are revisited, the model often hallucinates inconsistent structures. Second, temporal drifting: autoregressive generation accumulates small errors step by step, which slowly distorts colors and geometry.
Lyra 2.0 tackles both problems. To prevent spatial forgetting, it keeps per-frame 3D geometry and uses that geometry only to route information: it retrieves the past frames relevant to the upcoming view and builds dense correspondences to that view. The actual pixel appearance is still synthesized by the video model, so the system avoids propagating rendering mistakes from intermediate geometry. To limit temporal drifting, the team trains with self-augmented histories: during training, the model is sometimes conditioned on its own recent one-step denoised predictions. This exposes the network to the kinds of errors it will see at inference time and teaches it to correct them rather than amplify them.
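The geometry-based routing can be illustrated with a toy sketch: score each stored frame by how much of its lifted geometry is visible from the next camera pose, then retrieve the top-scoring frames as conditioning context. Everything here (the function names, the frustum-visibility heuristic) is a hypothetical simplification for intuition, not Lyra 2.0's actual retrieval mechanism.

```python
import numpy as np

def visibility_score(points_world, K, world_to_cam, img_size):
    """Fraction of a past frame's 3D points that project inside the
    next camera's image with positive depth (toy relevance measure)."""
    h, w = img_size
    pts_h = np.concatenate([points_world, np.ones((len(points_world), 1))], axis=1)
    cam = (world_to_cam @ pts_h.T).T[:, :3]          # world -> camera space
    z = cam[:, 2]
    uv = (K @ cam.T).T
    uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None) # perspective divide
    inside = (z > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < w) \
                     & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    return inside.mean()

def retrieve_relevant(frame_points, K, world_to_cam, img_size, k=2):
    """Rank stored frames by how much of their geometry the next view
    sees; return indices of the top-k as retrieval candidates."""
    scores = [visibility_score(p, K, world_to_cam, img_size) for p in frame_points]
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
```

The key design point the paper emphasizes survives even in this sketch: geometry decides *which* frames feed the video model, while pixel appearance is left entirely to the generator.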
After generating long, consistent video trajectories, Lyra 2.0 lifts the frames into 3D using a feed-forward 3D Gaussian Splatting (3DGS) pipeline. The authors fine-tune that reconstruction model on their generated sequences so it learns to tolerate the small multi-view mismatches that diffusion videos can contain. The result is cleaner 3D Gaussians and meshes that are suitable for real-time rendering and simulation. Users can define arbitrary camera paths from the initial image and progressively expand the reconstructed environment.
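The lifting step starts from per-frame depth and camera pose. A minimal version of the unprojection that turns a depth map into one world-space point per pixel, the kind of positions a feed-forward 3DGS pipeline could initialize Gaussian centers from, is sketched below. This is a generic pinhole-camera illustration under assumed conventions (`K` is the 3×3 intrinsics, `cam_to_world` a 4×4 pose), not the paper's reconstruction network.

```python
import numpy as np

def unproject(depth, K, cam_to_world):
    """Lift a depth map to world-space 3D points: one point per pixel,
    usable as seed positions for per-pixel Gaussians."""
    h, w = depth.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    rays = pix @ np.linalg.inv(K).T          # camera-space ray directions
    pts_cam = rays * depth.reshape(-1, 1)    # scale each ray by its depth
    pts_h = np.concatenate([pts_cam, np.ones((len(pts_cam), 1))], axis=1)
    return (pts_h @ cam_to_world.T)[:, :3]   # homogeneous transform to world
```

Running this per frame along a user-defined camera path yields a growing world-space point set, which matches the article's description of progressively expanding the reconstructed environment.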