Thinking in Blender: using vision-language models to turn a single photo into an editable 3D Blender scene
Researchers asked whether general pretrained vision-language models (VLMs) can take one still image and recreate it as an editable 3D scene expressed as a runnable Blender program. They call their approach Staged Executable Inverse Graphics (SEIG). The goal is to produce a scene that can be re-rendered, relit, and manipulated, all from code that Blender can execute.
SEIG works by breaking the hard inverse graphics problem into stages. Instead of trying to solve everything at once, the system progressively refines separate scene factors: geometry (the shapes and positions of objects), materials (how surfaces look), composition (scene layout), and lighting. These refinements are made directly to executable Blender code, so the output is a program that recreates the scene rather than just a set of numeric parameters.
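To make the staged idea concrete, here is a minimal Python sketch of a loop that revises a Blender script one scene factor at a time. Only the four stage names come from the description above. The function names, the `revise` callback (a stand-in for a vision-language model call), the toy snippets, and the file name `photo.png` are assumptions made for illustration, not the authors' implementation. The sketch only builds the script text, so it runs without Blender installed.

```python
from typing import Callable, List

# Stage order follows the description above; everything else is hypothetical.
STAGES: List[str] = ["geometry", "materials", "composition", "lighting"]


def refine_in_stages(
    image_path: str,
    revise: Callable[[str, str, str], str],
    rounds_per_stage: int = 1,
) -> str:
    """Refine a Blender script one scene factor at a time.

    `revise(image_path, stage, program)` is a hypothetical stand-in for a
    VLM call that looks at the image and returns a revised program.
    """
    program = "import bpy\n"
    for stage in STAGES:
        for _ in range(rounds_per_stage):
            program = revise(image_path, stage, program)
    return program


def fake_revise(image_path: str, stage: str, program: str) -> str:
    """Toy stand-in for the VLM: appends one fixed Blender line per stage."""
    snippets = {
        "geometry": "bpy.ops.mesh.primitive_cube_add(size=2.0, location=(0.0, 0.0, 1.0))",
        "materials": "mat = bpy.data.materials.new(name='Surface')",
        "composition": "bpy.context.object.rotation_euler = (0.0, 0.0, 0.5)",
        "lighting": "bpy.ops.object.light_add(type='SUN', location=(4.0, -4.0, 6.0))",
    }
    return program + f"# stage: {stage}\n{snippets[stage]}\n"


# Build the script text; it would be executed inside Blender, not here.
script = refine_in_stages("photo.png", fake_revise)
print(script)
```

In a real system the callback would send the image and the current program to a model and return its revision. This summary does not describe the prompts, the number of rounds, or how a stage's output is checked, so the sketch leaves those out.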
A key point is that the team relies on off-the-shelf pretrained vision-language models to read the image and produce or revise Blender code. Vision-language models are models trained to relate images and text. SEIG deliberately does not use specialized 2D or 3D foundation models, differentiable renderers (tools that allow gradients to flow through rendering), or multi-view supervision (multiple camera views of the same scene). That design lets the method attempt executable inverse graphics with more general-purpose building blocks.
The authors evaluated SEIG on a variety of scenes using multiple reconstruction metrics. They measured fidelity at the pixel level, perceptual similarity (how images look to humans), and semantic fidelity (whether the same objects and relations are present). Their experiments found that reconstructing in stages (separating geometry, materials, composition, and lighting) substantially improved reconstruction fidelity compared with trying to do everything in one step.
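As an illustration of what a pixel-level fidelity measure looks like, the sketch below computes mean squared error (MSE) and peak signal-to-noise ratio (PSNR) between a reference image and a re-rendered one. This summary does not name the specific metrics the authors used, so MSE and PSNR are assumed here as common examples of the pixel-level category. The function names and the flat-list image representation are also assumptions.

```python
import math
from typing import Sequence


def mse(reference: Sequence[float], rendered: Sequence[float]) -> float:
    """Mean squared error between two images given as flat pixel sequences."""
    if len(reference) != len(rendered) or len(reference) == 0:
        raise ValueError("images must be non-empty and the same size")
    total = sum((r - s) ** 2 for r, s in zip(reference, rendered))
    return total / len(reference)


def psnr(
    reference: Sequence[float],
    rendered: Sequence[float],
    max_value: float = 1.0,
) -> float:
    """Peak signal-to-noise ratio in decibels; higher means closer images."""
    err = mse(reference, rendered)
    if err == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / err)


# Pixel intensities assumed to lie in [0, 1].
reference = [0.0, 0.5, 1.0, 0.25]
rendered = [0.1, 0.5, 0.9, 0.25]
print(psnr(reference, rendered))
```

Pixel-level scores like these penalize small shifts in position or lighting heavily, which is one reason evaluations of this kind also report perceptual and semantic measures.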