This paper presents T2Mo, a system that makes it easier to generate moving 3D shapes by combining short text descriptions with explicit 3D p
Researchers asked whether general pretrained vision-language models (VLMs) can take one still image and recreate it as an editable 3D scene
This paper introduces HyperBench, a software framework that standardizes how researchers test methods for hyperspectral super-resolution. Hy
This paper argues that many failures of vision-language models come not from weak thinking but from poor visual perception. The authors show
This paper argues that image generators should move beyond making realistic pictures to making visuals that understand structure, change ove
This paper introduces HERMES++, a unified “driving world model” that both understands a 3D driving scene and predicts how its geometry will
This paper presents Tuna-2, a multimodal AI model that works directly from raw pixels instead of relying on a separate, pretrained vision en
This paper describes Lyra 2.0, a system that starts from a single image and generates long, camera-controlled videos that can be lifted into
Lyra 2.0 is a method that starts from a single photo and lets a user explore a large, synthetic 3D world. The system first generates a camer
This paper tackles a common failure in agentic multimodal models: they call external tools too often, even when an answer could be found in