arXiv News

First-hand information for everyone




Today's Briefing

Tuesday, June 16, 2026
Computer Vision
Featured briefing

T2Mo: Controlling 3D object motion with simple paths and a text prompt

This paper presents T2Mo, a system that makes it easier to generate moving 3D shapes by combining short text descriptions with explicit 3D p

June 4, 2026
EN
2 min read

Latest Research

Computer Vision
June 2, 2026

Thinking in Blender: using vision-language models to turn a single photo into an editable 3D Blender scene

Researchers asked whether general pretrained vision-language models (VLMs) can take one still image and recreate it as an editable 3D scene

EN
2 min read
Computer Vision
May 24, 2026

HyperBench: a shared testbed to fairly and widely test hyperspectral super-resolution methods

This paper introduces HyperBench, a software framework that standardizes how researchers test methods for hyperspectral super-resolution. Hy

EN
2 min read
Natural Language Processing
May 20, 2026

Training vision-language models in stages fixes sight before thought and improves results

This paper argues that many failures of vision-language models come not from weak thinking but from poor visual perception. The authors show

EN
2 min read
Computer Vision
May 3, 2026

From pixels to world models: a five-level roadmap for intelligent visual generation

This paper argues that image generators should move beyond making realistic pictures to making visuals that understand structure, change ove

EN
2 min read
Computer Vision
May 3, 2026

HERMES++: A single model that reads and predicts 3D driving scenes

This paper introduces HERMES++, a unified “driving world model” that both understands a 3D driving scene and predicts how its geometry will

EN
2 min read
Computer Vision
April 28, 2026

Tuna-2 drops pretrained vision encoders and learns directly from pixels for image understanding and generation

This paper presents Tuna-2, a multimodal AI model that works directly from raw pixels instead of relying on a separate, pretrained vision en

EN
2 min read
Computer Vision
April 16, 2026

Lyra 2.0 makes long, explorable 3D worlds from a single photo by fixing two common failure modes

This paper describes Lyra 2.0, a system that starts from a single image and generates long, camera-controlled videos that can be lifted into

EN
2 min read
Computer Vision
April 15, 2026

Lyra 2.0 makes large, explorable 3D scenes from one image by fixing two common video-generation failures

Lyra 2.0 is a method that starts from a single photo and lets a user explore a large, synthetic 3D world. The system first generates a camer

EN
2 min read
Artificial Intelligence
April 10, 2026

New training method teaches multimodal agents when not to call tools — Metis cuts tool calls from 98% to 2%

This paper tackles a common failure in agentic multimodal models: they call external tools too often, even when an answer could be found in

EN
2 min read