HERMES++: A single model that reads and predicts 3D driving scenes
This paper introduces HERMES++, a unified “driving world model” that both understands a 3D driving scene and predicts how its geometry will evolve over time. In plain terms, the system aims to answer questions about what surrounds the car right now and to forecast how nearby objects, and the scene as a whole, will move in the near future. The authors build both capabilities into one framework rather than using separate models for understanding and for generation.
To do this, the team uses a Bird’s-Eye View (BEV) representation: the system converts multiple camera views into a top-down spatial map that preserves geometric relationships and is easier to feed into language-style processing. They add LLM-enhanced world queries, short learnable probes the language model uses to pull scene knowledge out of the BEV features. A Current-to-Future Link then conditions future geometry on the current semantic context, and a Textual Injection step lets text embeddings steer how the model generates future scenes. To keep predicted shapes realistic, they apply a Joint Geometric Optimization that combines explicit constraints on point clouds with implicit regularization in the model’s internal (latent) space.
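To make the Joint Geometric Optimization idea concrete, here is a minimal sketch of how such an objective is commonly assembled: an explicit geometric term (a symmetric Chamfer distance between predicted and ground-truth point clouds) plus an implicit latent-space regularizer. The function names, the mean-squared-error choice for the latent term, and the weighting factor `lam` are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def chamfer_distance(pred, gt):
    """Symmetric Chamfer distance between point clouds pred (N,3) and gt (M,3).
    Explicit geometric term: each point is penalized by its distance to the
    nearest point in the other cloud, averaged over both directions."""
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)  # (N, M) pairwise distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def joint_geometric_loss(pred_pts, gt_pts, pred_latent, tgt_latent, lam=0.1):
    """Hypothetical joint objective (names and weighting assumed):
    explicit point-cloud constraint + implicit latent regularization."""
    explicit = chamfer_distance(pred_pts, gt_pts)
    implicit = np.mean((pred_latent - tgt_latent) ** 2)  # pull latents toward a target
    return explicit + lam * implicit
```

In practice the two terms operate at different scales, so the balance weight (here `lam`) matters: the explicit term anchors the predicted geometry to observable structure, while the latent term smooths the model's internal representation of future scenes.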
The paper reports experiments on multiple benchmarks. HERMES++ outperforms specialist baselines on both tasks: it reduces error in 3-second future point-cloud generation by 8.2% compared with the leading method DriveX, and it improves scene-understanding scores (CIDEr metric) by 9.2% over the prior specialist Omni-Q on the OmniDrive-nuScenes dataset. Compared with the authors’ own earlier conference version, the new model cuts generation error by 13.7% thanks to the added latent regularization and other changes. The authors say they will release model code and checkpoints publicly.
Why this matters: autonomous vehicles need both clear semantic understanding of their surroundings (what objects are, where they are) and accurate short-term physical forecasts (how those objects will move). By putting understanding and generation in one model and letting language-style reasoning inform geometric prediction, HERMES++ aims to close a practical gap between interpretation and simulation. That could help downstream tasks that need both kinds of information, such as planning and collision avoidance.