LIBERO-Safety: a benchmark to test physical and semantic risks for vision‑language‑action robots
Researchers have introduced LIBERO‑Safety, a benchmark for checking whether robots controlled by vision‑language‑action (VLA) models, which take images and text instructions as input and output robot actions, behave safely in cluttered, changing environments. The benchmark covers two types of safety: physical safety (avoiding collisions and unsafe motions) and semantic safety (refusing or avoiding harmful or nonsensical instructions).
To create varied and realistic test cases, the team built a parametric scenario generator called the Unified Behavior Domain Definition Language (UBDDL). UBDDL produces many different scenes by randomizing visual and physical properties, and it organizes them into three difficulty tiers that separate semantic reasoning from physical constraints. The authors also built a keypose‑driven data pipeline that combines a few expert key poses with an optimization planner to synthesize many collision‑free motion demonstrations. Using this pipeline, they assembled 19,664 human‑screened, strictly collision‑free demonstrations across 40 tasks, with extensive domain randomization over thousands of scenes and hundreds of objects.
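The idea of a parametric scenario generator can be illustrated with a minimal Python sketch. This is not UBDDL syntax, which the summary does not show. The tier names, the `Scenario` fields, and the sampling ranges are all hypothetical, chosen only to show how a seeded generator can vary visual and physical parameters within each of three tiers.

```python
import random
from dataclasses import dataclass

# Hypothetical tier labels; the source only states that there are three tiers.
TIERS = ("tier_1", "tier_2", "tier_3")


@dataclass(frozen=True)
class Scenario:
    tier: str
    object_names: tuple
    light_intensity: float  # stand-in for visual randomization (assumed)
    friction: float         # stand-in for physical randomization (assumed)


def sample_scenario(tier, object_pool, rng, n_objects=3):
    """Draw one randomized scene description for the given tier."""
    if tier not in TIERS:
        raise ValueError(f"unknown tier: {tier}")
    # Sample distinct objects from the pool without replacement.
    objects = tuple(rng.sample(object_pool, n_objects))
    return Scenario(
        tier=tier,
        object_names=objects,
        light_intensity=rng.uniform(0.5, 1.5),
        friction=rng.uniform(0.4, 1.0),
    )


def generate_scenarios(object_pool, per_tier, seed=0):
    """Generate per_tier scenes for every tier, reproducibly from a seed."""
    rng = random.Random(seed)
    return [
        sample_scenario(tier, object_pool, rng)
        for tier in TIERS
        for _ in range(per_tier)
    ]
```

Using a private `random.Random(seed)` instance instead of the global generator keeps each generated test suite reproducible, which matters when different models must be compared on identical scenes.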
The paper uses this dataset and infrastructure for a broad evaluation: the authors tested eight representative VLA models and two embodied foundation models on the benchmark's safety tasks. A key finding is a generalization–safety tension: training on more varied data tends to produce safer motion trajectories, but overall task success remains limited. The two main remaining bottlenecks are sub‑optimal trajectory synthesis (the model plans poor or inefficient motions) and semantic misalignment (the model misunderstands instructions or fails to reject harmful ones).
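One way to see how safe motion and task success can come apart is to report them as separate rates per model. The sketch below assumes a hypothetical `Episode` record with `success` and `collided` flags. The metric names are illustrative and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    model: str       # hypothetical model identifier
    success: bool    # task goal reached
    collided: bool   # any collision occurred during the rollout


def summarize(episodes):
    """Return per-model success, collision-free, and safe-success rates."""
    by_model = {}
    for ep in episodes:
        by_model.setdefault(ep.model, []).append(ep)

    summary = {}
    for model, eps in by_model.items():
        n = len(eps)
        summary[model] = {
            "success_rate": sum(e.success for e in eps) / n,
            "collision_free_rate": sum(not e.collided for e in eps) / n,
            # Episodes that both finish the task and avoid every collision.
            "safe_success_rate": sum(
                e.success and not e.collided for e in eps
            ) / n,
        }
    return summary
```

Under these assumptions, a model with a high collision-free rate but a low success rate would show the pattern described above: safe trajectories without reliable task completion.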
This work matters because real‑world robot deployment needs both reliable motion and correct understanding of instructions. Existing benchmarks often use static, simple setups or slow human teleoperation and so miss many safety risks. LIBERO‑Safety supplies a scalable way to generate safety test cases, a large collision‑free dataset, and a structured set of failure modes. Those resources can help researchers and engineers find when a VLA system might behave unsafely and direct improvements in planning and language grounding.