Zero-shot system turns multi-view photos and language instructions into 3D plans for dexterous robot tasks
This paper describes a system that can follow natural language instructions to perform long, multi-step manipulation tasks without any task-specific training. “Zero-shot” here means the system is not trained end-to-end on the target tasks. Instead, it uses a pre-trained vision-language model to link words in an instruction to points and actions in a scene captured by several calibrated RGB (color) cameras.
The researchers feed multiple calibrated camera views into a vision-language model (VLM) to get reference-frame task grounding and simple 2D keypoints, that is, image pixels that mark important places such as grasp or place locations. They then “lift” those 2D keypoints into 3D by combining information from the different camera views. Part of this lifting is triangulation, which finds the 3D point whose projections best match the 2D locations seen by the different cameras. A second step, called reference-view ray voting, searches along the camera ray from a chosen view for candidate points that are consistent across nearby views.
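The two lifting steps can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes pinhole cameras with the convention x_cam = R @ X_world + t, uses standard linear (DLT) triangulation, and treats ray voting as a simple depth search that scores each candidate by its reprojection error in the other views. The function names, the depth grid, and the scoring rule are all assumptions made for illustration.

```python
import numpy as np

def project(P, X):
    """Project a 3D point X (3,) with a 3x4 camera matrix P; returns a pixel (2,)."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def triangulate(Ps, pixels):
    """Linear (DLT) triangulation: least-squares 3D point from two or more views."""
    rows = []
    for P, (u, v) in zip(Ps, pixels):
        # Each view contributes two linear constraints on the homogeneous point.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    _, _, vt = np.linalg.svd(np.asarray(rows))
    X = vt[-1]                      # null-space direction of the stacked system
    return X[:3] / X[3]

def ray_vote(K, R, t, pixel, other_Ps, other_pixels, depths):
    """Hypothetical ray voting: sample points along the reference camera's ray
    through `pixel` and keep the one that reprojects closest to the keypoints
    observed in the other views."""
    center = -R.T @ t               # reference camera center in world coordinates
    direction = R.T @ np.linalg.inv(K) @ np.array([pixel[0], pixel[1], 1.0])
    direction /= np.linalg.norm(direction)
    best, best_err = None, np.inf
    for d in depths:
        X = center + d * direction
        err = sum(np.linalg.norm(project(P, X) - np.asarray(px))
                  for P, px in zip(other_Ps, other_pixels))
        if err < best_err:
            best, best_err = X, err
    return best, best_err
```

With noise-free keypoints both functions recover the same point; with noisy keypoints, the ray search keeps the result pinned to the pixel chosen in the reference view, which is one plausible reason to add such a step on top of plain triangulation.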
Once the system has 3D keypoints, it can plan concrete actions. For pick-and-place, it uses the 3D grasp points directly. For tool-use tasks, it first infers a skill category, retrieves a matching object-centered atomic action (a small stored action primitive), and aligns that action’s stored 6D tool trajectory (six degrees of freedom: position and orientation over time) to the scene. For dexterous hand execution, the system expands a lifted grasp keypoint into a grasp affordance region (an area where a successful grasp is likely) and uses a motion generator to produce feasible arm-and-hand motion pairs.
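The summary does not say how a stored trajectory is aligned to the scene. One common approach, shown here purely as an assumption, is to fit a rigid transform between keypoints stored with the atomic action and the matching lifted keypoints in the scene (the Kabsch algorithm), then apply that transform to every stored tool pose. The function names and the use of 4x4 homogeneous pose matrices are illustrative choices, and the fit needs at least three non-collinear keypoint pairs.

```python
import numpy as np

def kabsch(src, dst):
    """Rigid 4x4 transform minimizing sum ||R @ src_i + t - dst_i||^2
    over matched 3D points (rows of src and dst)."""
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    cs, cd = src.mean(axis=0), dst.mean(axis=0)
    H = (src - cs).T @ (dst - cd)           # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = cd - R @ cs
    return T

def align_trajectory(T_scene_from_stored, poses):
    """Map each stored 6D tool pose (4x4 matrix) into the scene frame."""
    return [T_scene_from_stored @ T for T in poses]
```

Because the stored action is object-centered, a single transform per object is enough to carry the whole trajectory over; no per-waypoint re-estimation is needed in this sketch.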
The authors tested the system in the real world and report better 3D grounding accuracy and execution reliability than both single-view color-plus-depth (RGB-D) grounding and fine-tuned vision-language-action (VLA) baselines. They also show that the method can run closed-loop: it checks task status during execution and can replan, which helps it complete longer, multi-step tasks on objects and tools it has not seen before.
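The closed-loop behavior can be pictured as a small control loop. The sketch below is a generic illustration, not the authors' code: `plan_fn`, `execute_fn`, `check_fn`, and the `max_replans` limit are hypothetical stand-ins for the planner, the robot controller, and the status check described above.

```python
def run_closed_loop(plan_fn, execute_fn, check_fn, max_replans=3):
    """Execute planned steps one at a time, checking status after each.

    plan_fn(done)  -> list of remaining steps, given the steps completed so far
    execute_fn(s)  -> runs step s on the robot (hypothetical controller call)
    check_fn(s)    -> True if step s succeeded (hypothetical status check)
    Returns (success, completed_steps).
    """
    done = []
    replans = 0
    steps = list(plan_fn(done))
    while steps:
        step = steps.pop(0)
        execute_fn(step)
        if check_fn(step):
            done.append(step)
            continue
        if replans >= max_replans:
            return False, done          # give up after too many failed attempts
        replans += 1
        steps = list(plan_fn(done))     # replan from the current state
    return True, done
```

Replanning from the list of completed steps, rather than retrying the failed step blindly, lets the planner change the remaining sequence if the scene has changed.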