New training method teaches multimodal agents when not to call tools — Metis cuts tool calls from 98% to 2%
This paper tackles a common failure in agentic multimodal models: they call external tools too often, even when an answer could be found in the image or text they already have. The researchers call this “blind tool invocation.” It slows systems down and adds noisy information that can make reasoning worse. The main idea here is simple: train the agent to be selective about tool use, not just to be good at the task.
The authors diagnose why existing reinforcement learning (RL) methods struggle. Typical approaches fold task accuracy and a penalty for tool use into a single reward signal, which creates a dilemma. If the penalty is strong, the agent becomes too shy and stops using tools even when they are truly needed. If the penalty is weak, it is swamped by the much larger variance of the accuracy signal during advantage estimation and has no practical effect. In short, a single scalar reward mixes the two goals in a way that prevents the agent from learning when to abstain from calling tools.
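To see why a weak penalty disappears, consider a group-normalized advantage of the kind used in GRPO-style RL (an assumption here; the paper does not specify the exact estimator). In this toy sketch, subtracting a small tool-use penalty barely shifts the advantages, which are dominated by the 0/1 accuracy signal:

```python
def group_advantages(rewards):
    # Group-normalized advantage: (r - mean) / std over one rollout group.
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Hypothetical group of 8 rollouts: binary accuracy plus a flag for
# whether the rollout called a tool.
accuracy  = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0]
used_tool = [1,   1,   0,   1,   1,   0,   1,   1]
penalty = 0.05  # small tool-use cost folded into the scalar reward

mixed = [a - penalty * t for a, t in zip(accuracy, used_tool)]
adv_acc = group_advantages(accuracy)
adv_mixed = group_advantages(mixed)

# The penalty moves each advantage by a few hundredths, while the gap
# between correct and incorrect rollouts is on the order of 2.
shift = max(abs(p - m) for p, m in zip(adv_acc, adv_mixed))
print(f"max advantage shift from the penalty: {shift:.3f}")
```

With numbers like these, the gradient the agent sees is almost entirely about correctness; the tool-use cost is statistical noise, matching the paper's diagnosis.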
To fix this, the paper introduces Hierarchical Decoupled Policy Optimization (HDPO). Instead of mixing accuracy and efficiency into one reward, HDPO keeps two separate learning channels. One channel optimizes accuracy across all attempts. The other enforces tool parsimony only on trajectories that are already correct, using conditional advantage estimation (computing how much better an action was than expected, but only over correct runs). The separation yields a natural curriculum: the agent first learns to be correct, then learns to be economical with external calls. The authors also clean their training data to remove fake or misleading environment behavior, so the efficiency signal is trustworthy.
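The decoupling can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the function names, the zero-fill for incorrect rollouts, and the additive combination with weight `beta` are all assumptions.

```python
def normalized_advantages(values):
    # Standard group-normalized advantage over a list of scalar rewards.
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / (std + 1e-8) for v in values]

def hdpo_advantages(correct, tool_calls, beta=0.3):
    # Channel 1: accuracy advantage over ALL rollouts in the group.
    acc_adv = normalized_advantages([float(c) for c in correct])

    # Channel 2: tool parsimony, computed ONLY among correct rollouts
    # (fewer tool calls -> higher reward). Incorrect rollouts receive
    # no efficiency signal at all -- this is the decoupling.
    idx = [i for i, c in enumerate(correct) if c]
    eff_adv = [0.0] * len(correct)
    if len(idx) > 1:
        eff = normalized_advantages([-tool_calls[i] for i in idx])
        for i, a in zip(idx, eff):
            eff_adv[i] = a

    # Combining the two channels additively is an assumption here.
    return [a + beta * e for a, e in zip(acc_adv, eff_adv)]

correct    = [True, True, False, True, False, True]
tool_calls = [3,    0,    2,     1,    0,     0]
adv = hdpo_advantages(correct, tool_calls)
# Correct rollouts with zero tool calls rank highest; incorrect rollouts
# rank lowest regardless of how few tools they used.
```

Note the curriculum effect in the toy numbers: being correct dominates the advantage, and tool thrift only reorders rollouts that already got the answer right.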
The authors train a model called Metis with HDPO and report large gains. According to the paper, Metis cuts tool invocations by orders of magnitude while also improving reasoning accuracy. In one highlighted comparison, a prior method called tools on 98% of tasks, whereas Metis called tools on only 2% of tasks with better overall performance. The paper reports state-of-the-art results across several multimodal reasoning benchmarks.