Agentic multimodal models can interact with external environments, but they are hampered by a critical meta-cognitive deficit: they cannot judiciously decide when to rely on internal knowledge and when to query external utilities. The result is often reflexive, inefficient tool invocation, which adds significant latency and introduces reasoning errors. Existing reinforcement learning methods struggle here because they face a trade-off: a heavy penalty on tool use inhibits necessary actions, while a mild penalty is lost in the noise of the accuracy reward.
Reframing Tool Efficiency from Scalar to Conditional
The proposed HDPO framework recasts tool efficiency from a competing scalar objective into a strictly conditional one. By eliminating reward scalarization, HDPO establishes two independent optimization pathways: one maximizes task correctness, and the other enforces execution economy, but only within successful reasoning trajectories. Because the efficiency advantage is estimated conditionally on success, efficiency is pursued without compromising accuracy.
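To make the idea of a conditional efficiency signal concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the method from the preprint: the group-wise mean/standard-deviation normalization, the use of a raw tool-call count as the efficiency measure, and the names `group_normalize` and `decoupled_advantages` are all hypothetical choices made for this example. The only property taken from the text above is that the accuracy signal is computed over all rollouts, while the efficiency signal is computed only among the successful ones.

```python
from statistics import mean, pstdev


def group_normalize(values, eps=1e-8):
    """Center and scale values within one group of rollouts (assumed scheme)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def decoupled_advantages(correct, tool_calls):
    """Return (accuracy_adv, efficiency_adv) for one group of rollouts.

    correct:    list of bool, whether each rollout solved the task
    tool_calls: list of int, number of tool invocations per rollout
    """
    # Pathway 1: accuracy advantage, computed over every rollout.
    acc_adv = group_normalize([1.0 if c else 0.0 for c in correct])

    # Pathway 2: efficiency advantage, computed only among correct rollouts.
    # Failed rollouts receive exactly zero efficiency signal, so using fewer
    # tools can never compensate for a wrong answer.
    eff_adv = [0.0] * len(correct)
    ok = [i for i, c in enumerate(correct) if c]
    if len(ok) >= 2:
        # Fewer tool calls is better, so negate the count before normalizing.
        normed = group_normalize([-float(tool_calls[i]) for i in ok])
        for i, a in zip(ok, normed):
            eff_adv[i] = a
    return acc_adv, eff_adv


# Hypothetical group of four rollouts for the same prompt.
correct = [True, True, False, True]
tool_calls = [0, 4, 9, 2]
acc_adv, eff_adv = decoupled_advantages(correct, tool_calls)
```

In this sketch the failed rollout (index 2) gets a negative accuracy advantage and a zero efficiency advantage, while the three correct rollouts are ranked against each other by tool usage. How the two signals are weighted and applied in the policy update is not specified in this summary, so the sketch stops at the advantage computation.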
HDPO: Inducing a Cognitive Curriculum for Self-Reliance
This decoupled architecture naturally guides agent development through a cognitive curriculum: agents must first master task resolution using their internal capabilities before progressively refining their reliance on external tools. As detailed in the arXiv preprint, the resulting model, Metis, reduces tool invocations by orders of magnitude while simultaneously improving reasoning accuracy. The result addresses a core bottleneck in the development of sophisticated agentic multimodal models.