Multimodal Large Language Models (MLLMs) struggle to make effective use of external vision tools, often failing to turn raw, pixel-level outputs into information they can reason over. The disconnect stems from a mismatch between the dense, visual form of tool outputs and the language-native architecture of LLMs, which leads to weaker perception and over-reliance on linguistic priors. According to the researchers, the core challenge is not a lack of more sophisticated tools or larger models, but how the information these tools provide is represented. The work, detailed on arXiv, proposes a solution aimed at that representation problem.
Reimagining Tool Output: From Pixels to Programs
The core innovation is Perception Programs (P$^2$), a training-free, model-agnostic method that changes how MLLMs consume vision tool outputs. Instead of feeding the model raw, high-dimensional data, P$^2$ rewrites those outputs into compact, structured, language-native summaries. The MLLM can then parse and reason over the summarized information directly, which aligns tool-generated cues with the LLM's strength in language processing.
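To make the idea concrete, here is a minimal sketch of rewriting a dense tool output as text. It is not the format used in the paper: the function name `summarize_depth`, the coarse-grid layout, the line syntax, and the "larger is farther" depth convention are all assumptions chosen for illustration.

```python
from statistics import mean


def summarize_depth(depth, points, grid=2):
    """Rewrite a dense depth map as a short text summary (hypothetical format).

    depth  : 2D list of floats; larger = farther (an assumption of this sketch)
    points : dict mapping a label to a (row, col) pixel of interest
    grid   : number of cells per side in the coarse summary
    """
    rows, cols = len(depth), len(depth[0])
    lines = [f"depth_map size={rows}x{cols} convention=larger_is_farther"]

    # Coarse grid: one mean depth per cell instead of rows * cols raw values.
    for gr in range(grid):
        for gc in range(grid):
            r0, r1 = gr * rows // grid, (gr + 1) * rows // grid
            c0, c1 = gc * cols // grid, (gc + 1) * cols // grid
            cell = [depth[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            lines.append(f"cell[{gr},{gc}] mean_depth={mean(cell):.2f}")

    # Query points: explicit values plus an ordering the model can read off.
    for name, (r, c) in points.items():
        lines.append(f"point {name} at (row={r}, col={c}) depth={depth[r][c]:.2f}")
    ordered = sorted(points, key=lambda n: depth[points[n][0]][points[n][1]])
    lines.append("nearest_to_farthest: " + " < ".join(ordered))
    return "\n".join(lines)


# Toy 4x4 depth map standing in for a real depth-estimation tool output.
depth = [
    [1.0, 1.0, 4.0, 4.0],
    [1.0, 1.0, 4.0, 4.0],
    [2.0, 2.0, 8.0, 8.0],
    [2.0, 2.0, 8.0, 8.0],
]
summary = summarize_depth(depth, {"A": (0, 0), "B": (3, 3)})
print(summary)
```

The resulting text would be placed in the prompt in place of, or alongside, the raw depth image, so a relative-depth question about points A and B can be answered by reading the last line rather than by interpreting pixels.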
Unlocking State-of-the-Art Perception Capabilities
The impact of Perception Programs (P$^2$) is substantial. Across six perception-centric tasks in the BLINK benchmark, P$^2$ consistently improved on both baseline models and raw tool-augmented approaches. With GPT-5 Mini, P$^2$ raised accuracy from 41.35% to 86.47% on multi-view reasoning and from 52.42% to 81.45% on relative depth, for an average gain of 22% across all tested tasks. These results set a new state of the art on the evaluated tasks. P$^2$ was also effective on smaller MLLMs such as InternVL3.5-4B and Qwen3VL-4B, yielding absolute gains of 15-40%. All of this was achieved without additional training or changes to the base models, and it surpassed established agentic, supervised, and reinforcement learning-based tool-use methods.
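For reference, the two GPT-5 Mini improvements quoted above can be expressed as absolute gains in percentage points. The short calculation below uses only those four figures. The 22% average spans all six tasks and cannot be recomputed from these two alone.

```python
# Accuracies (%) quoted above for GPT-5 Mini, without and with P^2.
reported = {
    "multi-view reasoning": (41.35, 86.47),
    "relative depth": (52.42, 81.45),
}

# Absolute gain = accuracy with P^2 minus baseline accuracy.
gains = {
    task: round(after - before, 2)
    for task, (before, after) in reported.items()
}

for task, gain in gains.items():
    print(f"{task}: +{gain:.2f} percentage points")
```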