Bridging Vision Tools and LLMs with P2

Perception Programs (P2) transforms raw vision tool outputs into structured summaries, dramatically enhancing MLLM reasoning without retraining.

Figure: The Perception Programs (P2) workflow, in which raw vision tool output is transformed into structured summaries for MLLM consumption.

Multimodal Large Language Models (MLLMs) struggle to make effective use of external vision tools, often failing to translate raw, pixel-level outputs into actionable insights. The disconnect stems from a mismatch between the dense, visual nature of tool outputs and the language-native architecture of LLMs, which leads to weak perception and over-reliance on linguistic priors. According to the researchers, whose work is detailed on arXiv, the core challenge is not a lack of more sophisticated tools or larger models but how the information these tools provide is represented.

Reimagining Tool Output: From Pixels to Programs

The core innovation is Perception Programs (P2), a training-free, model-agnostic method that changes how MLLMs consume vision tool outputs. Instead of feeding the model raw, high-dimensional data, P2 rewrites these outputs into compact, structured, language-native summaries. The MLLM can then parse and reason over the summarized information directly, which aligns tool-generated cues with the LLM's strength in language processing.
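
To make the idea concrete, here is a minimal sketch of what rewriting a dense tool output into a language-native summary could look like for a depth map. The article does not describe the actual P2 format, so everything here is an illustrative assumption: the `Point` type, the `depth_to_program` helper, the summary layout, and the convention that smaller depth values mean closer to the camera.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Point:
    """A labelled query location on the image (hypothetical type)."""
    label: str
    row: int
    col: int


def patch_median(depth, row, col, radius=1):
    """Median depth in a small window around (row, col), clipped to the map."""
    rows, cols = len(depth), len(depth[0])
    values = [
        depth[r][c]
        for r in range(max(0, row - radius), min(rows, row + radius + 1))
        for c in range(max(0, col - radius), min(cols, col + radius + 1))
    ]
    return median(values)


def depth_to_program(depth, points, radius=1):
    """Rewrite a dense depth map into a few text lines an MLLM can read.

    Assumption: smaller depth values mean closer to the camera.
    """
    readings = {p.label: patch_median(depth, p.row, p.col, radius) for p in points}
    ordered = sorted(readings, key=readings.get)  # nearest first
    lines = [f"tool: depth_estimator  map_size: {len(depth)}x{len(depth[0])}"]
    for p in points:
        lines.append(
            f"point {p.label} at (row={p.row}, col={p.col}): "
            f"depth={readings[p.label]:.2f}"
        )
    lines.append("order_near_to_far: " + " < ".join(ordered))
    return "\n".join(lines)


# Toy 4x4 depth map: left half near (1.0), right half far (5.0).
depth_map = [[1.0, 1.0, 5.0, 5.0] for _ in range(4)]
summary = depth_to_program(depth_map, [Point("A", 1, 0), Point("B", 1, 3)])
print(summary)
```

In this sketch, the summary text would be placed in the prompt in place of the raw depth map, so the model reads a short ordering statement rather than interpreting pixel values.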

Unlocking State-of-the-Art Perception Capabilities

The reported impact is substantial. Across six perception-centric tasks in the BLINK benchmark, P2 consistently improved on both baseline models and raw tool-augmented approaches. With GPT-5 Mini, P2 raised accuracy from 41.35% to 86.47% on multi-view reasoning and from 52.42% to 81.45% on relative depth, with an average gain of 22% across all tested tasks. These results set a new state of the art on these tasks. P2 was also effective on smaller MLLMs such as InternVL3.5-4B and Qwen3VL-4B, yielding absolute gains of 15-40%. These gains required no additional training or modification of the base models, and they surpass established agentic, supervised, and reinforcement learning-based tool-use methods.
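
The two GPT-5 Mini figures quoted above can be read either as absolute gains (percentage points) or as relative gains over the baseline. The short sketch below computes both from the reported accuracies only; the `gains` helper is a hypothetical name, and the article does not say which reading the 22% average uses.

```python
# (task, baseline accuracy %, accuracy with P2 %) as reported for GPT-5 Mini
reported = [
    ("multi-view reasoning", 41.35, 86.47),
    ("relative depth", 52.42, 81.45),
]


def gains(baseline, with_p2):
    """Return (absolute gain in points, relative gain in % of baseline)."""
    absolute = round(with_p2 - baseline, 2)
    relative = round(100 * (with_p2 - baseline) / baseline, 1)
    return absolute, relative


for task, base, p2 in reported:
    absolute, relative = gains(base, p2)
    print(f"{task}: +{absolute} points absolute, +{relative}% relative")
```

The absolute gains work out to 45.12 points for multi-view reasoning and 29.03 points for relative depth.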

© 2026 StartupHub.ai. All rights reserved.