Agentic Vision Gemini 3 Flash: Code Execution Solves Visual Hallucination

Google DeepMind has introduced Agentic Vision in Gemini 3 Flash, a capability that changes how the model processes visual information. Instead of relying on a single, static glance, the system treats image understanding as an active investigation grounded in code execution. Integrating deterministic tools into a probabilistic vision model is a crucial step toward verifiable AI output.

Agentic Vision is built around a Think, Act, Observe loop. The model first formulates a multi-step plan (Think), then generates and runs Python code that manipulates or analyzes the image, for example by cropping fine details or running calculations (Act). The transformed result is fed back into the context window so the model can inspect it before continuing (Observe). According to the announcement, this approach delivers a consistent 5-10% quality boost across most vision benchmarks, directly addressing the common failure point of missing fine-grained details. The shift replaces probabilistic guessing with verifiable, step-by-step execution.
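The loop can be pictured with a toy sketch. Everything here is an illustrative assumption rather than Google's implementation: the image is a plain grid of numbers, `think` is a hard-coded stand-in for the model's planning step, and `Context`, `crop`, and `run_loop` are hypothetical names.

```python
from dataclasses import dataclass, field

Image = list[list[int]]  # toy grayscale image: rows of pixel values


@dataclass
class Context:
    """Everything the (hypothetical) model can see so far."""
    image: Image
    observations: list[str] = field(default_factory=list)


def crop(image: Image, top: int, left: int, height: int, width: int) -> Image:
    """Act: a deterministic image operation executed outside the model."""
    return [row[left:left + width] for row in image[top:top + height]]


def think(ctx: Context):
    """Think: stand-in planner; a real model would emit Python code here.

    Returns a (top, left, height, width) box to inspect next,
    or None once it has gathered enough evidence.
    """
    if not ctx.observations:
        h, w = len(ctx.image), len(ctx.image[0])
        # Plan: zoom into the bottom-right quadrant.
        return (h // 2, w // 2, h - h // 2, w - w // 2)
    return None


def run_loop(image: Image, max_steps: int = 5) -> Context:
    ctx = Context(image=image)
    for _ in range(max_steps):
        action = think(ctx)                      # Think
        if action is None:
            break
        patch = crop(ctx.image, *action)         # Act
        brightest = max(max(row) for row in patch)
        # Observe: the result goes back into the context for the next step.
        ctx.observations.append(f"crop{action}: brightest pixel = {brightest}")
    return ctx
```

The point of the design is that each observation comes from executed code, so every step of the final answer can be traced back to a concrete operation on the image.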

The End of Visual Hallucination

The most immediate impact of Agentic Vision is in enterprise applications that demand high accuracy and iterative inspection. For tasks like validating complex building plans, the model generates code to iteratively crop and analyze specific patches, visually grounding its reasoning against compliance standards. The PlanCheckSolver example demonstrates how this iterative inspection of high-resolution inputs improves accuracy in real-world scenarios. The capability turns the model from a passive descriptor into an active inspector and leaves an audit trail for high-stakes visual analysis.
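As a rough illustration of the patch-by-patch inspection described above, the sketch below computes crop boxes that tile a large image. It is an assumption about what such generated code might look like, not the code from the PlanCheckSolver example; `tile_boxes` and its parameters are hypothetical names.

```python
def tile_boxes(width, height, tile, overlap=0):
    """Return (left, top, right, bottom) boxes that cover a width x height image.

    Neighbouring boxes share `overlap` pixels so details that straddle
    a tile boundary still appear whole in at least one patch.
    """
    if tile <= 0 or not 0 <= overlap < tile:
        raise ValueError("need tile > 0 and 0 <= overlap < tile")
    step = tile - overlap
    boxes = []
    top = 0
    while True:
        bottom = min(top + tile, height)
        left = 0
        while True:
            right = min(left + tile, width)
            boxes.append((left, top, right, bottom))
            if right == width:   # reached the right edge of this row
                break
            left += step
        if bottom == height:     # reached the bottom edge of the image
            break
        top += step
    return boxes
```

Each box could then be passed to an image library's crop call and analyzed in turn, and the list of visited boxes doubles as a record of which regions were actually inspected.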

The ability to annotate images acts as a critical "visual scratchpad," allowing the model to draw bounding boxes and labels directly onto the canvas to confirm its understanding before generating a final answer. Furthermore, Agentic Vision sidesteps the notorious problem of visual arithmetic hallucination by offloading complex calculations to a deterministic Python environment. By identifying raw data in charts or tables and writing code to normalize and plot it using Matplotlib, the model gets arithmetic that is computed rather than estimated, which a standard LLM working from the image alone cannot guarantee. This move from probabilistic inference to deterministic computation is essential for trust.
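A minimal sketch of the offloading idea, assuming the model has already read a few labeled values off a chart: the normalization is done by code rather than by the model's own estimate. The function name, the labels, and the numbers are made up for illustration.

```python
def normalize_to_baseline(values: dict[str, float], baseline: str) -> dict[str, float]:
    """Express every value relative to the chosen baseline entry (baseline = 1.0)."""
    base = values[baseline]
    if base == 0:
        raise ValueError("baseline value must be non-zero")
    return {label: value / base for label, value in values.items()}


# Hypothetical values extracted from a bar chart.
extracted = {"model_a": 50.0, "model_b": 100.0, "model_c": 125.0}
relative = normalize_to_baseline(extracted, baseline="model_b")
```

The resulting dictionary could then be handed to a plotting call such as Matplotlib's `pyplot.bar` to produce the normalized chart, so both the numbers and the figure come from executed code.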

Agentic Vision represents a necessary evolution in multimodal AI, prioritizing reliability and verifiability over sheer scale. While Google notes that some behaviors still require explicit prompting, the roadmap suggests a move toward fully implicit, code-driven actions and the integration of external tools like web search. This technology signals that the next frontier in AI is not just better perception, but better, more transparent reasoning through tool use, making Gemini 3 Flash a serious contender for complex, mission-critical visual tasks.
