Browser Agents Need Better Eyes, Not Models

Kushan Raj of ARK argues that browser agents need better visual understanding ('eyes') rather than just more powerful AI models to navigate the web efficiently.

Kushan Raj of ARK discussing the importance of visual understanding for AI browser agents. Image credit: AI Engineer

Kushan Raj of ARK argues that the main limitation of today's browser agents is not a lack of sophisticated AI models but a weakness in how they 'see' and interpret the web environment. In a live demonstration, Raj shows that even advanced models falter on complex websites with pop-ups, interactive elements, and unexpected user flows. His core claim is that browser agents need better 'eyes', meaning a more robust understanding of visual context and page structure, before they can perform reliably.

Browser Agents Need Better Eyes, Not Models - AI Engineer

The Challenge of Browser Navigation for AI Agents

Raj presents a 'Browser Navigation Challenge' designed to test the capabilities of AI agents. The challenge consists of a series of steps, many of them obscured by pop-up messages, quizzes, and other distractions, and the agent must identify and interact with the correct elements at each step. The agent in the demonstration found this difficult: even with a relatively powerful model, it took a long time to complete simple actions such as clicking a button.


This inefficiency, Raj explains, stems from the agent's difficulty in parsing the visual information and understanding the context of the webpage. Elements like cookie consent pop-ups, hidden buttons, and dynamic content changes can easily confuse an agent that relies solely on its model's predictive capabilities without a strong grasp of the visual presentation.

Rethinking Agent Design: From Models to Perception

The central thesis is that work on browser agents should shift from upgrading the underlying AI model to improving what the agent perceives. Instead of building larger and more complex models, Raj suggests, developers should concentrate on giving agents a clearer and more complete representation of the page's structure and visual elements.

He highlights how a more efficient representation of web content, such as converting the DOM into markdown, can drastically reduce the number of tokens an agent has to process for each page. This compression lets the agent grasp the essential information more quickly and with less computational overhead. Raj shows that a markdown representation of a simple webpage can come to around 1,100 tokens, whereas the full DOM might require 20,000 tokens or more, a reduction of roughly 18-fold.
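The DOM-to-markdown idea can be illustrated with a minimal sketch. This is not Raj's implementation, which the talk summary does not describe; it uses only Python's standard `html.parser` module. The class name `MarkdownExtractor`, the helper functions, the sample page, and the rule of thumb of about four characters per token are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class MarkdownExtractor(HTMLParser):
    """Hypothetical helper: keep headings, text, links, buttons, and inputs."""

    SKIP = {"script", "style", "noscript"}          # content an agent never needs
    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
    BLOCKS = {"p", "li", "div", "button", "a", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.lines = []
        self._skip_depth = 0
        self._text = []
        self._prefix = ""
        self._suffix = ""

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.HEADINGS:
            self._flush()
            self._prefix = self.HEADINGS[tag]
        elif tag == "button":
            self._flush()
            self._prefix = "[button] "
        elif tag == "a":
            self._flush()
            self._prefix, self._suffix = "[", f"]({attrs.get('href', '')})"
        elif tag == "input":
            self._flush()
            self.lines.append(f"[input] {attrs.get('name', '')}".strip())

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in self.BLOCKS:
            self._flush()

    def handle_data(self, data):
        # Ignore whitespace and anything inside script/style/noscript.
        if self._skip_depth == 0 and data.strip():
            self._text.append(data.strip())

    def _flush(self):
        # Emit the buffered text as one line, then reset the formatting state.
        if self._text:
            self.lines.append(self._prefix + " ".join(self._text) + self._suffix)
        self._text, self._prefix, self._suffix = [], "", ""


def to_markdown(html: str) -> str:
    parser = MarkdownExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()
    return "\n".join(parser.lines)


def estimate_tokens(text: str) -> int:
    # Crude assumption: about 4 characters per token. Real tokenizers differ.
    return max(1, len(text) // 4)


# Hypothetical page, invented for this example.
SAMPLE_HTML = """
<html><head><style>.btn{color:red}</style>
<script>window.track = function(){ /* analytics */ };</script></head>
<body>
<div class="wrapper" data-reactid="1"><div class="inner" style="margin:0">
<h1 class="title">Book a trek</h1>
<p class="lead">Choose a date to continue.</p>
<input type="date" name="trek_date" class="form-control">
<button class="btn btn-primary" id="submit">Continue</button>
<a class="link" href="/help">Help</a>
</div></div>
</body></html>
"""

markdown = to_markdown(SAMPLE_HTML)
print(markdown)
print(estimate_tokens(SAMPLE_HTML), "->", estimate_tokens(markdown))
```

Even on this tiny page, dropping scripts, styles, wrapper elements, and presentational attributes shrinks the text the model has to read. The savings Raj reports come from the same effect on real pages, where markup and scripts usually far outweigh the visible content.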

Real-World Examples and Implementation Insights

To illustrate his point, Raj demonstrates two real-world scenarios. The first involves an AI agent attempting to download an Aadhaar card from a government website. While the agent eventually succeeds, it encounters delays and requires multiple steps, including taking screenshots and scrolling, to complete the task. This highlights the agent's struggle with the website's interface and the need for a more direct and efficient interaction method.

The second example involves booking a trek on a Karnataka tourism website. Here the agent struggles with the site's interface, which is in Kannada, a language the agent may not be fully proficient in. It gets stuck trying to select a date, which underscores the importance of visual cues and of a more flexible interaction approach that does not rely solely on precise text recognition or model-based predictions.

Raj also touches on the technical implementation, mentioning tools and techniques that capture DOM mutations in real time. Tracking these changes gives a granular view of how the page evolves, so the agent can react appropriately to dynamic content and to the effects of its own actions. He emphasizes that the agent must be able to discern what is actually happening on the page, not just predict what should happen based on its model.
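The summary does not name the tools Raj uses. In a live browser, mutation tracking would typically build on the standard `MutationObserver` web API. As a rough offline analogue, the sketch below swaps in a different technique: it diffs two snapshots of a page's interactive elements and reports what was added, removed, or changed. The `Mutation` type, the `diff_snapshots` function, and the snapshot contents are hypothetical and exist only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mutation:
    kind: str        # "added", "removed", or "changed"
    element_id: str
    detail: str


def diff_snapshots(before: dict[str, str], after: dict[str, str]) -> list[Mutation]:
    """Compare two {element_id: description} snapshots of a page."""
    mutations = []
    for element_id in sorted(after.keys() - before.keys()):
        mutations.append(Mutation("added", element_id, after[element_id]))
    for element_id in sorted(before.keys() - after.keys()):
        mutations.append(Mutation("removed", element_id, before[element_id]))
    for element_id in sorted(before.keys() & after.keys()):
        if before[element_id] != after[element_id]:
            detail = f"{before[element_id]} -> {after[element_id]}"
            mutations.append(Mutation("changed", element_id, detail))
    return mutations


# Hypothetical snapshots taken before and after the agent picks a date.
before = {
    "date": "input: trek_date",
    "submit": "button: Continue (disabled)",
}
after = {
    "date": "input: trek_date",
    "submit": "button: Continue (enabled)",
    "cookie-banner": "dialog: Accept cookies?",
}

for mutation in diff_snapshots(before, after):
    print(mutation.kind, mutation.element_id, "|", mutation.detail)
```

Passing the agent only these differences, rather than the whole page after every action, tells it what actually happened: here, a cookie dialog appeared and the Continue button became enabled.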

Ultimately, Raj advocates for a future where browser agents are not just powerful language models, but also possess sophisticated 'eyes' that can accurately perceive, interpret, and interact with the visual and structural complexities of the internet, leading to faster, more reliable, and more efficient automation.

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.