Agents are Robots Too: The Infrastructure Imperative for Next-Gen AI

The prevalent misconception that perception is the primary hurdle and planning a mere afterthought has historically stymied progress in self-driving technology. As Jesse Hu, a seasoned ML engineer and founder of Abundant, powerfully argued, "Everyone thought perception was hard and planning was easy. It took 8-10 years to learn we had it backwards." This profound insight from robotics, he contends, is precisely the pattern repeating itself in the nascent field of AI agents, where the focus on predictive models overshadows the intricate demands of robust action and execution.

Hu, drawing on his extensive background at Google’s YouTube and Waymo, presented a compelling case for re-evaluating our approach to building AI agents during his talk for AI Code 2025. His core argument centers on the surprising parallels between robotics and digital agents, highlighting critical lessons about embodiment, statefulness, simulation, and the often-underestimated importance of infrastructure over raw model performance. Abundant, his current venture, applies these large-scale reinforcement learning and simulation techniques to developing sophisticated coding agents.

One of Hu's foundational insights is the "1% vs 99% Problem." While the core AI model might represent a mere 1% of the system's complexity, the remaining 99% encompasses the vast ecosystem required for real-world application. In robotics, this includes sensors, actuators, integration, deployment, monitoring, simulation, and the entire training pipeline. For digital agents, this "body" translates to tools, APIs, terminals, browsers, entire operating systems, logging, and observability. The offline stack, encompassing continuous training, fine-tuning, robust simulation environments, and human feedback loops, becomes paramount. "The winning team [is] not just having the best model and the best online stack, but having the best offline stack because that enables developers to be much faster and ship much more reliably," Hu asserted, underscoring that infrastructure, not just algorithmic brilliance, determines success.

The distinction between closed-loop and open-loop systems is another critical parallel. In self-driving, a closed-loop system allows a car to turn its wheel, measure the actual turn, adjust as needed, and receive continuous feedback on its position. This iterative process ensures precise execution. Conversely, many current agent interactions, such as executing a bash command, resemble an open-loop system: a command is issued, but there's no inherent mechanism to measure its completion, adjust if it's off course, or receive immediate feedback. This often leads to hung processes and failures, emphasizing the urgent need for agents to operate within closed-loop control systems.
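The contrast can be sketched in a few lines of Python. This is a minimal illustration, not an implementation from the talk; the function names and return shape are assumptions for the example's sake:

```python
import subprocess

def run_open_loop(cmd):
    # Open loop: fire the command and move on, with no way to know
    # whether it finished, hung, or failed.
    subprocess.Popen(cmd, shell=True)

def run_closed_loop(cmd, timeout=10):
    # Closed loop: issue the command, then *measure* the outcome.
    # A deadline prevents hung processes, and captured output plus the
    # exit code give the agent feedback it can act on.
    try:
        result = subprocess.run(
            cmd, shell=True, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "error": "timeout", "output": ""}
    return {
        "ok": result.returncode == 0,
        "error": result.stderr.strip() or None,
        "output": result.stdout.strip(),
    }
```

The agent calling `run_closed_loop` can branch on `ok` and retry or re-plan, which is exactly the feedback path the open-loop version lacks.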

Furthermore, agents currently grapple with the "Clock Rate Problem" and time discretization. Self-driving systems operate with explicit control loops at high frequencies (e.g., 50 Hz), constantly checking sensors, planning trajectories, and executing controls in real-time. Agents, however, largely reason in sequence, assuming discrete turns and operating with an implicit clock rate. This sequential approach means agents might execute a command and merely "hope it worked" rather than receiving instantaneous feedback. This limits their ability to react to dynamic changes, such as a sudden pop-up in a browser, leading to potential missteps and cascading errors, a challenge well-documented in robotics.
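An explicit control loop of the kind self-driving systems use can be sketched as follows; this is a toy illustration with made-up function names, not code from any real stack:

```python
import time

def control_loop(sense, plan, act, hz=50, steps=100):
    # Explicit clock: every tick we sense, plan, and act, so the
    # controller can react to a change (a pop-up, a stalled process)
    # within one period instead of after a whole conversational turn.
    period = 1.0 / hz
    for _ in range(steps):
        t0 = time.monotonic()
        observation = sense()        # fresh snapshot of the world
        command = plan(observation)  # re-plan against what is true *now*
        act(command)                 # execute one small control step
        # Sleep off the remainder of the period to hold the clock rate.
        time.sleep(max(0.0, period - (time.monotonic() - t0)))
```

A sequential agent, by contrast, runs one long `plan` step and executes its output without ever re-sensing, which is why it can only "hope it worked."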

The choice of input and action spaces also carries profound implications. In robotics, designers consciously decide whether to use monocular video or point clouds for sensing, and whether to act with coarse XY coordinates or fine-grained velocities and accelerations. For agents, these decisions manifest in how they perceive their environment (e.g., full file contents, directory listings, or a character-by-character terminal stream) and how they act (e.g., one token at a time, one tool call, or mouse/keyboard inputs at 30 frames per second). These conscious design choices dictate the agent's flexibility and capability.
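The coarse-versus-fine trade-off for a GUI agent might be typed out as below. The class names are hypothetical, chosen only to make the design choice concrete:

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class ClickElement:
    # Coarse action space: "click the Submit button". Easy to learn and
    # log, but the agent can only do what the interface exposes.
    element_id: str

@dataclass
class MouseMove:
    # Fine-grained action space: raw pointer deltas emitted at frame
    # rate. Maximally flexible, but far harder to learn and supervise.
    dx: int
    dy: int

@dataclass
class KeyPress:
    key: str

CoarseAction = ClickElement
FineAction = Union[MouseMove, KeyPress]
```

Choosing one of these spaces up front, as robotics designers do with point clouds versus monocular video, fixes what the agent can and cannot express.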

The transition from stateless to stateful operations is another significant evolutionary step. Just as a real car in autonomous driving exists persistently in space and time, accumulating history and influencing future actions, advanced agents require statefulness. Unlike simple, one-shot conversational agents that start fresh with each interaction, multi-turn agents and stateful virtual machines (VMs) retain memory, allowing processes to run persistently and files to be stored. This statefulness introduces immense complexity but is vital for agents to handle intricate, long-running tasks and build coherent context over time.
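The stateless-versus-stateful distinction reduces to where memory lives. A minimal sketch, with illustrative class and method names:

```python
class StatelessAgent:
    # One-shot: every call starts from a blank slate, so nothing
    # from earlier interactions can influence this one.
    def answer(self, prompt):
        return f"response to: {prompt}"

class StatefulSession:
    # Multi-turn: history, working files, and (in a real system)
    # running processes persist across turns, so later actions can
    # build on earlier ones.
    def __init__(self):
        self.history = []  # accumulated conversation turns
        self.files = {}    # toy stand-in for a persistent VM disk

    def answer(self, prompt):
        self.history.append(prompt)
        return f"response to: {prompt} (turn {len(self.history)})"

    def write_file(self, path, content):
        self.files[path] = content  # survives into future turns
```

The cost of the second design is everything the first design never had to worry about: snapshotting, recovery, and cleanup of accumulated state.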

A major pitfall, both in robotics and agents, is the "DAgger and Out of Distribution" problem, or overfitting to human demonstrations. Training an agent solely on human-provided examples, much like teaching Mario Kart by only observing perfect human drives, leaves it vulnerable to failure when encountering novel situations. As Hu illustrated, a browser agent trained without encountering a specific pop-up will get confused and make wrong clicks, leading to "cascading errors." This highlights that "actions have consequences," and simply mimicking human behavior is insufficient for robust real-world performance.
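The classic remedy, the DAgger algorithm the problem is named after, can be outlined in a few lines. This is a schematic sketch of the loop, not a full implementation; `expert`, `train`, and `rollout` are placeholders the caller supplies:

```python
def dagger(expert, train, rollout, rounds=3):
    # DAgger in outline: start from expert demonstrations, then repeatedly
    # roll out the *learned* policy, ask the expert what it would have done
    # in the states the policy actually visited, and retrain on the union.
    # This covers the off-distribution states (the unexpected pop-up) that
    # pure imitation of perfect demonstrations never sees.
    dataset = [(s, expert(s)) for s in rollout(expert)]   # seed with demos
    policy = train(dataset)
    for _ in range(rounds):
        visited = rollout(policy)                     # policy's own states
        dataset += [(s, expert(s)) for s in visited]  # expert relabels them
        policy = train(dataset)                       # aggregate and retrain
    return policy
```

The key move is that training data comes from the distribution of states the policy itself induces, which is what pure behavior cloning gets wrong.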

The profound implication of actions having consequences, especially in a messy real world, necessitates a robust simulation environment. Production environments are expensive and carry real risks. Test tracks offer limited scenarios. Simulation, however, provides a safe, fast, and reproducible space to test millions of scenarios. Hu emphasized, "Once you have: Statefulness, Actions with consequences, A messy real world, You NEED simulation." This allows for playing out counterfactuals, exploring various outcomes, and iteratively refining agent behavior before real-world deployment.
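What "playing out counterfactuals" means mechanically is branching one saved state into several futures. A toy simulator, with invented state and dynamics purely for illustration:

```python
import copy

class SimEnv:
    # A minimal deterministic simulator: state can be snapshotted and
    # restored, so the same scenario can replay under different actions.
    def __init__(self):
        self.state = {"pos": 0, "crashed": False}

    def step(self, action):
        # action is -1, 0, or +1; drifting too far is the toy
        # "consequence" standing in for real-world risk.
        self.state["pos"] += action
        if abs(self.state["pos"]) > 2:
            self.state["crashed"] = True
        return copy.deepcopy(self.state)

def play_counterfactuals(env, plans):
    # Branch the same starting state into several futures, safely:
    # each plan runs against its own snapshot, never the original.
    outcomes = {}
    for name, actions in plans.items():
        branch = copy.deepcopy(env)
        for a in actions:
            result = branch.step(a)
        outcomes[name] = result
    return outcomes
```

Because every branch is a copy, a catastrophic outcome in simulation costs nothing, which is precisely what production environments cannot offer.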

The critical difference between predictive models and action models underscores many of these challenges. Predictive models, like large language models in a chat interface, excel at generating coherent responses based on context. They aim for perfect prediction and coherent reasoning. Action models, however, demand closed-loop control, error recovery, and meticulous state management. This is where the bulk of the engineering work lies. The initial focus on perception in self-driving, similar to the current fascination with predictive capabilities in LLMs, obscured the deeper complexities of planning and execution.
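The structural difference shows up in code: a predictive model is a pure function from context to text, while an action model wraps prediction in verification and recovery. A hedged sketch with illustrative names:

```python
def predictive_model(context):
    # A predictive model's whole job: context in, plausible text out.
    return f"plausible continuation of: {context}"

def action_model(goal, execute, check, max_retries=3):
    # An action model wraps execution in closed-loop control: act,
    # verify the actual result, and recover, rather than emit-and-hope.
    for attempt in range(max_retries):
        outcome = execute(goal)
        if check(outcome):  # measure what really happened
            return outcome
        # Adjust the goal with failure context and try again.
        goal = f"{goal} (retry {attempt + 1})"
    raise RuntimeError("all recovery attempts failed")
```

The retry-and-check scaffolding around the call, not the call itself, is where the bulk of the engineering work described above lives.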

Both self-driving and coding agents benefit from what Hu terms "The Lucky Interface." These domains offer predefined machines or terminals, electronic controls or interfaces, built-in telemetry or observability, and standardized actions. This structured environment makes it comparatively easier to develop agents than in domains lacking such inherent organization. This "luck" has propelled early successes but also hints at the steeper challenges awaiting agents in less structured, less observable real-world tasks.

Ultimately, the path to truly capable AI agents involves a continuous "hillclimbing at scale" loop: learning, simulating, deploying, and analyzing logs that then feed back into further learning. This iterative cycle, grounded in real-world data and refined through extensive simulation, is essential for incremental progress. With current automation rates on real work hovering around 2.5%, Hu aptly concluded, "This is where self-driving was in 2015. We're at the beginning." The journey toward robust, autonomous agents demands a shift in focus from merely impressive predictive capabilities to building comprehensive, resilient infrastructure that enables agents to act, learn, and recover effectively in an unpredictable world.
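The hillclimbing loop described above can be summarized as a skeleton, with all stage functions supplied by the caller; this is a schematic of the cycle, not a claim about Abundant's actual pipeline:

```python
def hillclimb(policy, train, simulate, deploy, analyze, iterations=3):
    # The learn -> simulate -> deploy -> analyze loop: each cycle's
    # production logs become the next cycle's training data.
    logs = []
    for _ in range(iterations):
        policy = train(policy, logs)   # learn from accumulated failures
        if simulate(policy):           # gate deployment on sim results
            new_logs = deploy(policy)  # run on real work, collect logs
            logs += analyze(new_logs)  # mine failures for training signal
    return policy
```

Each pass through the loop is one increment of the "hillclimbing at scale" that Hu argues will carry agents from today's roughly 2.5% automation rate toward self-driving-style maturity.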