"A lot of AI is ultimately software engineering with different vocabulary and a little bit of non-determinism," posits Jason Davenport, a sentiment echoed by Aja Hammerly, both from Google Cloud Tech. Their recent "AI Agent Dance Off" on the "Real Terms for AI" series offered a compelling whiteboard comparison of two distinct architectural approaches to building AI coding agents. The discussion, aimed at demystifying the complexities of AI development for a discerning audience of founders, VCs, and AI professionals, centered on leveraging Large Language Models (LLMs) for task planning, developing effective code generation and evaluation loops, and integrating contextual information to enhance agent performance, particularly through the lens of Test-Driven Development (TDD).
Aja's initial design for an AI coding agent presented a straightforward, almost intuitive, workflow. A user's prompt, such as "build a calculator," initiates the process, leading an LLM to formulate a plan. This plan then flows into a "Gen Code" module, which generates the necessary code. The generated code proceeds to an "Exec Code" function for execution. Crucially, any errors or output from the execution phase are fed back directly to the "Gen Code" module, creating an iterative loop for refinement until the code ideally functions as intended. Once a successful result is achieved, it cycles back to the original LLM and then to the user, culminating in a seemingly happy outcome.
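To make the shape of that loop concrete, here is a minimal Python sketch of the design as described. The `llm_plan` and `llm_generate_code` helpers are hypothetical placeholders for model calls, and the retry cap is an assumption added for safety rather than something on Aja's whiteboard.

```python
# Hypothetical sketch of Aja's initial design: the LLM plans once, then a
# Gen Code / Exec Code loop retries on errors until the code runs.
import subprocess
import tempfile

def llm_plan(prompt: str) -> str:
    """Placeholder: ask an LLM to turn the user's prompt into a plan."""
    raise NotImplementedError

def llm_generate_code(plan: str, feedback: str | None = None) -> str:
    """Placeholder: ask an LLM to generate (or repair) code for the plan."""
    raise NotImplementedError

def execute(code: str) -> tuple[bool, str]:
    """'Exec Code': run the generated code and return (success, output or error)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(["python", path], capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

def simple_agent(prompt: str, max_attempts: int = 5) -> str:
    plan = llm_plan(prompt)                        # the LLM formulates a plan
    feedback = None
    for _ in range(max_attempts):                  # cap added here; the whiteboard loop has no guard
        code = llm_generate_code(plan, feedback)   # "Gen Code"
        ok, output = execute(code)                 # "Exec Code"
        if ok:
            return output                          # success cycles back to the LLM and the user
        feedback = output                          # errors feed straight back to Gen Code
    raise RuntimeError("Gave up after repeated errors.")
```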
However, this seemingly elegant simplicity quickly revealed potential vulnerabilities. Jason astutely probed the limitations of such a direct feedback loop, questioning what happens if the "Gen Code" and "Exec Code" modules become trapped in an infinite loop of errors and attempted fixes. More fundamentally, he asked, "what happens if the generated code doesn't actually address what we've asked for in our original plan?" This highlights a critical insight: a purely reactive, error-correction loop, without higher-level oversight, risks producing code that runs cleanly but never addresses the original request. Aja acknowledged these issues, proposing a modification where errors and outputs loop back to the central LLM itself, enabling the agent to evaluate progress, adjust its overarching plan, and provide more informed input for code generation. This introduces a necessary layer of meta-cognition, allowing the agent to self-correct at a strategic level rather than a merely tactical one.
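A hedged sketch of that revision, reusing the placeholder helpers from the previous snippet: the `llm_evaluate` call and the `Verdict` structure are assumptions standing in for "ask the central LLM whether the goal is met and how the plan should change."

```python
# Hypothetical refinement: execution results return to the central LLM, which
# judges progress against the original request and can revise the plan before
# the next code-generation attempt.
from dataclasses import dataclass

@dataclass
class Verdict:
    goal_met: bool       # does the output actually address the original request?
    revised_plan: str    # the LLM's adjusted plan, if course-correction is needed
    notes: str           # guidance to pass to the next code-generation attempt

def llm_evaluate(prompt: str, plan: str, code: str, output: str) -> Verdict:
    """Placeholder: ask the central LLM to judge progress against the original goal."""
    raise NotImplementedError

def reflective_agent(prompt: str, max_attempts: int = 5) -> str:
    plan = llm_plan(prompt)
    feedback = None
    for _ in range(max_attempts):
        code = llm_generate_code(plan, feedback)
        ok, output = execute(code)
        verdict = llm_evaluate(prompt, plan, code, output)   # strategic check, not just error-fixing
        if ok and verdict.goal_met:
            return output
        plan = verdict.revised_plan    # adjust the overarching plan
        feedback = verdict.notes       # more informed input for the next generation
    raise RuntimeError("Agent could not satisfy the original request.")
```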
Jason's design, while sharing foundational principles with Aja's, takes a more robust and nuanced approach from the outset, emphasizing context and a multi-tiered evaluation strategy. He starts by augmenting the initial prompt with comprehensive "Context," encompassing the existing codebase, established rules, and Model Context Protocol (MCP) elements. This critical step provides the LLM with the necessary background knowledge upfront, preventing the agent from operating in a vacuum. As Jason articulated, "a junior dev isn't going to be able to write good code if they don't know what I'm actually expecting." This initial contextualization is a key insight, ensuring the AI agent possesses the equivalent of institutional knowledge before embarking on a task.
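The exact shape of that context isn't spelled out in the episode, but a simple illustration might bundle it like this; the field names and prompt layout are assumptions.

```python
# Illustrative only: one way to fold the existing codebase, project rules, and
# MCP-provided resources into the prompt before the LLM plans anything.
from dataclasses import dataclass, field

@dataclass
class AgentContext:
    codebase_summary: str                                    # key modules, interfaces, conventions
    project_rules: list[str]                                 # style guides, review rules
    mcp_resources: list[str] = field(default_factory=list)   # docs/tools exposed via MCP servers

def build_contextual_prompt(user_prompt: str, ctx: AgentContext) -> str:
    """Give the LLM its 'institutional knowledge' up front, alongside the task."""
    rules = "\n".join(f"- {r}" for r in ctx.project_rules)
    resources = "\n".join(f"- {r}" for r in ctx.mcp_resources)
    return (
        f"Task: {user_prompt}\n\n"
        f"Existing codebase:\n{ctx.codebase_summary}\n\n"
        f"Project rules:\n{rules}\n\n"
        f"Available MCP resources:\n{resources}\n"
    )
```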
Following this contextualization, Jason's LLM is tasked with formulating a "high-level plan," breaking down complex coding tasks into manageable, sequential steps. This foresight addresses the inherent complexity of real-world development, acknowledging that "any reasonably complex coding task isn't going to be just a single-step to actually get there." Each step of this high-level plan then enters a localized "Plan -> Eval -> Execute" loop, similar in concept to Aja's refined model. The "Execute" phase in Jason’s design is empowered by a suite of tools, including Java, linting, formatting, style checkers, and direct execution capabilities, reflecting the diverse requirements of modern software development.
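Putting those pieces together, one plausible and deliberately simplified rendering of the outer structure is below. It reuses the placeholder helpers from the earlier sketches; the specific tools in the toolchain and the one-step-per-line plan format are illustrative assumptions, not details from the talk.

```python
# Rough sketch of the outer structure: a high-level, multi-step plan, with each
# step running through its own localized plan -> execute -> eval loop backed by tools.
import subprocess

def run_tool(cmd: list[str]) -> tuple[bool, str]:
    """Run one tool (linter, formatter, test runner, or the code itself)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

# Example toolchain; these particular tools are illustrative choices.
TOOLCHAIN = [
    ["ruff", "check", "."],     # linting / style checking
    ["black", "--check", "."],  # formatting
    ["python", "main.py"],      # direct execution
]

def apply_to_workspace(code: str) -> None:
    """Placeholder: write the generated code into the working tree."""
    raise NotImplementedError

def stepwise_agent(user_prompt: str, ctx: AgentContext, max_attempts: int = 3) -> None:
    prompt = build_contextual_prompt(user_prompt, ctx)
    high_level_plan = llm_plan(prompt)           # multi-step plan formulated up front
    for step in high_level_plan.splitlines():    # assume one step per line
        feedback = None
        for _ in range(max_attempts):            # localized plan -> execute -> eval loop
            code = llm_generate_code(step, feedback)
            apply_to_workspace(code)
            results = [run_tool(cmd) for cmd in TOOLCHAIN]   # "Execute" with real tools
            if all(ok for ok, _ in results):     # per-step eval feeds the next step's planning
                break
            feedback = "\n".join(out for ok, out in results if not ok)
```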
A pivotal distinction in Jason's architecture lies in its sophisticated evaluation mechanism, particularly the integration of Test-Driven Development (TDD) principles. His model incorporates two distinct evaluation stages: a per-task "Eval" that feeds back into the planning for the *next* step, and an overarching "Eval" that assesses progress against the initial, high-level goals. This layered evaluation is another core insight. The final "Eval" is where TDD truly shines, as Jason explains, "we're starting with the end in mind with our plan." By generating tests *before* writing the code, the agent is explicitly guided by the desired outcomes, ensuring that the generated solution not only works but directly addresses the problem statement. This preventative measure significantly mitigates the risk of producing functionally correct but misaligned code, a challenge identified in simpler designs. Furthermore, he noted that "sometimes having less context for individual steps can reduce distractions and sometimes ultimately produce better code," underscoring the value of focused, modular execution within a larger, well-defined framework.
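A small sketch of how that test-first final evaluation could sit on top of the step loop above; `llm_generate_tests`, the test filename, and the pytest invocation are all assumptions rather than details from the episode.

```python
# TDD-flavored final evaluation: tests are generated from the high-level goals
# *before* implementation, and the overarching "Eval" simply runs them at the end.
def llm_generate_tests(prompt: str, plan: str) -> str:
    """Placeholder: ask the LLM to write tests that encode the desired outcome."""
    raise NotImplementedError

def tdd_agent(user_prompt: str, ctx: AgentContext) -> bool:
    prompt = build_contextual_prompt(user_prompt, ctx)
    high_level_plan = llm_plan(prompt)

    # Start with the end in mind: tests exist before any implementation code does.
    tests = llm_generate_tests(prompt, high_level_plan)
    with open("test_generated.py", "w") as f:
        f.write(tests)

    # Per-step plan/execute/eval loops handle the implementation work.
    stepwise_agent(user_prompt, ctx)

    # Overarching eval: does the finished work satisfy the original, high-level goals?
    ok, _report = run_tool(["pytest", "test_generated.py"])
    return ok
```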
The "AI Agent Dance Off" ultimately revealed that while both Aja and Jason champion the iterative "plan, execute, evaluate" loop, the devil is in the details of orchestration and contextual integration. Aja’s journey from a direct feedback loop to one involving the central LLM for strategic evaluation underscored the necessity of intelligent oversight. Jason’s more intricate design, with its emphasis on pre-contextualization, high-level planning, and a multi-faceted evaluation incorporating TDD, illustrates a path toward more robust, goal-oriented, and less error-prone AI agent development. Their discussion highlighted that effective AI agent design isn't just about chaining powerful LLMs, but about architecting sophisticated workflows that mirror and even enhance human development practices, addressing the inherent non-determinism with structured intelligence.

