In the ever-evolving world of AI, ensuring the reliability and effectiveness of agents is paramount. Phil Hetzel, Head of Solutions Engineering at Braintrust, recently shared his insights on the complexities of building robust evaluation platforms for AI agents. Speaking to a packed audience at AI Engineer Europe, Hetzel emphasized that while the concept of evaluating AI might seem straightforward, the reality is far more intricate, presenting a unique set of challenges for both technical and non-technical teams.
Meet Phil Hetzel: Bridging AI Engineering and Business Needs
Hetzel brings a wealth of experience to the table, with over twelve years in consulting and implementation. His background includes a significant role as a former leader of Slalom's global Databricks business unit, where he honed his skills in data architecture and scaling solutions. This practical experience, combined with his current role at Braintrust, gives him a unique perspective on how to translate complex AI challenges into actionable engineering strategies.
The Core Problem: AI Agent Variability
Hetzel kicked off his presentation by highlighting a fundamental challenge in AI development: the inherent variability of large language models (LLMs). He explained that these models, while powerful, can produce inconsistent outputs, making it difficult to guarantee predictable performance. "LLMs have extreme variability," Hetzel stated, setting the stage for why dedicated evaluation mechanisms are crucial. As agents become increasingly central to customer interactions, ensuring their quality and reliability before they engage with users is no longer optional.
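To make that variability concrete, here is a minimal sketch (not from Hetzel's talk) that runs the same prompt several times and measures how often the outputs agree. The model client is passed in as a function, so nothing here assumes a particular vendor's API:

```python
from collections import Counter
from typing import Callable

def measure_consistency(call_model: Callable[[str], str],
                        prompt: str, n: int = 10) -> float:
    """Run the same prompt n times and return the share of runs that
    match the most common answer (1.0 = fully consistent)."""
    outputs = [call_model(prompt).strip().lower() for _ in range(n)]
    top_count = Counter(outputs).most_common(1)[0][1]
    return top_count / n

# With a deterministic stub the score is 1.0; a real LLM client at
# non-zero temperature will typically score lower on open-ended prompts.
print(measure_consistency(lambda p: "42", "What is 6 x 7?"))  # 1.0
```

A score well below 1.0 means the agent answers the same question differently on different runs, which is exactly the behavior an evaluation platform has to catch before users do.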
The Rise of Agents and the Need for Eval Platforms
Hetzel then turned to the growing prevalence of agents in customer-facing roles, noting that they are rapidly becoming the norm for how businesses interact with their customers. This trend directly amplifies the need for sophisticated evaluation platforms. "You need to become confident with how your agent will perform," he stressed. Without a robust system to test and validate agent behavior, companies risk deploying unreliable AI that could damage customer relationships and brand reputation.
Beyond Spreadsheets: The Complexity of Agent Evaluation
Hetzel presented a compelling analogy of an iceberg to illustrate the complexity of building effective evaluation platforms. The visible tip of the iceberg represents the basic functionality—the UI, input examples, and output display. However, the vast majority of the work, the submerged part of the iceberg, involves a complex interplay of underlying technologies and multi-persona workflows. This includes elements like human annotation, online scoring, prompt engineering, observability, and specialized functions. Hetzel emphasized that layering a basic UI on top of a spreadsheet is insufficient for truly understanding and improving agent performance. "It's way more complicated than that," he asserted, pointing to the need for a systems-level approach that addresses the intricate details of agent behavior.
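One way to see why a flat spreadsheet falls short: a single evaluated example already carries a nested multi-step trace, results from several scorers, and annotation state. The hypothetical data model below (an illustration, not Braintrust's actual schema) shows the structure a spreadsheet row cannot hold:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """One step in an agent trace: an LLM call, a tool call, etc."""
    name: str
    input: str
    output: str
    children: list["Span"] = field(default_factory=list)

@dataclass
class EvalRecord:
    """Everything attached to a single evaluated example. The nested
    trace, per-scorer results, and annotation state are what a flat
    spreadsheet row cannot represent."""
    example_id: str
    trace: Span                     # full multi-step agent run
    scores: dict[str, float] = field(default_factory=dict)  # e.g. {"factuality": 0.8}
    human_label: str | None = None  # filled in later by an annotation queue
```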
The "So What?" Problem in AI Evaluation
A key challenge Hetzel identified is the "so what?" problem in evaluation. Simply running tests and collecting scores isn't enough if those scores don't translate into tangible improvements. He highlighted that effective evaluation platforms must not only measure agent quality at scale but also deliver actionable insights. This means identifying patterns, understanding what's breaking, and crucially, determining why and for whom these issues are occurring. Hetzel also stressed the importance of near real-time feedback on application performance, and of making the evaluation process easy to onboard, both for human reviewers and for the agents themselves. "Your subject matter experts can impact agents," he said, underscoring the collaborative nature of building high-quality AI.
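A small sketch of what "for whom" can look like in practice: grouping eval results by a metadata field such as customer segment turns a pile of scores into a breakdown of where quality drops. The field names and 0.5 pass threshold here are illustrative assumptions:

```python
from collections import defaultdict

def failure_rate_by_segment(results: list[dict]) -> dict[str, float]:
    """results: [{"segment": "enterprise", "score": 0.3}, ...]
    Returns the failure rate per segment, counting score < 0.5 as a
    failure -- a 'what is breaking, and for whom' view."""
    counts = defaultdict(lambda: {"fail": 0, "total": 0})
    for r in results:
        seg = counts[r["segment"]]
        seg["total"] += 1
        seg["fail"] += r["score"] < 0.5
    return {s: c["fail"] / c["total"] for s, c in counts.items()}

print(failure_rate_by_segment([
    {"segment": "enterprise", "score": 0.2},
    {"segment": "enterprise", "score": 0.9},
    {"segment": "self-serve", "score": 0.8},
]))  # {'enterprise': 0.5, 'self-serve': 0.0}
```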
The Flywheel: Connecting Observability and Evals
Hetzel introduced the concept of a "flywheel" to describe the continuous cycle of improvement for AI agents. This flywheel connects observability and evaluation, creating a feedback loop where production traces inform new evaluation datasets, which in turn lead to quality improvements with each deployment. The process involves observing agent behavior, analyzing the data to find patterns and failure modes, evaluating those failures to create new test cases, and finally, improving the agent based on these insights. He elaborated, "Production traces become eval datasets. Quality improves with every deployment. Debugging moves from reactive to proactive and data-driven." This iterative process is essential for building increasingly reliable and effective AI agents over time.
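The "production traces become eval datasets" step can be sketched as a simple transformation: pull out the traces that scored poorly or were flagged by users, and reshape them into eval cases for the next release. The trace fields used here are hypothetical, chosen only to illustrate the flywheel turn:

```python
def traces_to_eval_dataset(traces: list[dict],
                           score_threshold: float = 0.5) -> list[dict]:
    """One turn of the flywheel: keep production traces that scored
    poorly (or were flagged by users) and reshape them into eval
    cases to run against the next deployment."""
    dataset = []
    for t in traces:
        if t.get("online_score", 1.0) < score_threshold or t.get("user_flagged"):
            dataset.append({
                "input": t["input"],
                # the bad production output becomes a regression case;
                # an expected output is added later by a human reviewer
                "expected": None,
                "metadata": {"source_trace": t["id"]},
            })
    return dataset
```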
Building the Right System: Beyond Basic Tools
Hetzel touched upon the evolution of Braintrust's own approach, moving from a system that relied on a basic Postgres database and an open-source warehouse to a more sophisticated logging and tracing platform. He acknowledged that while initial solutions might seem sufficient, they often struggle to scale and provide the necessary depth of analysis. "Agent traces are nasty, voluminous, and numerous," he pointed out, highlighting the challenge of managing and querying such data effectively. The goal is to move beyond simple reporting tools and build systems that enable dynamic experimentation, allowing users to adjust prompts and other parameters to understand agent behavior more deeply.
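The kind of dynamic experimentation Hetzel described can be reduced to a small pattern: run every prompt variant over the same dataset and compare scores, rather than eyeballing a few traces. The sketch below is a generic illustration, with the agent runner and scorer passed in as functions:

```python
from statistics import mean
from typing import Callable

def compare_prompt_variants(
    variants: dict[str, str],              # name -> prompt template
    dataset: list[dict],                   # [{"input": ..., "expected": ...}]
    run_agent: Callable[[str, str], str],  # (prompt, input) -> output
    score: Callable[[str, str], float],    # (output, expected) -> 0..1
) -> dict[str, float]:
    """Run each prompt variant over the same dataset and return the
    mean score per variant, so a prompt change is judged on data."""
    return {
        name: mean(score(run_agent(prompt, ex["input"]), ex["expected"])
                   for ex in dataset)
        for name, prompt in variants.items()
    }
```

Holding the dataset fixed across variants is what makes the comparison meaningful: any score difference comes from the prompt change, not from the examples.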
The Path Forward: Continuous Improvement
Looking ahead, Hetzel emphasized that achieving operational excellence with AI agents requires a commitment to continuous improvement. This involves not only building robust evaluation platforms but also fostering a culture of experimentation and data-driven decision-making. The ability to deliver insights into the "unknown unknowns" of agent behavior and to make the evaluation process accessible to all stakeholders will be critical for success in the rapidly advancing field of AI.
