Agent vs. Traditional Observability: Braintrust's Phil Hetzel Explains

Phil Hetzel of Braintrust discusses the fundamental differences between traditional observability and the specialized needs of AI agent evaluation.

May 29 at 12:04 AM (8 min read)

Phil Hetzel presenting on agent observability — AI Engineer

TL;DR

AI Agent Challenges: non-determinism and data deluge in AI agent traces
Traditional Observability: tools not built for complex AI agent behavior
Phil Hetzel (Braintrust): expert on AI agent evaluation and observability
Bridging the Gap: moving from technical to functional agent quality
Human Expertise Role: essential for understanding nuanced agent performance
New Agent Observability: specialized tools and approaches are necessary
Ensuring Agent Quality: better monitoring and evaluation of AI agents

As AI agents move into production, teams need ways to understand and monitor how those agents perform. Phil Hetzel, Head of Solution Engineering at Braintrust, recently explained the critical differences between traditional observability and the emerging field of agent observability. Speaking at an AI Engineer event, Hetzel outlined the unique challenges of evaluating AI agents and ensuring their quality, emphasizing that a new set of tools and approaches is necessary.

Who is Phil Hetzel?

Phil Hetzel brings a wealth of experience to the discussion, with twelve years spent in consulting and implementation roles. Previously, he led the global Databricks business unit at Slalom. His background has equipped him with a deep understanding of how to effectively manage and scale complex systems. Hetzel's personal interests include playing chess and spending time with his dachshund, Pistol Pete, as pictured in his presentation.

The Core Challenge: Non-Determinism in AI Agents

Hetzel began by highlighting a fundamental problem: agents are non-deterministic. Unlike traditional applications that follow predictable code paths, AI agents can produce a wide variety of outputs and behaviors even with the same input. This inherent variability makes traditional observability methods, which are designed to measure deterministic metrics and code paths, insufficient for evaluating agent performance.
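To make the point concrete, here is a minimal sketch of what "same input, different outputs" looks like when measured. Everything in it is an assumption for illustration: `toy_agent` and `CANDIDATE_ANSWERS` are hypothetical stand-ins for a real agent, and the random seed exists only so the sketch is repeatable.

```python
import random
from collections import Counter

# Hypothetical stand-in for an agent: it samples one of several plausible
# answers, which is enough to show run-to-run variation on a fixed input.
CANDIDATE_ANSWERS = [
    "Your order ships tomorrow.",
    "The order will ship tomorrow.",
    "I could not find that order.",
]


def toy_agent(prompt: str, rng: random.Random) -> str:
    """Return one sampled answer. The prompt is ignored in this toy."""
    return rng.choice(CANDIDATE_ANSWERS)


def output_distribution(prompt: str, runs: int, seed: int = 0) -> Counter:
    """Send the same prompt `runs` times and count each distinct output."""
    rng = random.Random(seed)
    return Counter(toy_agent(prompt, rng) for _ in range(runs))


if __name__ == "__main__":
    dist = output_distribution("Where is my order?", runs=100)
    for answer, count in dist.most_common():
        print(f"{count:3d}  {answer}")
```

A uptime check would report this agent as healthy on every run, even though some of the sampled answers are wrong for the user. That gap is what the rest of the talk addresses.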

Traditional observability typically focuses on metrics such as uptime, technical performance, bug detection, and computational cost. The core question is often, "Is the system operational?" However, for AI agents, the question extends to evaluating the functional quality of their outputs. Hetzel pointed out that while traditional tools can measure things like latency and error counts, they struggle to capture the nuances of an agent's behavior, such as its factuality, relevance, or adherence to brand guidelines.

The Data Deluge: Agent Traces are 'Nasty'

A significant challenge highlighted by Hetzel is the sheer volume and complexity of agent traces. He described agent traces as potentially "nasty" because they are semi-structured and can contain vast amounts of unstructured text. A single agent trace, representing one complex interaction, can be gigabytes in size, encompassing numerous model calls, tool executions, and conversational turns.

This presents a systems problem: how to efficiently ingest, index, and query this massive and often unstructured data to extract meaningful insights. Traditional observability platforms, designed for smaller, more structured data, are not equipped to handle this scale and complexity without significant adaptation. Hetzel illustrated this with an example of an agent trace that might include multiple LLM calls, each with its own input and output, along with tool usage and intermediate reasoning steps.
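A trace of the kind described above can be pictured as a tree of steps, each carrying free text. The sketch below is one possible shape, not Braintrust's data model: the `Span` class, its field names, and the `kind` labels are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional, Tuple


@dataclass
class Span:
    """One step in a trace (hypothetical schema)."""
    name: str
    kind: str  # assumed labels: "turn", "llm", "tool"
    input: Optional[str] = None
    output: Optional[str] = None
    children: List["Span"] = field(default_factory=list)


def walk(span: Span) -> Iterator[Span]:
    """Yield the span and all of its descendants, depth first."""
    yield span
    for child in span.children:
        yield from walk(child)


def text_bytes(span: Span) -> int:
    """Size in UTF-8 bytes of the text payload on a single span."""
    return sum(
        len(value.encode("utf-8"))
        for value in (span.input, span.output)
        if value is not None
    )


def summarize(root: Span) -> Tuple[Dict[str, int], int]:
    """Count spans by kind and total the text payload across the trace."""
    spans = list(walk(root))
    counts = Counter(s.kind for s in spans)
    return dict(counts), sum(text_bytes(s) for s in spans)


if __name__ == "__main__":
    trace = Span("handle_request", "turn", input="Where is my order?", children=[
        Span("plan", "llm", input="Where is my order?", output="call lookup_order"),
        Span("lookup_order", "tool", input="order_id=123", output="status=shipped"),
        Span("answer", "llm", input="status=shipped", output="Your order has shipped."),
    ])
    print(summarize(trace))
```

Even in this tiny example most of the bytes are free text inside the spans rather than structured fields, which hints at why ingestion, indexing, and querying become the hard part at production scale.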

Bridging the Gap: From Technical to Functional Quality

Hetzel argued that AI observability must go beyond simply monitoring technical metrics. It needs to assess the functional quality of the agent's output. This involves understanding not just how fast an agent responds or how much it costs, but also whether its responses are accurate, relevant, helpful, and aligned with the desired brand persona.

He presented a diagram illustrating the different components involved in evaluating agent quality. On one side, technical metrics like time to first token, total tokens, and duration are measured. On the other side, functional metrics such as factuality, tool use, faithfulness, and context relevance are assessed. Hetzel emphasized that both sets of metrics are crucial for a comprehensive understanding of agent performance.
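The two sides of that diagram can be sketched side by side. In the code below, the technical metrics are simple arithmetic over timestamps and token counts, while `fact_coverage` is a deliberately crude, hypothetical stand-in for a functional metric such as factuality; a real scorer would be considerably more involved. The `LLMCall` record and all field names are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LLMCall:
    """Hypothetical record of one model call; times are in seconds."""
    started_at: float
    first_token_at: float
    finished_at: float
    prompt_tokens: int
    completion_tokens: int
    output: str


def technical_metrics(call: LLMCall) -> Dict[str, float]:
    """Metrics a traditional observability tool could compute."""
    return {
        "time_to_first_token": call.first_token_at - call.started_at,
        "duration": call.finished_at - call.started_at,
        "total_tokens": call.prompt_tokens + call.completion_tokens,
    }


def fact_coverage(output: str, expected_facts: List[str]) -> float:
    """Toy functional score: share of expected facts that appear in the output."""
    if not expected_facts:
        return 1.0
    text = output.lower()
    hits = sum(1 for fact in expected_facts if fact.lower() in text)
    return hits / len(expected_facts)


if __name__ == "__main__":
    call = LLMCall(10.0, 10.5, 12.0, 120, 30, "Your order shipped on Monday.")
    print(technical_metrics(call))
    print(fact_coverage(call.output, ["shipped", "Tuesday"]))
```

The example call looks fine on every technical metric yet covers only half of the expected facts, which is the kind of mismatch that only shows up when both sets of metrics are tracked together.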

The Role of Human Expertise

A key takeaway from Hetzel's presentation was the indispensable role of human expertise in agent observability. While automated metrics are valuable, they often fall short of capturing the full picture of an agent's performance. Human annotators and subject matter experts are vital for:

Uncovering failure modes of an agent.
Annotating and judging traces to provide qualitative feedback.
Implementing agent quality checks that go beyond simple quantitative measures.
Rerunning production inputs in offline evaluations to simulate real-world scenarios.

Hetzel suggested that by involving a diverse range of people, including non-technical personas like domain experts or researchers, teams can gain a more holistic view of how their agents perform and identify areas for improvement that purely technical metrics might miss.
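The last item in the list above, rerunning production inputs in offline evaluations, might look roughly like the following. This is a sketch under stated assumptions rather than a description of any particular platform: `LoggedCase`, `run_offline_eval`, and the substring check are hypothetical, and the reviewer's expectation is reduced to a single phrase the answer must contain.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class LoggedCase:
    """A production input plus what a human reviewer said about it (hypothetical)."""
    input: str
    reviewer_note: str  # free-text annotation from a domain expert
    must_contain: str   # expectation the reviewer wrote down for this input


def run_offline_eval(
    agent: Callable[[str], str], cases: List[LoggedCase]
) -> List[Tuple[LoggedCase, str, bool]]:
    """Rerun each logged input through the agent and check the reviewer's expectation."""
    results = []
    for case in cases:
        output = agent(case.input)
        passed = case.must_contain.lower() in output.lower()
        results.append((case, output, passed))
    return results


def pass_rate(results: List[Tuple[LoggedCase, str, bool]]) -> float:
    """Fraction of cases that met the reviewer's expectation."""
    if not results:
        return 0.0
    return sum(1 for _, _, passed in results if passed) / len(results)


if __name__ == "__main__":
    cases = [
        LoggedCase("Can I return shoes?", "Agent forgot the 30-day limit", "30 days"),
        LoggedCase("Do you ship abroad?", "Answer was fine", "ship"),
    ]

    def candidate_agent(prompt: str) -> str:
        # Stand-in for a new agent version under test.
        return "Yes, we ship worldwide and accept returns."

    results = run_offline_eval(candidate_agent, cases)
    print(f"pass rate: {pass_rate(results):.2f}")
```

The design choice worth noting is that the expectations come from people who reviewed real traces, so a non-technical domain expert can add a case without touching the agent's code.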

What's Next for Agent Observability?

Looking ahead, Hetzel said the goal is to operate AI agents at scale, which requires an evaluation platform that can:

Deliver insights into the "unknown unknowns" of agent behavior.
Be easy to onboard for both humans and agents.

He concluded by emphasizing that building such a comprehensive evaluation and observability system is a complex undertaking, involving numerous components and considerations, from data ingestion and indexing to the integration of both technical and functional quality metrics. The future of reliable AI agents, he suggested, depends on mastering this intricate process.

© 2026 StartupHub.ai. All rights reserved.
