Agent vs. Traditional Observability: Braintrust's Phil Hetzel Explains

Phil Hetzel of Braintrust discusses the fundamental differences between traditional observability and the specialized needs of AI agent evaluation.

Understanding and monitoring how AI agents perform requires more than conventional monitoring. Phil Hetzel, Head of Solution Engineering at Braintrust, recently laid out the differences between traditional observability and the emerging field of agent observability. Speaking at an AI Engineer event, Hetzel outlined the challenges of evaluating AI agents and ensuring their quality, arguing that a new set of tools and approaches is necessary.

Visual TL;DR. AI Agent Challenges contrasts with Traditional Observability. Traditional Observability explained by Phil Hetzel (Braintrust). Phil Hetzel (Braintrust) proposes Bridging the Gap. Bridging the Gap involves Human Expertise Role. Phil Hetzel (Braintrust) advocates New Agent Observability. New Agent Observability enables Ensuring Agent Quality.

  1. AI Agent Challenges: non-determinism and data deluge in AI agent traces
  2. Traditional Observability: tools not built for complex AI agent behavior
  3. Phil Hetzel (Braintrust): expert on AI agent evaluation and observability
  4. Bridging the Gap: moving from technical to functional agent quality
  5. Human Expertise Role: essential for understanding nuanced agent performance
  6. New Agent Observability: specialized tools and approaches are necessary
  7. Ensuring Agent Quality: better monitoring and evaluation of AI agents

Who is Phil Hetzel?

Phil Hetzel brings a wealth of experience to the discussion, with twelve years spent in consulting and implementation roles. Previously, he led the global Databricks business unit at Slalom. His background has equipped him with a deep understanding of how to effectively manage and scale complex systems. Hetzel's personal interests include playing chess and spending time with his dachshund, Pistol Pete, as pictured in his presentation.

The Core Challenge: Non-Determinism in AI Agents

Hetzel began by highlighting a fundamental problem: agents are non-deterministic. Unlike traditional applications that follow predictable code paths, AI agents can produce a wide variety of outputs and behaviors even with the same input. This inherent variability makes traditional observability methods, which are designed to measure deterministic metrics and code paths, insufficient for evaluating agent performance.
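One way to see why deterministic checks fall short is to send the same input repeatedly and tally what comes back. The sketch below is illustrative only: `toy_agent` and `CANDIDATE_REPLIES` are hypothetical stand-ins that sample from canned replies rather than calling a real model, and nothing here comes from the talk or from any Braintrust API.

```python
import random
from collections import Counter

# Hypothetical canned replies standing in for the spread of outputs
# a real agent might produce for one and the same prompt.
CANDIDATE_REPLIES = [
    "Your refund was issued today.",
    "I have issued your refund.",
    "Please contact support about refunds.",
]


def toy_agent(prompt: str, rng: random.Random) -> str:
    """Stand-in for a model call: samples a reply instead of computing one."""
    return rng.choice(CANDIDATE_REPLIES)


def output_distribution(prompt: str, runs: int, seed: int = 0) -> Counter:
    """Send the same prompt `runs` times and tally each distinct reply."""
    rng = random.Random(seed)
    return Counter(toy_agent(prompt, rng) for _ in range(runs))


if __name__ == "__main__":
    dist = output_distribution("Where is my refund?", runs=20)
    for reply, count in dist.most_common():
        print(f"{count:2d}x  {reply}")
```

A single assertion on one expected string would pass or fail depending on which reply happened to be sampled, which is why evaluation has to look at the distribution of outputs rather than one run.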

Traditional observability typically focuses on metrics such as uptime, technical performance, bug detection, and computational cost. The core question is often, "Is the system operational?" However, for AI agents, the question extends to evaluating the functional quality of their outputs. Hetzel pointed out that while traditional tools can measure things like latency and error counts, they struggle to capture the nuances of an agent's behavior, such as its factuality, relevance, or adherence to brand guidelines.

The Data Deluge: Agent Traces are 'Nasty'

Hetzel also highlighted the sheer volume and complexity of agent traces. He described them as potentially "nasty": they are semi-structured and can contain vast amounts of unstructured text. A single trace representing a complex interaction can reach gigabytes in size, encompassing numerous model calls, tool executions, and conversational turns.

This presents a systems problem: how to efficiently ingest, index, and query this massive and often unstructured data to extract meaningful insights. Traditional observability platforms, designed for smaller, more structured data, are not equipped to handle this scale and complexity without significant adaptation. Hetzel illustrated this with an example of an agent trace that might include multiple LLM calls, each with its own input and output, along with tool usage and intermediate reasoning steps.
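To make the shape of such a trace concrete, here is a minimal sketch of a nested span tree and a summary pass over it. The `Span` class, its field names, and the `kind` labels are assumptions chosen for illustration, not the schema of any particular observability product.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Span:
    """One step in an agent trace (hypothetical schema)."""
    kind: str            # assumed labels: "turn", "llm", "tool"
    name: str
    input: str = ""      # free-form text, often the bulk of the payload
    output: str = ""
    children: list[Span] = field(default_factory=list)


def walk(span: Span) -> Iterator[Span]:
    """Yield the span and all of its descendants, depth first."""
    yield span
    for child in span.children:
        yield from walk(child)


def summarize(root: Span) -> dict:
    """Count spans per kind and total the UTF-8 size of all text payloads."""
    kinds = Counter(s.kind for s in walk(root))
    text_bytes = sum(
        len(s.input.encode("utf-8")) + len(s.output.encode("utf-8"))
        for s in walk(root)
    )
    return {"spans_by_kind": dict(kinds), "text_bytes": text_bytes}


if __name__ == "__main__":
    trace = Span("turn", "user_request", input="Find my order", children=[
        Span("llm", "plan", input="Find my order", output="call lookup_order"),
        Span("tool", "lookup_order", input="id=123", output="shipped"),
        Span("llm", "answer", input="shipped", output="Your order has shipped."),
    ])
    print(summarize(trace))
```

Even this toy trace repeats the same text across spans, since each model call carries its own copy of the input. That duplication is one reason payload size grows quickly as an interaction gets longer.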

Bridging the Gap: From Technical to Functional Quality

Hetzel argued that AI observability must go beyond simply monitoring technical metrics. It needs to assess the functional quality of the agent's output. This involves understanding not just how fast an agent responds or how much it costs, but also whether its responses are accurate, relevant, helpful, and aligned with the desired brand persona.

He presented a diagram illustrating the different components involved in evaluating agent quality. On one side, technical metrics like time to first token, total tokens, and duration are measured. On the other side, functional metrics such as factuality, tool use, faithfulness, and context relevance are assessed. Hetzel emphasized that both sets of metrics are crucial for a comprehensive understanding of agent performance.
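The two sides of that diagram can be sketched side by side. In the example below, `CallRecord` and its fields are hypothetical, and `context_overlap` is a deliberately naive word-overlap proxy standing in for a functional score. Real factuality or context-relevance scoring typically relies on a model-based or human judge, which this sketch does not attempt.

```python
from dataclasses import dataclass


@dataclass
class CallRecord:
    """One logged model call (hypothetical fields, timestamps in seconds)."""
    started_at: float
    first_token_at: float
    finished_at: float
    prompt_tokens: int
    completion_tokens: int
    context: str   # retrieved text the answer should be grounded in
    answer: str


def technical_metrics(rec: CallRecord) -> dict:
    """Metrics a traditional observability stack can already compute."""
    return {
        "time_to_first_token": rec.first_token_at - rec.started_at,
        "duration": rec.finished_at - rec.started_at,
        "total_tokens": rec.prompt_tokens + rec.completion_tokens,
    }


def context_overlap(rec: CallRecord) -> float:
    """Toy functional score: share of answer words that appear in the context."""
    answer_words = set(rec.answer.lower().split())
    context_words = set(rec.context.lower().split())
    if not answer_words:
        return 0.0
    return len(answer_words & context_words) / len(answer_words)


if __name__ == "__main__":
    rec = CallRecord(
        started_at=10.0, first_token_at=10.5, finished_at=12.0,
        prompt_tokens=100, completion_tokens=20,
        context="the order shipped on monday",
        answer="order shipped monday maybe",
    )
    print(technical_metrics(rec))
    print(context_overlap(rec))
```

The technical metrics fall straight out of timestamps and counters, while the functional score needs the text itself and a judgment about it. That asymmetry is the gap the talk describes.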

The Role of Human Expertise

A key takeaway from Hetzel's presentation was the indispensable role of human expertise in agent observability. While automated metrics are valuable, they often fall short of capturing the full picture of an agent's performance. Human annotators and subject matter experts are vital for:

  • Uncovering failure modes of an agent.
  • Annotating and judging traces to provide qualitative feedback.
  • Implementing agent quality checks that go beyond simple quantitative measures.
  • Rerunning production inputs in offline evaluations to simulate real-world scenarios.

Hetzel suggested that by involving a diverse range of people, including non-technical personas like domain experts or researchers, teams can gain a more holistic view of how their agents perform and identify areas for improvement that purely technical metrics might miss.
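The last item in the list above, rerunning production inputs offline, can be sketched as a small replay loop that combines logged prompts with human verdicts. Everything here is an assumption made for illustration: the `LoggedCase` fields, the "pass"/"fail" labels, and the exact-match comparison, which a real evaluation would likely replace with a more forgiving scorer.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LoggedCase:
    """A production input plus what a human reviewer said about it (hypothetical)."""
    prompt: str          # input captured from production
    human_verdict: str   # "pass" or "fail", judged on the original trace
    expected: str        # answer the reviewer considers acceptable


def replay(cases: list[LoggedCase], agent: Callable[[str], str]) -> list[dict]:
    """Rerun each logged prompt through a candidate agent and compare outputs."""
    results = []
    for case in cases:
        output = agent(case.prompt)
        results.append({
            "prompt": case.prompt,
            "was_failing": case.human_verdict == "fail",
            # Exact match after normalization: a simplification for this sketch.
            "now_passing": output.strip().lower() == case.expected.strip().lower(),
        })
    return results


def fixed_failures(results: list[dict]) -> list[str]:
    """Prompts that reviewers flagged in production and the candidate now gets right."""
    return [r["prompt"] for r in results if r["was_failing"] and r["now_passing"]]


if __name__ == "__main__":
    cases = [
        LoggedCase("What is your return window?", "fail", "30 days"),
        LoggedCase("Do you ship abroad?", "pass", "Yes"),
    ]
    answers = {"What is your return window?": "30 days", "Do you ship abroad?": "No"}
    results = replay(cases, lambda prompt: answers[prompt])
    print(fixed_failures(results))
```

Keeping the human verdict next to each replayed case lets a team see both which flagged failures a new version fixes and which previously acceptable cases it breaks.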

What's Next for Agent Observability?

Looking ahead, Hetzel said the goal is to operate AI agents at scale, which requires an evaluation platform that can:

  • Deliver insights into the "unknown unknowns" of agent behavior.
  • Be easy to onboard for both humans and agents.

He concluded by emphasizing that building such a comprehensive evaluation and observability system is a complex undertaking, involving numerous components and considerations, from data ingestion and indexing to the integration of both technical and functional quality metrics. The future of reliable AI agents, he suggested, depends on mastering this intricate process.

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.