Microsoft Experts on Debugging Non-Deterministic AI Agents

Microsoft experts Tisha Chawla and Susheem Koul discuss the challenges of debugging AI agents in production and introduce strategies for ensuring replayability and observability.

Jun 29 at 1:02 AM · 8 min read

Image: presentation slide reading 'Your Agent Failed in Prod. Good Luck Reproducing It.' Tisha Chawla and Susheem Koul of Microsoft discuss reproducing AI agent failures. Credit: AI Engineer

Failures of AI agents in production are notoriously difficult to reproduce, which makes reliability hard to guarantee. In their presentation "Your Agent Failed in Prod. Good Luck Reproducing It.", Tisha Chawla and Susheem Koul of Microsoft take on this problem. They examine the underlying causes of these elusive bugs and offer practical debugging strategies, arguing for a shift in focus from absolute determinism to replayability and observability.

Visual TL;DR: AI agent failures trace back to sources of non-determinism, which the Chronicle approach addresses. Chronicle enables both replayability and observability, and both lead to productionizing AI agents. The sources of non-determinism also imply a shift in focus.

AI Agent Failures: difficult to reproduce in production environments
Sources of Non-Determinism: sampling vs. system determinism, hardware/software variations
Chronicle Approach: Microsoft's strategy for debugging AI agents
Ensuring Replayability: making agent behavior reproducible for analysis
Enhancing Observability: gaining insight into agent's internal state
Productionizing AI Agents: strategies for reliable deployment and maintenance
Shift Focus: from absolute determinism to replayability/observability

Understanding the Sources of Non-Determinism

Chawla and Koul highlight several factors that make AI agents behave unpredictably, particularly in production. The first is the distinction between sampling determinism and system determinism: setting a model's temperature to zero removes randomness from sampling, but the underlying hardware and software can still introduce variation. Floating-point addition is not associative, so a different order of operations can produce minute differences that cascade into a different argmax output. A lack of batch invariance has a similar effect: when requests are processed in batches, results can differ subtly depending on what else is in the batch. The presentation also covers MoE routing jitter: in Mixture-of-Experts models, routing decisions can depend on the batch, so the same input may take different pathways and produce different outputs.
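The floating-point point can be seen in a few lines of plain Python. The sketch below is illustrative only and is not taken from the talk: the values and the helper names `sum_in_order` and `argmax` are made up to show how the same partial sums, added in a different order, can flip which "token" wins.

```python
def sum_in_order(terms):
    """Add floats strictly left to right, like one fixed reduction order."""
    total = 0.0
    for t in terms:
        total += t
    return total


def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


# Associativity fails even for ordinary-looking decimals.
assoc_left = (0.1 + 0.2) + 0.3
assoc_right = 0.1 + (0.2 + 0.3)

# Hypothetical partial sums that make up one logit. At magnitude 1e16 the
# spacing between adjacent doubles is 2, so adding 1.0 first is lost.
logit_order_a = sum_in_order([1e16, 1.0, -1e16])   # the 1.0 is absorbed
logit_order_b = sum_in_order([1e16, -1e16, 1.0])   # the 1.0 survives

rival_logit = 0.5  # hypothetical logit of a competing token

token_a = argmax([logit_order_a, rival_logit])
token_b = argmax([logit_order_b, rival_logit])

print(assoc_left == assoc_right)   # False
print(token_a, token_b)            # 1 0
```

Real kernels differ by far smaller amounts than this exaggerated case, but the mechanism is the same: when two logits are close, a change in reduction order is enough to change the greedy choice.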

The core message is that striving for perfect, bitwise determinism through APIs is often a futile endeavor. Instead, the focus should be on achieving replayability. This means being able to reconstruct the exact sequence of events that led to a failure, even if the underlying computations are not perfectly identical each time.

The Chronicle Approach to Debugging AI Agents

To address these challenges, the presenters introduce Microsoft's Chronicle framework. Chronicle captures the full execution context of an agent at each critical juncture, referred to as 'boundaries.' This recording, or 'tracing,' makes it possible to reconstruct the agent's state and actions at any point during its run, because it captures what enters and leaves each node in the agent's workflow, not just the prompts or the final outputs.
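As a rough illustration of boundary capture, the sketch below wraps each node of a toy workflow in a decorator that records what goes in and what comes out. It is not Chronicle's API, which the article does not show: `boundary`, `TRACE`, the envelope fields, and the two example nodes are all hypothetical names chosen for this example.

```python
import functools
import json
import time

TRACE = []  # hypothetical in-memory stand-in for a trace store


def boundary(node_name):
    """Record the inputs and output (or error) of every call to a node."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            envelope = {
                "node": node_name,
                "inputs": {"args": list(args), "kwargs": kwargs},
                "started_at": time.time(),
                "output": None,
                "error": None,
            }
            try:
                result = fn(*args, **kwargs)
                envelope["output"] = result
                return result
            except Exception as exc:
                envelope["error"] = repr(exc)
                raise
            finally:
                # Runs on success and on failure, so failed calls are traced too.
                TRACE.append(envelope)
        return wrapper
    return decorator


@boundary("lookup_order")
def lookup_order(order_id):
    # Hypothetical tool call; a real one would hit a database or an API.
    return {"order_id": order_id, "status": "shipped"}


@boundary("draft_reply")
def draft_reply(order):
    # Hypothetical stand-in for a model call.
    return f"Order {order['order_id']} is {order['status']}."


reply = draft_reply(lookup_order("A-17"))
trace_json = json.dumps(TRACE, indent=2)  # envelopes are plain JSON-friendly data
print(trace_json)
```

Keeping the envelope as plain data is a deliberate choice in this sketch: it can be written to disk as-is and later loaded as a replay fixture.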

The presentation contrasts two fundamental approaches to checking agent behavior: deterministic checks and behavioral checks.

Deterministic checks, often implemented as guardrails, focus on control flow and verify that given inputs produce the expected outputs. The recorded context is frozen as a 'fixture' and the tool calls are replayed with the same inputs, asserting that each tool output matches what was observed during the initial run. Because the fixture supplies the recorded context, these checks can be rerun without external dependencies, which makes them well suited to keeping core functionality consistent.
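A deterministic check of this kind might look like the following sketch. The fixture layout, the two tools, and the `replay` helper are assumptions made for illustration; amounts are kept in integer cents so the comparison is exact.

```python
# Hypothetical frozen fixture: what each tool received and returned
# during the original run.
FIXTURE = [
    {"node": "parse_amount_cents", "inputs": {"text": "$1,250.50"}, "output": 125050},
    {"node": "apply_discount", "inputs": {"cents": 125050, "percent": 10}, "output": 112545},
]


def parse_amount_cents(text):
    """Parse a string like '$1,250.50' into integer cents."""
    dollars, cents = text.replace("$", "").replace(",", "").split(".")
    return int(dollars) * 100 + int(cents)


def apply_discount(cents, percent):
    """Apply a whole-number percentage discount, rounding down to a cent."""
    return cents * (100 - percent) // 100


def broken_apply_discount(cents, percent):
    """A deliberately wrong version, used to show a failing replay."""
    return cents * percent // 100


def replay(fixture, tools):
    """Re-run each recorded tool call and collect any output mismatches."""
    mismatches = []
    for step in fixture:
        actual = tools[step["node"]](**step["inputs"])
        if actual != step["output"]:
            mismatches.append((step["node"], step["output"], actual))
    return mismatches


good_tools = {"parse_amount_cents": parse_amount_cents, "apply_discount": apply_discount}
bad_tools = {"parse_amount_cents": parse_amount_cents, "apply_discount": broken_apply_discount}

clean_run = replay(FIXTURE, good_tools)
regressed_run = replay(FIXTURE, bad_tools)

print(clean_run)       # []
print(regressed_run)   # [('apply_discount', 112545, 12505)]
```

Because every input comes from the fixture, the check needs no network, database, or model call, and it can run in an ordinary test suite.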

Behavioral checks, on the other hand, are more concerned with the overall meaning and quality of the agent's output. These checks involve replaying the scenario, analyzing the prompt and wording, and scoring the output based on its semantic correctness. The meaning of the output is prioritized over the exact bytes. This approach is crucial for evaluating whether the agent is truly understanding and responding appropriately to user requests, even if the exact phrasing or internal state might differ slightly between runs.
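One way to picture a behavioral check is a rubric that scores what a reply says, not its exact bytes. The sketch below uses plain substring matching as a crude stand-in for semantic scoring; the talk does not specify a scorer, and the `behavioral_score` function, the rubric fields, and the sample replies are all hypothetical.

```python
def behavioral_score(reply, required_facts, forbidden_claims=()):
    """Score a reply by the fraction of required facts it mentions.

    Any forbidden claim forces the score to 0.0. Matching is
    case-insensitive substring search, a simplification of real
    semantic scoring.
    """
    text = reply.lower()
    if any(claim.lower() in text for claim in forbidden_claims):
        return 0.0
    if not required_facts:
        return 1.0
    hits = sum(1 for fact in required_facts if fact.lower() in text)
    return hits / len(required_facts)


# Hypothetical rubric for the question "Where is my order A-17?"
rubric = {"required_facts": ["A-17", "shipped"], "forbidden_claims": ["refund"]}

# Two replays worded differently, plus one that says the wrong thing.
run_1 = "Order A-17 is shipped."
run_2 = "Good news: your order (A-17) has shipped and is on its way."
run_3 = "Order A-17 was cancelled and a refund was issued."

score_1 = behavioral_score(run_1, **rubric)
score_2 = behavioral_score(run_2, **rubric)
score_3 = behavioral_score(run_3, **rubric)

print(score_1, score_2, score_3)   # 1.0 1.0 0.0
```

The first two replies differ byte for byte yet pass, while the third fails on meaning. A production scorer would likely be richer than keyword matching, but the pass criterion has the same shape.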

Key Takeaways for Productionizing AI Agents

Chawla and Koul summarize their insights into actionable advice for developers:

Stop chasing bitwise determinism through the API: Instead, focus on capturing and replaying the agent's execution context.
Pin every variable against the session: Ensure all relevant variables, such as model version, prompt details, and any retrieved data, are logged and associated with the specific session for accurate debugging.
Capture the full envelope at the boundary, not just the prompt: This means recording all inputs, outputs, and intermediate states of each agent component.
Replay to debug; fix the failure; keep the envelope as a test case: The captured traces serve as invaluable assets for reproducing bugs, diagnosing issues, and creating regression tests to prevent future occurrences.
Keep generation-time variation alive: While determinism is often pursued, some level of controlled variation can be beneficial. The key is to manage this variation through robust tracing and replay mechanisms, rather than trying to eliminate it entirely.
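The advice to pin every variable against the session can be sketched as a small manifest stored alongside each trace. Everything here is an assumption for illustration: the `SessionManifest` fields, the example values, and the idea of a fingerprint for grouping sessions that ran under the same configuration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SessionManifest:
    """Hypothetical record of everything that could change agent behavior."""
    session_id: str
    model_version: str
    prompt_template_sha256: str
    temperature: float
    retrieved_doc_ids: tuple
    tool_versions: tuple  # (name, version) pairs

    def fingerprint(self):
        """Hash of the pinned configuration, excluding the session id."""
        payload = asdict(self)
        payload.pop("session_id")
        canonical = json.dumps(payload, sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


prompt_hash = hashlib.sha256(b"You are a support agent. {question}").hexdigest()

failed = SessionManifest(
    session_id="sess-001",
    model_version="example-model-2025-01-01",  # placeholder, not a real model name
    prompt_template_sha256=prompt_hash,
    temperature=0.0,
    retrieved_doc_ids=("doc-12", "doc-98"),
    tool_versions=(("lookup_order", "1.4.0"),),
)

# Same configuration, different session: the fingerprints match.
same_config = SessionManifest(
    "sess-002", failed.model_version, prompt_hash, 0.0,
    failed.retrieved_doc_ids, failed.tool_versions,
)

# A silent model upgrade: the fingerprint changes.
upgraded = SessionManifest(
    "sess-003", "example-model-2025-02-01", prompt_hash, 0.0,
    failed.retrieved_doc_ids, failed.tool_versions,
)

print(failed.fingerprint() == same_config.fingerprint())   # True
print(failed.fingerprint() == upgraded.fingerprint())      # False
```

Comparing fingerprints between a failing session and a replay gives a quick signal that something other than the input changed between the two runs.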

By adopting these principles and leveraging tools like Chronicle, development teams can navigate the complexities of debugging AI agents, ensuring their reliable performance in production environments.

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.
