Microsoft Debugs AI Agents with AgentRx

Microsoft Research launches AgentRx, an open-source framework and benchmark for systematically debugging AI agent failures, improving accuracy by over 23%.

Mar 12 at 5:17 PM · 2 min read

[Figure: Diagram illustrating the AgentRx framework workflow for diagnosing AI agent failures.]

Debugging AI agents is becoming a significant hurdle as they move beyond simple chatbots to manage complex tasks. When these systems fail, tracing the root cause through long, probabilistic, and sometimes multi-agent interactions is an arduous manual process. Microsoft Research aims to solve this with the introduction of the AgentRx framework, an automated system designed to pinpoint the exact moment an agent's trajectory becomes unrecoverable.

Traditional metrics like task completion are insufficient for understanding agent failures. To build reliable and safe AI, developers need to identify the precise failure point and gather supporting evidence. This is crucial for advancing capabilities in areas like managing cloud incidents and navigating complex web interfaces, where manual approaches to debugging AI agents do not scale.

AgentRx: A Structured Diagnostic Approach

AgentRx treats agent execution as a traceable system. Instead of relying on a single large language model to guess errors, it employs a multi-stage pipeline. This process begins with normalizing logs from various domains into a common format.
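In sketch form, that first normalization stage might map heterogeneous log events onto a shared step schema. The `Step` fields and the raw-event key names below are illustrative assumptions, not AgentRx's actual format:

```python
from dataclasses import dataclass

@dataclass
class Step:
    index: int
    role: str      # e.g. "agent", "tool", "user"
    action: str    # tool name, or event type as a fallback
    payload: dict  # arguments or output attached to the step

def normalize(raw_events: list[dict]) -> list[Step]:
    """Map domain-specific log events onto a common step format."""
    steps = []
    for i, ev in enumerate(raw_events):
        steps.append(Step(
            index=i,
            role=ev.get("role", "agent"),
            action=ev.get("tool") or ev.get("type", "respond"),
            payload=ev.get("args") or ev.get("output") or {},
        ))
    return steps
```

A shared schema like this is what lets the later constraint-checking and judging stages run unchanged across API workflows, incident logs, and web tasks.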

The core innovation lies in constraint synthesis. AgentRx automatically generates executable constraints from tool schemas and domain policies, such as ensuring API responses are valid JSON or that sensitive data isn't deleted without confirmation. These constraints are then evaluated step-by-step, only when their guard conditions apply, producing an auditable log of evidence-backed violations.
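A minimal sketch of that guarded, step-wise evaluation, assuming a hypothetical `Constraint` type and using the article's valid-JSON example (none of these names come from AgentRx's actual API):

```python
import json
from dataclasses import dataclass
from typing import Callable

@dataclass
class Constraint:
    name: str
    guard: Callable[[dict], bool]   # does this constraint apply to the step?
    check: Callable[[dict], bool]   # does the step satisfy it?

def evaluate(trace: list[dict], constraints: list[Constraint]) -> list[dict]:
    """Check each constraint at each step, but only when its guard fires,
    recording an evidence-backed violation log."""
    violations = []
    for i, step in enumerate(trace):
        for c in constraints:
            if c.guard(step) and not c.check(step):
                violations.append({"step": i, "constraint": c.name, "evidence": step})
    return violations

def is_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except (TypeError, ValueError):
        return False

# Example synthesized constraint: API responses must parse as JSON.
valid_json_response = Constraint(
    name="api_response_is_valid_json",
    guard=lambda s: s.get("type") == "tool_response",
    check=lambda s: is_json(s.get("content", "")),
)
```

The guard is what keeps the log auditable: a constraint that never applied to a step produces no entry, so every recorded violation points at a concrete step and the evidence behind it.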

Finally, an LLM judge analyzes this validation log, guided by a grounded nine-category failure taxonomy, to identify the critical failure step and its root cause. This systematic approach identifies issues more precisely than existing methods for debugging autonomous systems.
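One plausible shape for the judge stage's input is below; the helper name is hypothetical, and since the article names only two of the nine taxonomy categories, the rest are left elided in the list:

```python
# Only two categories are named in the article; the other seven are omitted here.
TAXONOMY = [
    "Plan Adherence Failure",
    "Invention of New Information",
    # ... remaining categories of the nine-category taxonomy
]

def build_judge_input(violations: list[dict], taxonomy: list[str]) -> str:
    """Assemble what the judge sees: the closed set of failure categories
    plus the step-indexed violation log it must ground its verdict in."""
    lines = ["Failure categories:"]
    lines += [f"- {c}" for c in taxonomy]
    lines.append("Violations (step, constraint):")
    for v in violations:
        lines.append(f"- step {v['step']}: {v['constraint']}")
    lines.append("Identify the critical failure step and its root-cause category.")
    return "\n".join(lines)
```

Grounding the judge in a fixed taxonomy and an explicit violation log, rather than the raw trajectory alone, is what distinguishes this from asking a single model to guess at errors.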

A New Benchmark for Agent Failures

To evaluate AgentRx, Microsoft Research developed a benchmark comprising 115 manually annotated failed trajectories. These cover diverse domains including structured API workflows (τ-bench), real-world incident management (Flash), and multi-agent web tasks (Magentic-One).

The researchers also derived a nine-category failure taxonomy, categorizing issues like 'Plan Adherence Failure' and 'Invention of New Information' (hallucination). This taxonomy helps distinguish subtle differences in how agents go wrong, contributing to greater AI agent transparency.

Key Results and Open Source Release

In experiments, AgentRx demonstrated significant improvements over prompting baselines, achieving a +23.6% increase in failure localization accuracy and a +22.9% improvement in root-cause attribution. The framework and the benchmark dataset are now open-sourced to foster community development of more reliable AI agents.