The escalating cost of IT downtime, potentially thousands of dollars per minute, highlights a critical challenge for Site Reliability Engineers (SREs) who often face complex system anomalies in the dead of night. Martin Keen, a Master Inventor at IBM, recently illuminated a path forward, detailing how agentic AI can transform anomaly detection and resolution processes by moving beyond brute-force data analysis. He explained how this advanced approach significantly reduces Mean Time To Repair (MTTR) through intelligent context curation and automation.
The core issue with traditional incident response lies in the sheer volume of telemetry data generated by modern IT environments: metrics, events, logs, and traces (MELT). SREs are forced to manually sift through this "noisy data" to identify problems, pinpoint root causes, and devise resolutions. A common misconception, Keen argues, is that simply feeding all this raw data into a large language model (LLM) will magically yield answers.
"If you pipe that firehose straight into the large language model, and then ask it to come up with a cause, well, welcome to hallucination city," Keen states unequivocally. This is because LLMs, by design, rely on statistical patterns to predict plausible words, not to verify facts. Overloading them with unrelated noise inevitably leads to fabricated causal links and imaginative narratives that are utterly unhelpful in a crisis.
The real breakthrough, according to Keen, is "context curation." This strategic filtering of data is powered by "topology-aware correlation," where an observability platform maintains a real-time map of how all services connect and depend on each other. When an incident alert triggers, the AI agent pulls telemetry only from the components directly involved instead of sifting through irrelevant information. As Keen puts it, "It uses this dependency graph just to pull in the telemetry data only from the components that are actually involved."
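To make the idea concrete, here is a minimal Python sketch of topology-aware filtering. It is an illustration only, not how any particular observability platform works: the service names, the `DEPENDENCIES` map, the record layout, and the `max_hops` cutoff are all assumptions invented for this example.

```python
from collections import deque

# Hypothetical dependency map: service -> services it calls directly.
DEPENDENCIES = {
    "checkout": ["payments", "inventory"],
    "payments": ["postgres"],
    "inventory": ["postgres"],
    "search": ["elasticsearch"],
}


def involved_components(alerting_service, dependencies, max_hops=2):
    """Breadth-first walk from the alerting service through its downstream
    dependencies, stopping after max_hops edges (an assumed cutoff)."""
    seen = {alerting_service}
    queue = deque([(alerting_service, 0)])
    while queue:
        service, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for dep in dependencies.get(service, []):
            if dep not in seen:
                seen.add(dep)
                queue.append((dep, hops + 1))
    return seen


def curate_context(alerting_service, telemetry, dependencies, max_hops=2):
    """Keep only telemetry records emitted by components in scope."""
    scope = involved_components(alerting_service, dependencies, max_hops)
    return [record for record in telemetry if record["service"] in scope]


# Hypothetical telemetry records for illustration.
telemetry = [
    {"service": "payments", "msg": "upstream timeout"},
    {"service": "search", "msg": "slow query"},
    {"service": "postgres", "msg": "connection pool exhausted"},
]

for record in curate_context("checkout", telemetry, DEPENDENCIES):
    print(record["service"], "-", record["msg"])
```

An alert on `checkout` keeps the `payments` and `postgres` records and drops the unrelated `search` record, so only that smaller slice would be handed to the model.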
This curated context then feeds into the AI agent's perceive-reason-act-observe cycle. The agent perceives the incident, reasons to form a hypothesis about the probable root cause, systematically requests additional data to validate or refine this hypothesis, and ultimately pinpoints the most likely culprit. Crucially, this process is accompanied by "explainability," offering a transparent chain of thought and supporting evidence for human operators to review.
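The cycle described above can be sketched as a small loop. This is a toy under stated assumptions rather than Keen's or IBM's implementation: the `reason` function is a simple stand-in for the model call (it just picks the component with the highest error rate), and `fetch_details`, the `confirms` flag, and the data shapes are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Investigation:
    alert: dict
    evidence: list = field(default_factory=list)  # trail kept for explainability
    ruled_out: set = field(default_factory=set)


def reason(inv, context):
    """Stand-in for the LLM step: suspect the component with the highest
    error rate that has not been ruled out yet."""
    candidates = {s: m for s, m in context.items() if s not in inv.ruled_out}
    if not candidates:
        return None
    return max(candidates, key=lambda s: candidates[s]["error_rate"])


def investigate(alert, context, fetch_details, max_steps=5):
    inv = Investigation(alert=alert)                 # perceive the incident
    for _ in range(max_steps):
        hypothesis = reason(inv, context)            # reason: form a hypothesis
        if hypothesis is None:
            break
        details = fetch_details(hypothesis)          # act: request more data
        inv.evidence.append((hypothesis, details))   # observe: record the result
        if details.get("confirms"):
            return hypothesis, inv.evidence
        inv.ruled_out.add(hypothesis)                # refine and try again
    return None, inv.evidence


# Hypothetical curated context and data source for illustration.
context = {
    "payments": {"error_rate": 0.4},
    "postgres": {"error_rate": 0.1},
}


def fetch_details(service):
    # Pretend that deeper inspection confirms only the database.
    return {"confirms": service == "postgres", "note": f"inspected {service}"}


cause, evidence = investigate({"service": "checkout"}, context, fetch_details)
print("probable root cause:", cause)
for step, (hypothesis, details) in enumerate(evidence, start=1):
    print(step, hypothesis, details["note"])
```

The returned `evidence` list plays the role of the explainability trail: every hypothesis and the data gathered to test it is kept, so a human operator can review how the agent reached its conclusion.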
For resolution, agentic AI provides four key forms of assistance. It generates validation steps to confirm the identified root cause, produces a step-by-step runbook for remediation, builds automation workflows from suggested actions, and automatically documents the entire incident progress and post-incident review. This comprehensive support allows SREs to quickly follow scripts and execute remediation steps, even if they are not deeply familiar with a particular component.
The overarching benefit is a dramatic reduction in MTTR. "All of this leads to a substantial reduction in the all-important MTTR, that's Mean Time To Repair," Keen emphasizes. By augmenting human SREs with precise, context-aware analysis and actionable solutions, agentic AI reduces operational stress and mitigates the costly impact of unexpected system outages. Keen also stresses that people remain in charge: "These agents operate under human oversight, they're augmenting rather than replacing human decision-makers." The goal is more reliable, smarter IT operations.

