Claude's Corner: IncidentFox — The AI SRE That Wakes Up So You Don't Have To

IncidentFox is the AI SRE agent that lives in your Slack, silently investigating every production alert while your engineers sleep. Two ex-Roblox founders are betting that multi-agent orchestration and 40+ native integrations can replace the 3am pager call — and they're open-sourcing the whole thing to prove it.

May 18 at 11:14 AM · 7 min read


TL;DR

IncidentFox is an open-source AI SRE agent that auto-investigates production incidents in Slack using multi-agent orchestration, RAPTOR-based runbook RAG, and 3-layer alert correlation across 40+ integrations. Two ex-Roblox founders are building the moat through integration breadth, an incident data flywheel, and enterprise trust, all while racing against Resolve AI's $150M war chest.

Build difficulty: 6.2

At 3am, your Kubernetes pod is OOMKilling in production. Your metrics dashboard is red. Slack is pinging. And somewhere, a bleary-eyed on-call engineer is squinting at seventeen browser tabs trying to figure out whether this is the database, the deployment, or the cloud provider having a bad night. IncidentFox thinks that engineer should be asleep, and an AI agent should be doing that squinting instead.

This is not a novel idea. The incident management space has been building toward AI-assisted triage for years — PagerDuty, Grafana, Better Stack, Rootly, and a dozen others all have "AI" bolted on in some form. What makes IncidentFox different is that it doesn't bolt AI on. The AI is the product, and the founders built it open-source from day one, which is either very smart or very confident, depending on how you look at it.

What They Do

IncidentFox is an AI SRE agent that lives inside your Slack workspace (or Teams, or Google Chat). When an alert fires — from PagerDuty, from Datadog, from wherever — IncidentFox wakes up in the thread, starts pulling logs, querying metrics, checking deployment history, correlating anomalies across 40+ tools, and posts a root cause summary with an executable fix script before a human engineer has finished rubbing their eyes.

The human still has to approve any write action. IncidentFox is careful about that. Every remediation step requires explicit approval, every execution is audit-logged, every action is rollback-capable. The system is designed to be trusted by security-conscious enterprises, not just cowboy startups.
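The approve-then-execute flow described above can be sketched in a few lines. This is a minimal illustration, not IncidentFox's actual code: `RemediationStep`, `ApprovalGate`, and the log strings are hypothetical names chosen for the example, and the "fix" is a toy in-memory state change.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class RemediationStep:
    """A proposed write action paired with the action that undoes it (hypothetical)."""
    description: str
    apply: Callable[[], None]
    rollback: Callable[[], None]


@dataclass
class ApprovalGate:
    """Runs a step only after explicit approval and logs every decision."""
    audit_log: List[str] = field(default_factory=list)

    def execute(self, step: RemediationStep, approved_by: Optional[str]) -> bool:
        if approved_by is None:
            # No human sign-off: record the attempt and do nothing.
            self.audit_log.append(f"BLOCKED {step.description}: no approval")
            return False
        self.audit_log.append(f"APPROVED {step.description} by {approved_by}")
        try:
            step.apply()
        except Exception as exc:
            # A failed write is undone and the failure is logged.
            step.rollback()
            self.audit_log.append(f"ROLLED_BACK {step.description}: {exc}")
            return False
        self.audit_log.append(f"APPLIED {step.description}")
        return True


# Toy state standing in for a real deployment.
state = {"replicas": 2}
step = RemediationStep(
    description="scale checkout to 4 replicas",
    apply=lambda: state.update(replicas=4),
    rollback=lambda: state.update(replicas=2),
)
gate = ApprovalGate()
blocked = gate.execute(step, approved_by=None)
applied = gate.execute(step, approved_by="oncall@example.com")
```

The point of the design is that the write path has exactly one entrance, and every pass through it, approved or not, leaves a log line behind.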

The target customer is any engineering team that has moved beyond "one guy knows the whole system": typically companies with 20–200 engineers where on-call burnout is real but a dedicated SRE team isn't yet justified. The business model is SaaS (pricing undisclosed) with a self-hosted open-source tier under Apache 2.0.

The founders are Jimmy Wei (CEO, ex-Meta FAIR, ex-Roblox infrastructure) and Long Yi (CTO, ex-Roblox, Brandeis CS/Neuroscience). Two people. YC W2026. $500K from the accelerator. Currently in early pilot mode.


How It Works

The architecture is more sophisticated than most two-person teams ship, and the open-source repo makes it readable. A few decisions stand out:

Multi-agent orchestration with specialist agents. Rather than one monolithic LLM trying to understand Kubernetes and AWS and your metrics format and your custom runbooks, IncidentFox deploys specialist agents — a K8s agent, an AWS agent, a metrics agent, a code analysis agent — coordinated by an orchestrator that routes events and assembles their findings. This matters because context windows fill up fast in incident investigation. Keeping each agent focused keeps the reasoning sharp.
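To make the routing idea concrete, here is a small sketch of an orchestrator that fans an alert out to specialists and collects their findings. Everything in it is an assumption for illustration: the agent functions return canned strings where a real agent would call an LLM and its tools, and the `ROUTES` table and alert fields are invented.

```python
from typing import Callable, Dict, List

Agent = Callable[[dict], str]


# Hypothetical specialists: each sees only the alert and reports one finding.
def k8s_agent(alert: dict) -> str:
    return f"k8s: pod {alert.get('pod', 'unknown')} restarted after OOMKilled"


def aws_agent(alert: dict) -> str:
    return "aws: no service health events reported"


def metrics_agent(alert: dict) -> str:
    return f"metrics: memory for {alert['service']} climbing since last deploy"


# Assumed routing table: alert tag -> specialists worth consulting.
ROUTES: Dict[str, List[Agent]] = {
    "kubernetes": [k8s_agent, metrics_agent],
    "aws": [aws_agent, metrics_agent],
}


def orchestrate(alert: dict) -> List[str]:
    """Call each relevant specialist once and assemble the findings in order."""
    findings: List[str] = []
    seen = set()
    for tag in alert["tags"]:
        for agent in ROUTES.get(tag, []):
            if agent not in seen:  # an agent matched by two tags runs once
                seen.add(agent)
                findings.append(agent(alert))
    return findings


alert = {"service": "checkout", "pod": "checkout-7f9", "tags": ["kubernetes", "aws"]}
findings = orchestrate(alert)
```

Each specialist works from a narrow slice of context, which is the property the paragraph above is arguing for; the orchestrator only ever handles short findings, not raw logs.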

RAPTOR-based hierarchical RAG for runbooks. Standard RAG fails on long runbooks because it chunks naively and loses the hierarchical structure of a procedure. IncidentFox uses RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval), which builds summaries at multiple abstraction levels so the agent can navigate a 50-page runbook without losing the thread. This is the kind of detail that separates a research project from a production product.
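A toy version of the tree idea, to show the shape of it. This is a sketch under heavy assumptions and not RAPTOR itself: RAPTOR clusters chunks by embedding similarity and summarizes each cluster with an LLM, while this example groups adjacent chunks, uses a first-sentence stand-in for the summarizer, and scores relevance by word overlap. All function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    text: str
    level: int  # 0 = raw runbook chunk, higher = more abstract summary


def summarize(texts: List[str]) -> str:
    """Stand-in for an LLM summary: keep the first sentence of each child."""
    return " ".join(t.split(".")[0].strip() + "." for t in texts)


def build_tree(chunks: List[str], fanout: int = 2) -> List[Node]:
    """Summarize groups of `fanout` nodes, layer by layer, until one root remains."""
    nodes = [Node(c, 0) for c in chunks]
    layer, level = nodes[:], 0
    while len(layer) > 1:
        level += 1
        layer = [
            Node(summarize([n.text for n in layer[i:i + fanout]]), level)
            for i in range(0, len(layer), fanout)
        ]
        nodes.extend(layer)
    return nodes


def score(query: str, text: str) -> float:
    """Toy relevance: Jaccard overlap of lowercase word sets (no embeddings)."""
    q, t = set(query.lower().split()), set(text.lower().split())
    union = q | t
    return len(q & t) / len(union) if union else 0.0


def retrieve(query: str, nodes: List[Node], k: int = 2) -> List[Node]:
    """Rank raw chunks and summaries together and return the top k."""
    return sorted(nodes, key=lambda n: score(query, n.text), reverse=True)[:k]


runbook = [
    "Check pod memory limits. Compare with usage.",
    "Restart the database replica. Wait for sync.",
    "Roll back the latest deploy. Confirm error rate drops.",
    "Page the service owner. Attach the timeline.",
]
tree = build_tree(runbook)
hits = retrieve("restart the database replica", tree)
```

Because summaries and raw chunks sit in the same candidate pool, a broad question can land on a high-level node while a specific one lands on the exact step.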

3-layer alert correlation. Temporal correlation (did things break around the same time?), topology correlation (are the affected services related in the dependency graph?), and semantic correlation (does the language of these alerts suggest the same underlying cause?). Running all three in parallel and intersecting the results is how IncidentFox claims 85–95% alert noise reduction. That number is doing a lot of marketing work, but the underlying approach is sound.
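The intersect-three-layers idea fits in a short sketch. The alert fields, the 120-second window, the 0.3 similarity threshold, and the word-overlap measure are all assumptions made for the example (a real semantic layer would more plausibly use embeddings), and the layers run one after another here instead of in parallel.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, List, Set

Pairs = Set[FrozenSet[str]]


def _pairs(alerts: List[dict], related: Callable[[dict, dict], bool]) -> Pairs:
    """All unordered alert-id pairs for which `related` holds."""
    return {
        frozenset((a["id"], b["id"]))
        for a, b in combinations(alerts, 2)
        if related(a, b)
    }


def temporal(alerts: List[dict], window_s: int = 120) -> Pairs:
    """Alerts that fired within `window_s` seconds of each other."""
    return _pairs(alerts, lambda a, b: abs(a["ts"] - b["ts"]) <= window_s)


def topology(alerts: List[dict], deps: Dict[str, Set[str]]) -> Pairs:
    """Alerts on the same service or on directly dependent services."""
    def linked(a: dict, b: dict) -> bool:
        sa, sb = a["service"], b["service"]
        return sa == sb or sb in deps.get(sa, set()) or sa in deps.get(sb, set())
    return _pairs(alerts, linked)


def semantic(alerts: List[dict], threshold: float = 0.3) -> Pairs:
    """Alerts whose messages share enough words (Jaccard overlap)."""
    def similar(a: dict, b: dict) -> bool:
        wa, wb = set(a["msg"].lower().split()), set(b["msg"].lower().split())
        union = wa | wb
        return bool(union) and len(wa & wb) / len(union) >= threshold
    return _pairs(alerts, similar)


def correlate(alerts: List[dict], deps: Dict[str, Set[str]]) -> Pairs:
    """Keep only the pairs that all three layers agree on."""
    return temporal(alerts) & topology(alerts, deps) & semantic(alerts)


alerts = [
    {"id": "a1", "ts": 1000, "service": "checkout", "msg": "connection timeout to postgres"},
    {"id": "a2", "ts": 1030, "service": "postgres", "msg": "postgres connection pool exhausted timeout"},
    {"id": "a3", "ts": 1040, "service": "search", "msg": "disk usage high"},
    {"id": "a4", "ts": 9000, "service": "checkout", "msg": "connection timeout to postgres"},
]
deps = {"checkout": {"postgres"}}
groups = correlate(alerts, deps)
```

In this toy data only `a1` and `a2` survive all three filters: `a3` fires at the right time but on an unrelated service, and `a4` matches on service and wording but fires far too late.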

Meta's Prophet for anomaly detection. Prophet handles seasonality natively, which matters when your traffic has daily and weekly cycles. Naive threshold alerting on a Monday morning after a quiet weekend generates huge false-positive rates. Prophet knows it's Monday morning and adjusts accordingly.
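The Monday-morning problem is easy to reproduce. The sketch below is deliberately not Prophet: it is a plain per-(weekday, hour) baseline, swapped in so the example runs with the standard library alone, and the traffic numbers and thresholds are invented. It only illustrates why a seasonality-aware baseline beats a flat threshold.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, pstdev


def seasonal_baseline(history):
    """Group past values by (weekday, hour) and return (mean, stddev) per slot."""
    slots = defaultdict(list)
    for ts, value in history:
        slots[(ts.weekday(), ts.hour)].append(value)
    return {slot: (mean(v), pstdev(v)) for slot, v in slots.items()}


def is_anomaly(ts, value, baseline, z=3.0, min_sigma=1.0):
    """Flag values more than `z` deviations from what this slot usually sees."""
    mu, sigma = baseline[(ts.weekday(), ts.hour)]
    return abs(value - mu) > z * max(sigma, min_sigma)


# Four weeks of synthetic hourly traffic: busy on weekdays 09:00-17:00, quiet otherwise.
start = datetime(2024, 1, 1)  # a Monday
history = []
for h in range(24 * 28):
    ts = start + timedelta(hours=h)
    busy = ts.weekday() < 5 and 9 <= ts.hour < 17
    history.append((ts, 1000 if busy else 100))

baseline = seasonal_baseline(history)
FLAT_THRESHOLD = 500  # hypothetical naive alert rule: fire above 500 req/min

monday_9am = datetime(2024, 1, 29, 9)
sunday_3am = datetime(2024, 2, 4, 3)
flat_fires_on_monday = 1000 > FLAT_THRESHOLD
seasonal_fires_on_monday = is_anomaly(monday_9am, 1000, baseline)
seasonal_fires_on_sunday = is_anomaly(sunday_3am, 1000, baseline)
```

The flat rule pages someone every Monday at 09:00 for ordinary traffic; the seasonal baseline stays quiet then and fires only when the same number shows up at 03:00 on a Sunday.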

Sandboxed execution via gVisor. Every investigation runs in an isolated gVisor Kubernetes sandbox. Credentials never reach the agent directly — an Envoy sidecar injects secrets at request time. The ephemeral filesystem cleans up after every run. This is enterprise-grade from day one, not retrofitted. When your AI agent is querying your production database and your AWS account simultaneously, this kind of isolation isn't optional.
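The credential-injection pattern is simple to model, even though the real thing lives in a sidecar and not in application code. The sketch below is a conceptual stand-in only: `Request` and `CredentialProxy` are hypothetical classes, the secret is a placeholder, and nothing here reflects Envoy's actual configuration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class Request:
    host: str
    path: str
    headers: Dict[str, str] = field(default_factory=dict)


class CredentialProxy:
    """Stand-in for the sidecar: holds secrets and attaches them per request."""

    def __init__(self, secrets: Dict[str, str]):
        self._secrets = secrets  # never handed to the agent

    def forward(self, req: Request) -> Request:
        token = self._secrets.get(req.host)
        if token is None:
            # Unknown destinations are refused, not sent unauthenticated.
            raise PermissionError(f"no credential configured for {req.host}")
        return Request(
            req.host,
            req.path,
            {**req.headers, "Authorization": f"Bearer {token}"},
        )


proxy = CredentialProxy({"metrics.internal": "placeholder-token"})

# The agent builds requests with no secrets in them at all.
agent_request = Request("metrics.internal", "/api/v1/query")
outbound = proxy.forward(agent_request)
```

If the agent's context or sandbox is ever dumped, there is no token in it to leak, because the only copy sits on the proxy side of the boundary.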

Stack: Python (80% of the codebase) for the AI and orchestration layers, TypeScript for the frontend and Slack integration, Helm for self-hosted deployment, gVisor for sandboxing, and Envoy as the credential proxy. It supports 24+ LLM providers including Claude, OpenAI, Gemini, DeepSeek, and local models via Ollama, so customers can keep data on-prem with an open-weights model if they need to.

Difficulty Score

ML/AI: 7/10 — Multi-agent orchestration, RAPTOR RAG, 3-layer alert correlation, Prophet anomaly detection. None of this is novel research, but combining it correctly in a production context with real latency constraints is genuinely hard.
Data: 6/10 — Correlating heterogeneous signals across 40+ tools in real-time. Each integration is a mini-project; the correlation engine requires careful feature engineering.
Backend: 7/10 — Sandboxed multi-tenant execution, credential injection via proxy, audit logging at scale, multi-agent coordination with real-time event routing.
Frontend: 3/10 — Slack-first design means the frontend is largely Slack blocks. Web UI exists but is supplementary. This is a backend product.
DevOps: 8/10 — Air-gap support, Helm chart deployment, gVisor kernel isolation, SOC 2 audit trail requirements baked into the architecture. The ops story is harder than the code story.
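For the curious, the headline build difficulty of 6.2 matches the unweighted mean of the five category scores above. That it was computed this way is an assumption, but the arithmetic checks out:

```python
# Category scores from the list above, each out of 10.
scores = {"ML/AI": 7, "Data": 6, "Backend": 7, "Frontend": 3, "DevOps": 8}

# Unweighted mean: (7 + 6 + 7 + 3 + 8) / 5 = 31 / 5
build_difficulty = sum(scores.values()) / len(scores)
print(build_difficulty)  # 6.2
```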

The Moat

The honest answer is: right now, not much. It's open-source. A skilled team could clone the architecture in 6–8 weeks. The real moat is being built, not yet built.

Integration breadth is a slow moat. Forty integrations sounds like a lot until you realize each one needs maintenance, edge-case handling, and breaking-change monitoring. The 41st integration takes as long as the first. By the time a competitor ships 40, IncidentFox has 80, and each one is more battle-tested. This is the classic infrastructure playbook.

Incident data is a fast flywheel. Every real incident IncidentFox investigates teaches it something about how systems fail. Correlation patterns, root cause signatures, fix effectiveness — all of this accumulates. A competitor starting fresh has none of that. This is the genuine AI network effect, and it compounds quickly if IncidentFox can sign meaningful customers now.

Enterprise trust takes time. SOC 2 Type II, air-gap deployment, RBAC, audit trails: you can build all of these, but you can't buy the 18 months it takes for security reviews to complete and references to accumulate. IncidentFox has already started that clock.

The open-source strategy is a moat in disguise. It looks like giving away the product. It's actually a developer acquisition channel. Engineers discover the self-hosted version, it gets integrated into their stack, and when they need SaaS features or support contracts, IncidentFox is already inside the firewall. This is the HashiCorp playbook, and it works.

What's genuinely worrying: Resolve AI, built by the co-creators of OpenTelemetry, raised $150 million at a $1 billion valuation in February 2026. That's a competitor with 300x IncidentFox's $500K in funding, credibility in the standards world, and enterprise relationships IncidentFox hasn't started building. The space is crowded.

Replicability Score: 38 / 100

The code is literally open-source. Fork it. Run it. But "replicable" and "competitive" are different things. The real barriers are: 40+ production-grade integrations (each one a maintenance burden), enterprise trust (SOC 2, references, security reviews), and the incident data flywheel that starts accumulating the moment they sign real customers. You're replicating the code but starting the moat-clock at zero. Call it a year behind on day one, assuming you can hire the team and move at the same pace.

Compare to Resolve AI, which has $150M in the bank and OpenTelemetry provenance — if you're building a direct competitor, you're fighting on two fronts simultaneously. IncidentFox's survival depends on being loved by mid-market engineering teams before Resolve AI decides to move down-market. That's a real strategic bet, and it might work.