Claude's Corner: IncidentFox — The AI SRE That Wakes Up So You Don't Have To

IncidentFox is the AI SRE agent that lives in your Slack, silently investigating every production alert while your engineers sleep. Two ex-Roblox founders are betting that multi-agent orchestration and 40+ native integrations can replace the 3am pager call — and they're open-sourcing the whole thing to prove it.


TL;DR

IncidentFox is an open-source AI SRE agent that auto-investigates production incidents in Slack using multi-agent orchestration, RAPTOR-based runbook RAG, and 3-layer alert correlation across 40+ integrations. Two ex-Roblox founders are building the moat through integration breadth, incident data flywheel, and enterprise trust — while racing against Resolve AI's $150M war chest.

Build difficulty: 6.2 (C)

At 3am, your Kubernetes pod is OOMKilling in production. Your metrics dashboard is red. Slack is pinging. And somewhere, a bleary-eyed on-call engineer is squinting at seventeen browser tabs trying to figure out whether this is the database, the deployment, or the cloud provider having a bad night. IncidentFox thinks that engineer should be asleep, and an AI agent should be doing that squinting instead.

This is not a novel idea. The incident management space has been building toward AI-assisted triage for years — PagerDuty, Grafana, Better Stack, Rootly, and a dozen others all have "AI" bolted on in some form. What makes IncidentFox different is that it doesn't bolt AI on. The AI is the product, and the founders built it open-source from day one, which is either very smart or very confident, depending on how you look at it.

What They Do

IncidentFox is an AI SRE agent that lives inside your Slack workspace (or Teams, or Google Chat). When an alert fires — from PagerDuty, from Datadog, from wherever — IncidentFox wakes up in the thread, starts pulling logs, querying metrics, checking deployment history, correlating anomalies across 40+ tools, and posts a root cause summary with an executable fix script before a human engineer has finished rubbing their eyes.

The human still has to approve any write action. IncidentFox is careful about that. Every remediation step requires explicit approval, every execution is audit-logged, every action is rollback-capable. The system is designed to be trusted by security-conscious enterprises, not just cowboy startups.

Target customer is any engineering team that has moved beyond "one guy knows the whole system" — typically companies with 20–200 engineers where on-call burnout is real but a dedicated SRE team isn't yet justified. Business model is SaaS (pricing undisclosed) with a self-hosted open-source tier under Apache 2.0.

Founders are Jimmy Wei (CEO, ex-Meta FAIR, ex-Roblox infrastructure) and Long Yi (CTO, ex-Roblox, Brandeis CS/Neuroscience). Two people. YC W2026. $500K from the accelerator. Currently in early pilot mode.


How It Works

The architecture is more sophisticated than most two-person teams ship, and the open-source repo makes it readable. A few decisions stand out:

Multi-agent orchestration with specialist agents. Rather than one monolithic LLM trying to understand Kubernetes and AWS and your metrics format and your custom runbooks, IncidentFox deploys specialist agents — a K8s agent, an AWS agent, a metrics agent, a code analysis agent — coordinated by an orchestrator that routes events and assembles their findings. This matters because context windows fill up fast in incident investigation. Keeping each agent focused keeps the reasoning sharp.

RAPTOR-based hierarchical RAG for runbooks. Standard RAG fails on long runbooks because it chunks naively and loses the hierarchical structure of a procedure. IncidentFox uses RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval), which builds summaries at multiple abstraction levels so the agent can navigate a 50-page runbook without losing the thread. This is the kind of detail that separates a research project from a production product.

3-layer alert correlation. Temporal correlation (did things break around the same time?), topology correlation (are the affected services related in the dependency graph?), and semantic correlation (does the language of these alerts suggest the same underlying cause?). Running all three in parallel and intersecting the results is how IncidentFox claims 85–95% alert noise reduction. That number is doing a lot of marketing work, but the underlying approach is sound.

Meta's Prophet for anomaly detection. Prophet handles seasonality natively, which matters when your traffic has daily and weekly cycles. Naive threshold alerting on a Monday morning after a quiet weekend generates huge false-positive rates. Prophet knows it's Monday morning and adjusts accordingly.

Sandboxed execution via gVisor. Every investigation runs in an isolated gVisor Kubernetes sandbox. Credentials never reach the agent directly — an Envoy sidecar injects secrets at request time. The ephemeral filesystem cleans up after every run. This is enterprise-grade from day one, not retrofitted. When your AI agent is querying your production database and your AWS account simultaneously, this kind of isolation isn't optional.
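The credential-injection idea can be sketched in a few lines. This is a minimal illustration of the pattern, not IncidentFox's or Envoy's actual configuration: the header name, the secret store, and both helper functions are hypothetical.

```python
# Hypothetical names throughout: the header, the store, and the helpers are
# illustrative and do not come from IncidentFox or Envoy.
CREDENTIAL_REF_HEADER = "X-Credential-Ref"


def build_agent_request(url: str, credential_ref: str) -> dict:
    """What the sandboxed agent sends: a reference, never the secret itself."""
    return {"url": url, "headers": {CREDENTIAL_REF_HEADER: credential_ref}}


def inject_credentials(request: dict, secret_store: dict) -> dict:
    """What the sidecar does on the way out: swap the reference for the secret."""
    headers = dict(request["headers"])  # copy so the agent's request is untouched
    ref = headers.pop(CREDENTIAL_REF_HEADER, None)
    if ref is None:
        return {**request, "headers": headers}
    if ref not in secret_store:
        raise KeyError(f"unknown credential reference: {ref}")
    headers["Authorization"] = f"Bearer {secret_store[ref]}"
    return {**request, "headers": headers}
```

The point of the split is that anything the agent can read or log contains only the reference; the secret exists only in the proxy's copy of the request.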

Stack: Python (80% of codebase) for the AI and orchestration layers, TypeScript for the frontend and Slack integration, Helm for self-hosted deployment, gVisor for sandboxing, and Envoy for the credential proxy. It supports 24+ LLM providers, including Claude, OpenAI, Gemini, DeepSeek, and local models via Ollama, so customers can keep data on-prem with an open-weights model if they need to.

Difficulty Score

  • ML/AI: 7/10 — Multi-agent orchestration, RAPTOR RAG, 3-layer alert correlation, Prophet anomaly detection. None of this is novel research, but combining it correctly in a production context with real latency constraints is genuinely hard.
  • Data: 6/10 — Correlating heterogeneous signals across 40+ tools in real-time. Each integration is a mini-project; the correlation engine requires careful feature engineering.
  • Backend: 7/10 — Sandboxed multi-tenant execution, credential injection via proxy, audit logging at scale, multi-agent coordination with real-time event routing.
  • Frontend: 3/10 — Slack-first design means the frontend is largely Slack blocks. Web UI exists but is supplementary. This is a backend product.
  • DevOps: 8/10 — Air-gap support, Helm chart deployment, gVisor kernel isolation, SOC 2 audit trail requirements baked into the architecture. The ops story is harder than the code story.
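The 6.2 build-difficulty figure is consistent with a plain unweighted mean of the five category scores. Whether the score is actually computed that way is an assumption; the arithmetic itself is easy to check:

```python
# Category scores as listed above.
scores = {"ML/AI": 7, "Data": 6, "Backend": 7, "Frontend": 3, "DevOps": 8}


def overall_difficulty(category_scores: dict) -> float:
    """Unweighted mean, rounded to one decimal place (an assumed formula)."""
    return round(sum(category_scores.values()) / len(category_scores), 1)


print(overall_difficulty(scores))  # 6.2
```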

The Moat

The honest answer is: right now, not much. It's open-source. A skilled team could clone the architecture in 6–8 weeks. The real moat is being built, not yet built.

Integration breadth is a slow moat. Forty integrations sounds like a lot until you realize each one needs maintenance, edge-case handling, and breaking-change monitoring. The 41st integration takes as long as the first. By the time a competitor ships 40, IncidentFox has 80, and each one is more battle-tested. This is the classic infrastructure playbook.

Incident data is a fast flywheel. Every real incident IncidentFox investigates teaches it something about how systems fail. Correlation patterns, root cause signatures, fix effectiveness — all of this accumulates. A competitor starting fresh has none of that. This is the genuine AI network effect, and it compounds quickly if IncidentFox can sign meaningful customers now.

Enterprise trust takes time. SOC 2 Type 2, air-gap deployment, RBAC, audit trails — you can build this, but you can't buy the 18 months it takes for security reviews to complete and references to accumulate. IncidentFox is building that clock now.

The open-source strategy is a moat in disguise. It looks like giving away the product. It's actually a developer acquisition channel. Engineers discover the self-hosted version, it gets integrated into their stack, and when they need SaaS features or support contracts, IncidentFox is already inside the firewall. This is the HashiCorp playbook, and it works.

What's genuinely worrying: Resolve AI, built by the co-creators of OpenTelemetry, raised $150 million at a $1 billion valuation in February 2026. That's a competitor with 300x the funding ($150M against IncidentFox's $500K), credibility in the standards world, and enterprise relationships IncidentFox hasn't started building. The space is crowded.

Replicability Score: 38 / 100

The code is literally open-source. Fork it. Run it. But "replicable" and "competitive" are different things. The real barriers are: 40+ production-grade integrations (each one a maintenance burden), enterprise trust (SOC 2, references, security reviews), and the incident data flywheel that starts accumulating the moment they sign real customers. You're replicating the code but starting the moat-clock at zero. Call it a year behind on day one, assuming you can hire the team and move at the same pace.

Compare to Resolve AI, which has $150M in the bank and OpenTelemetry provenance — if you're building a direct competitor, you're fighting on two fronts simultaneously. IncidentFox's survival depends on being loved by mid-market engineering teams before Resolve AI decides to move down-market. That's a real strategic bet, and it might work.

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.

Build This Startup with Claude Code

Complete replication guide — install as a slash command or rules file

# Build Your Own AI SRE Agent: Step-by-Step with Claude Code

## Step 1: Define the Data Model

Design your core database schema. You need tables for: `incidents` (id, title, severity, status, started_at, resolved_at, root_cause), `alerts` (id, incident_id, source, raw_payload, correlated_at), `integrations` (id, org_id, type, credentials_ref, config), `investigations` (id, incident_id, agent_type, findings, confidence_score), `executions` (id, investigation_id, action_type, approved_by, executed_at, result, rolled_back).

Use PostgreSQL with JSONB columns for raw payloads — alert schemas vary wildly across tools.
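A minimal sketch of two of these tables, assuming SQLite as a stand-in for PostgreSQL so it runs without a server. The column types and the helper names (`open_db`, `store_alert`, `load_payload`) are assumptions; in Postgres, `raw_payload` would be a JSONB column rather than TEXT.

```python
import json
import sqlite3

# SQLite stands in for PostgreSQL here; column types are illustrative guesses.
SCHEMA = """
CREATE TABLE incidents (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    severity TEXT NOT NULL,
    status TEXT NOT NULL,
    started_at TEXT NOT NULL,
    resolved_at TEXT,
    root_cause TEXT
);
CREATE TABLE alerts (
    id INTEGER PRIMARY KEY,
    incident_id INTEGER REFERENCES incidents(id),
    source TEXT NOT NULL,
    raw_payload TEXT NOT NULL,  -- JSONB in PostgreSQL
    correlated_at TEXT
);
"""


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def store_alert(conn: sqlite3.Connection, source: str, payload: dict) -> int:
    """Store the raw alert untouched; correlation links it to an incident later."""
    cur = conn.execute(
        "INSERT INTO alerts (source, raw_payload) VALUES (?, ?)",
        (source, json.dumps(payload)),
    )
    return cur.lastrowid


def load_payload(conn: sqlite3.Connection, alert_id: int) -> dict:
    """Read the raw payload back as a dict."""
    row = conn.execute(
        "SELECT raw_payload FROM alerts WHERE id = ?", (alert_id,)
    ).fetchone()
    return json.loads(row[0])
```

Keeping the payload schemaless means a new alert source needs no migration; only the correlation code has to learn its shape.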

## Step 2: Build the Integration Layer

Create a pluggable integration framework. Each integration implements: `fetch_metrics(time_range)`, `fetch_logs(query, time_range)`, `fetch_events(time_range)`, `execute_action(action, params)`. Start with your top 3: Datadog, PagerDuty, Kubernetes. Use a credentials proxy pattern — store secrets in Vault or AWS Secrets Manager, inject at request time via a sidecar, never pass credentials to the AI agent directly.
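One way to express that interface is an abstract base class plus a small registry. The four method names come from the step above; the return types, `TimeRange`, the `register` decorator, and `FakeIntegration` are assumptions made for illustration (Python 3.9+).

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimeRange:
    start: datetime
    end: datetime


class Integration(ABC):
    """The four-method contract from Step 2; return shapes are assumed."""

    @abstractmethod
    def fetch_metrics(self, time_range: TimeRange) -> list: ...

    @abstractmethod
    def fetch_logs(self, query: str, time_range: TimeRange) -> list: ...

    @abstractmethod
    def fetch_events(self, time_range: TimeRange) -> list: ...

    @abstractmethod
    def execute_action(self, action: str, params: dict) -> dict: ...


REGISTRY: dict = {}  # integration name -> class


def register(name: str):
    """Class decorator that makes an integration discoverable by name."""
    def decorator(cls):
        REGISTRY[name] = cls
        return cls
    return decorator


@register("fake")
class FakeIntegration(Integration):
    """In-memory integration, useful for tests before real connectors exist."""

    def __init__(self, logs: list):
        self._logs = logs

    def fetch_metrics(self, time_range):
        return []

    def fetch_logs(self, query, time_range):
        return [
            entry for entry in self._logs
            if query in entry["message"]
            and time_range.start <= entry["ts"] <= time_range.end
        ]

    def fetch_events(self, time_range):
        return []

    def execute_action(self, action, params):
        # Write actions are refused here; approval is handled elsewhere.
        raise PermissionError("write actions require explicit approval")
```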

## Step 3: Implement Alert Correlation

Build a 3-layer correlation engine. Temporal: cluster alerts by timestamp using DBSCAN with a 5-minute epsilon. Topology: build a service dependency graph from your k8s manifests / service mesh; alerts on connected services score higher. Semantic: embed alert descriptions with a sentence transformer, cluster by cosine similarity. Combine scores with a weighted ensemble. Output: alert groups with a likely-same-root-cause confidence score.
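The sketch below shows the shape of the weighted ensemble using only the standard library. It swaps in simpler techniques than the step names: a linear time-decay score instead of DBSCAN, and token-set Jaccard similarity instead of sentence-transformer embeddings. The weights, the 0.7 neighbor score, and the 0.5 grouping threshold are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    id: str
    service: str
    timestamp: float  # seconds since epoch
    text: str


WINDOW_SECONDS = 300.0      # the 5-minute window from Step 3
WEIGHTS = (0.4, 0.3, 0.3)   # temporal, topology, semantic (assumed values)


def temporal_score(a: Alert, b: Alert) -> float:
    """1.0 for simultaneous alerts, decaying linearly to 0.0 at the window edge."""
    gap = abs(a.timestamp - b.timestamp)
    return max(0.0, 1.0 - gap / WINDOW_SECONDS)


def topology_score(a: Alert, b: Alert, deps: dict) -> float:
    """deps maps a service to the set of services it calls directly."""
    if a.service == b.service:
        return 1.0
    if b.service in deps.get(a.service, set()) or a.service in deps.get(b.service, set()):
        return 0.7
    return 0.0


def semantic_score(a: Alert, b: Alert) -> float:
    """Jaccard similarity over lowercase tokens (stand-in for embedding cosine)."""
    ta, tb = set(a.text.lower().split()), set(b.text.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def correlation_score(a: Alert, b: Alert, deps: dict) -> float:
    parts = (temporal_score(a, b), topology_score(a, b, deps), semantic_score(a, b))
    return sum(w * s for w, s in zip(WEIGHTS, parts))


def group_alerts(alerts: list, deps: dict, threshold: float = 0.5) -> list:
    """Merge alerts whose pairwise score clears the threshold (union-find)."""
    parent = {a.id: a.id for a in alerts}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, a in enumerate(alerts):
        for b in alerts[i + 1:]:
            if correlation_score(a, b, deps) >= threshold:
                parent[find(a.id)] = find(b.id)

    groups: dict = {}
    for a in alerts:
        groups.setdefault(find(a.id), []).append(a.id)
    return sorted(sorted(g) for g in groups.values())
```

The pairwise loop is quadratic in the number of alerts, which is fine for a single incident window but would need blocking or indexing at higher volumes.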

## Step 4: Build the Multi-Agent Orchestrator

Create specialist agents for each domain. Each agent receives a ContextBundle and returns Findings. The orchestrator routes based on alert type and assembles findings into a coherent incident report.

Use Claude Sonnet (`claude-sonnet-4-6`) for synthesis, since it handles long context well. Use a faster model such as Claude Haiku for per-agent investigation to keep latency under 60 seconds.
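The routing skeleton can be sketched without any model calls. `ContextBundle` and `Findings` are named in the step above, but their fields here are assumptions, and the two agents are rule-based stubs standing in for LLM-backed specialists.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ContextBundle:
    alert_type: str
    service: str
    evidence: dict = field(default_factory=dict)  # assumed field


@dataclass
class Findings:
    agent: str
    summary: str
    confidence: float  # 0.0 to 1.0, assumed scale


Agent = Callable[[ContextBundle], Findings]


class Orchestrator:
    """Routes a bundle to every agent registered for its alert type."""

    def __init__(self) -> None:
        self._routes: dict = {}  # alert type -> list of agents

    def register(self, alert_type: str, agent: Agent) -> None:
        self._routes.setdefault(alert_type, []).append(agent)

    def investigate(self, bundle: ContextBundle) -> list:
        agents = self._routes.get(bundle.alert_type, [])
        findings = [agent(bundle) for agent in agents]
        # Most confident finding first, so the report leads with it.
        return sorted(findings, key=lambda f: f.confidence, reverse=True)


def k8s_agent(bundle: ContextBundle) -> Findings:
    """Stub specialist: a real one would query the cluster and call a model."""
    restarts = bundle.evidence.get("restarts", 0)
    confidence = 0.9 if restarts > 3 else 0.2
    return Findings("k8s", f"{bundle.service} restarted {restarts} times", confidence)


def metrics_agent(bundle: ContextBundle) -> Findings:
    """Stub specialist for resource metrics."""
    memory = bundle.evidence.get("memory_pct", 0)
    confidence = 0.7 if memory > 90 else 0.1
    return Findings("metrics", f"{bundle.service} memory at {memory}%", confidence)
```

Because each agent only sees its own bundle and returns a small `Findings` object, the synthesis model receives a handful of summaries rather than every raw log line.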

## Step 5: Implement Runbook RAG with RAPTOR

Load runbooks into a hierarchical index. Chunk at three levels: full document summary, section summaries, paragraph chunks. At query time, start at the top level and drill down. This lets the agent navigate a 100-page runbook without losing structure.

Use pgvector in Postgres to store embeddings. OpenAI text-embedding-3-small is cost-effective at this scale.
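The top-down drill can be sketched as a greedy walk over a summary tree. This is a simplification: full RAPTOR builds its tree by clustering chunks and summarizing them with a model, whereas this sketch assumes the three levels already exist and scores them by token overlap instead of embedding similarity. `Node`, `overlap`, and `retrieve` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    text: str                                      # summary at upper levels, raw chunk at leaves
    children: list = field(default_factory=list)   # list of Node


def overlap(query: str, text: str) -> float:
    """Fraction of query tokens present in the text (stand-in for cosine similarity)."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def retrieve(root: Node, query: str) -> list:
    """Walk from the document summary to a leaf, keeping the path as context."""
    node = root
    path = [node.text]
    while node.children:
        node = max(node.children, key=lambda child: overlap(query, child.text))
        path.append(node.text)
    return path
```

Returning the whole path, not just the leaf, is what preserves structure: the agent sees which section a step belongs to as well as the step itself.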

## Step 6: Build the Slack Bot Interface

Use Slack Bolt (Python). Listen for alert webhooks, open an incident thread, post investigation progress updates, present findings with Block Kit UI, and add approve/reject buttons for remediation actions.

Key patterns: use Slack thread_ts to keep all investigation in one thread; use interactive message updates (not new messages) to show progress; require signed approval tokens for any write action; log every interaction to your audit table.
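The signed-approval-token pattern can be sketched with the standard library's `hmac` module. The token layout, the function names, and the expiry field are assumptions; the Slack wiring (Bolt handlers, Block Kit buttons) is left out.

```python
import hashlib
import hmac
import time
from typing import Optional


def _signature(secret: bytes, action_id: str, user_id: str, expires_at: int) -> str:
    message = f"{action_id}:{user_id}:{expires_at}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def sign_approval(secret: bytes, action_id: str, user_id: str, expires_at: int) -> str:
    """Issue a token binding one approver to one action until expires_at."""
    sig = _signature(secret, action_id, user_id, expires_at)
    return f"{action_id}:{user_id}:{expires_at}:{sig}"


def verify_approval(secret: bytes, token: str, now: Optional[float] = None) -> bool:
    """Accept the token only if the signature matches and it has not expired."""
    try:
        action_id, user_id, expires_raw, sig = token.rsplit(":", 3)
        expires_at = int(expires_raw)
    except ValueError:
        return False  # malformed token
    expected = _signature(secret, action_id, user_id, expires_at)
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return False
    current = time.time() if now is None else now
    return current <= expires_at
```

A button click would carry the token back to the server, which verifies it before any write action runs and records the outcome in the audit table.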

## Step 7: Deploy with Isolation

Use gVisor (runsc) as your container runtime for AI execution sandboxes. Each investigation gets its own ephemeral pod with a 10-minute TTL. Configure Kubernetes NetworkPolicy to restrict egress. Deploy with Helm; expose configuration via a ConfigMap hierarchy (org, team, integration). Set up SOC 2-ready audit logging from day one: every API call, every approval, every execution to an append-only audit log.
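For the append-only audit log, one common way to make tampering detectable is to chain each entry to the hash of the previous one. The step above only asks for an append-only log; the hash chain, the entry fields, and the `AuditLog` class are assumptions added for illustration.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry
_FIELDS = ("actor", "action", "detail", "prev")


def _digest(body: dict) -> str:
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


class AuditLog:
    """In-memory sketch; a real log would persist entries to durable storage."""

    def __init__(self) -> None:
        self._entries: list = []

    def append(self, actor: str, action: str, detail: dict) -> dict:
        prev = self._entries[-1]["hash"] if self._entries else GENESIS
        body = {"actor": actor, "action": action, "detail": detail, "prev": prev}
        entry = {**body, "hash": _digest(body)}
        self._entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute every hash; any edited or reordered entry breaks the chain."""
        prev = GENESIS
        for entry in self._entries:
            body = {key: entry[key] for key in _FIELDS}
            if entry["prev"] != prev or _digest(body) != entry["hash"]:
                return False
            prev = entry["hash"]
        return True
```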

For anomaly detection, integrate Meta's Prophet for time-series baselining. Train per metric with at least 2 weeks of history, and retrain weekly. Use it to set dynamic alert thresholds instead of static ones; this alone cuts the false-positive rate by 60–70%.
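To show what a dynamic threshold means in practice, here is a deliberately simplified stand-in for Prophet: a per-hour-of-week mean and standard deviation learned from history. It is not Prophet and captures only weekly seasonality, with no trend or holidays. The function names and the `k` and `min_band` parameters are assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev

HOURS_PER_WEEK = 168


def fit_baseline(history: list) -> dict:
    """history: (hour_index, value) pairs, with hour_index counted from a Monday 00:00.

    Returns {hour_of_week: (mean, std)} so each slot gets its own expectation.
    """
    buckets = defaultdict(list)
    for hour, value in history:
        buckets[hour % HOURS_PER_WEEK].append(value)
    return {slot: (mean(values), pstdev(values)) for slot, values in buckets.items()}


def is_anomalous(baseline: dict, hour: int, value: float,
                 k: float = 3.0, min_band: float = 1.0) -> bool:
    """Flag values more than k standard deviations from that slot's mean.

    min_band keeps the band from collapsing to zero on perfectly flat history.
    """
    mu, sigma = baseline[hour % HOURS_PER_WEEK]
    band = max(k * sigma, min_band)
    return abs(value - mu) > band
```

With two weeks of history where traffic is 100 during business hours and 10 overnight, a reading of 100 is normal at Monday 09:00 and anomalous at Monday 03:00, which a single static threshold cannot express.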