Claude's Corner: Corelayer, The AI On-Call Engineer Goldman Sachs Taught Them to Build

Corelayer (YC W2026) is building an AI on-call engineer for finance, healthcare, and insurance, detecting silent data quality failures that traditional APM tools miss entirely. Here's the architecture, the moat, and how hard it is to clone.

TL;DR

Corelayer builds an AI on-call engineer that catches silent data quality failures (bad values, missing rows, unexpected NULLs) that traditional observability tools miss entirely. Founded by Goldman Sachs veterans, it targets regulated industries like finance and healthcare with on-prem deployment and SOC 2 compliance baked in from day one.

Build difficulty: 6.6 (C)

Corelayer: The AI On-Call Engineer Goldman Sachs Taught Them to Build

Every engineering team has a dirty secret: your monitoring is lying to you. Datadog tells you the p99 latency is fine. PagerDuty stays quiet. But somewhere in a Kafka topic, 40,000 rows have a NULL in a column that should never be NULL, and by the time a human notices, three downstream services have ingested garbage and your fintech client's settlement batch is wrong. Corelayer is betting that the next generation of observability isn't about better dashboards; it's about an AI agent that actually debugs the production environment the way a senior engineer would: by looking at the data, not just the metrics.

The Goldman Sachs Origin Story (That Actually Checks Out)

Most YC pitches include a line about "coming from [prestigious place] where we saw the problem firsthand." Most are decoration. Mitch Radhuber and Shipra Jha's is not.

The two built data infrastructure together at Goldman Sachs: systems processing hundreds of billions of rows daily across tightly regulated pipelines. They didn't just see the on-call pain; they lived it. The kind of 2 AM incident where the alert fires because a downstream job failed, but the actual root cause is a bad join three hops upstream that introduced duplicate trades six hours earlier.

Radhuber brings CS from UMich and astrophysics research from Princeton: the pattern-finding instincts of a scientist applied to production noise. Jha adds cloud infrastructure experience from Oracle and a CS degree from CMU. The three-person team is in SF, the legal entity is Sevvy AI Inc., and they're in YC's Winter 2026 batch.

The founding thesis is sharp: the biggest category of production failure in regulated industries isn't infrastructure going down. It's bad data flowing through systems that are technically "healthy" by every metric your observability stack measures.

What Corelayer Actually Does

Corelayer is an AI-native production support platform, but calling it "AI for observability" undersells the actual bet they're making. Traditional APM tools (Datadog, New Relic, Grafana) monitor infrastructure: latency, error rates, CPU, memory. They're good at catching fires. They're blind to slow poison.

The slow poison is data quality. Incorrect values. Missing rows. Unexpected duplicates. A payment processor's database starts writing $0.00 instead of actual amounts for a specific transaction type. The service is technically fine. The data is catastrophically wrong. No alert fires. A human eventually notices. Chaos ensues.

Corelayer's core system, which they call the Production Cortex, integrates across four layers simultaneously:

  • Code repositories: what changed recently, who changed it
  • Databases: the actual data, not just query performance
  • Deployments: what rolled out and when
  • Observability tools: logs, traces, metrics from your existing stack (Datadog, etc.)

When something goes wrong, the agent doesn't fire an alert and wait. It begins debugging. It traces anomalies backwards through the pipeline, correlates with infrastructure events and recent deploys, and surfaces a root cause hypothesis with evidence within minutes, not hours. Customers claim it has caught what Ridery's CTO calls "Heisenbugs": intermittent failures that disappear when you look directly at them, the kind that drive engineers slowly insane.

The other thing worth flagging: Corelayer learns. Engineers can feed back corrections and confirmations, and the system continuously improves its pattern matching for each customer's specific production environment. It's not a static rules engine pretending to be AI. It builds a model of what "normal" looks like for your system specifically.

Who They're Selling To (And Why Those Customers Are Hard)

Finance, healthcare, insurance. The three industries most likely to say "we can't send production data to a vendor's cloud." Corelayer knew this and built the compliance story upfront: SOC 2 Type II, on-premises deployment support, confidential compute environments, BYOK (bring your own key), zero data retention by default, and full audit trails with citations.

Their customer list is already interesting: Finzly (payments), Broadridge (financial services infrastructure), Ridery, Rilla, Pump, Moda, Hyperspell, Ressio. They claim over a million production error events handled and hundreds of millions of traces processed. For a 3-person team that's been at this for less than a year, that's real traction.

The sales motion for regulated enterprises is a grind, but the wedge is compelling. You don't have to replace Datadog. Corelayer sits on top of your existing observability stack and adds the data-quality layer plus AI reasoning. That's a much easier conversation than "rip out your monitoring and use us instead."

How It Works Under the Hood

The architecture Corelayer is building is essentially a multi-agent system layered over a context graph. The rough components:

Context Graph: A continuously maintained graph of production relationships: which services talk to which tables, which jobs depend on which Kafka topics, which deployments correlated with which incident patterns. This is the system's "memory" of your production environment and is what makes root cause analysis fast instead of O(everything).
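To make the "fast instead of O(everything)" point concrete, here is a minimal sketch of walking a dependency graph upstream from the node where an anomaly surfaced. The `UPSTREAM` edges, the node names, and the `upstream_candidates` helper are all hypothetical illustrations, not Corelayer's actual data model.

```python
from collections import deque

# Hypothetical edges: each key reads from the nodes in its list.
UPSTREAM = {
    "settlement_batch": ["trades_enriched"],
    "trades_enriched": ["trades_raw", "accounts"],
    "trades_raw": ["kafka:trades"],
    "accounts": [],
    "kafka:trades": [],
}


def upstream_candidates(graph, start, max_depth=3):
    """Breadth-first walk upstream from `start`.

    Returns (node, hops) pairs, nearest first, so an investigation only
    touches nodes that can actually feed the anomalous one.
    """
    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the hop budget
        for parent in graph.get(node, []):
            if parent not in seen:
                seen.add(parent)
                found.append((parent, depth + 1))
                queue.append((parent, depth + 1))
    return found


candidates = upstream_candidates(UPSTREAM, "settlement_batch")
```

Capping the walk with `max_depth` is one plausible way to bound the search; a real system would presumably weight edges by how recently they changed.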

Sub-agent swarm for noise filtering: Rather than surfacing every anomaly, sub-agents evaluate business impact before escalation. Not every NULL matters. Not every duplicate is critical. The system tries to understand the business context ("this NULL is in a settlement amount column" vs. "this NULL is in an optional metadata field") before waking anyone up.

Causal reasoning layer: This is where it gets interesting. The agent doesn't just correlate; it reasons about causality. Recent deploy + anomaly in a downstream service + a change in a specific database column = probable root cause hypothesis with a ranked list of evidence. This is a hard AI problem and likely where most of their secret sauce lives.
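As a rough illustration of what a "ranked list of evidence" could look like at its simplest, the sketch below scores deploys that landed shortly before an anomaly, favouring recent deploys on nearby upstream services. The scoring formula, the 48-hour window, and every name here are assumptions made for the example; this is a naive heuristic, not the causal reasoning the article describes.

```python
from datetime import datetime, timedelta


def rank_deploys(anomaly_at, deploys, upstream_hops, window_hours=48):
    """Rank deploys as root-cause candidates, best first.

    deploys: dicts with "service", "deployed_at" (datetime), "commit_sha".
    upstream_hops: service -> hops from the anomalous service (0 = itself).
    """
    window = timedelta(hours=window_hours)
    ranked = []
    for d in deploys:
        gap = anomaly_at - d["deployed_at"]
        if gap < timedelta(0) or gap > window:
            continue  # deployed after the anomaly, or too long before it
        if d["service"] not in upstream_hops:
            continue  # not connected to the anomalous service
        recency = 1.0 - gap / window           # 1.0 = deployed just now
        proximity = 1.0 / (1 + upstream_hops[d["service"]])
        ranked.append((recency * proximity, d["commit_sha"]))
    return sorted(ranked, reverse=True)


anomaly_at = datetime(2026, 1, 10, 12, 0)
deploys = [
    {"service": "pipeline", "deployed_at": anomaly_at - timedelta(hours=6), "commit_sha": "a1"},
    {"service": "api", "deployed_at": anomaly_at - timedelta(hours=40), "commit_sha": "b2"},
    {"service": "api", "deployed_at": anomaly_at + timedelta(hours=1), "commit_sha": "c3"},
    {"service": "billing", "deployed_at": anomaly_at - timedelta(hours=1), "commit_sha": "d4"},
]
ranking = rank_deploys(anomaly_at, deploys, {"api": 0, "pipeline": 1})
```

Temporal precedence plus graph proximity is only a filter for plausible suspects; establishing actual causality needs the evidence-gathering loop sketched in the build guide.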

Feedback loop: Engineer corrections feed back into the model. This is both a moat-builder and a necessary evil: without it, the signal-to-noise ratio would degrade as false positives accumulate. With it, the system gets better the more it's used.

Integrations: Datadog, Kafka, Slack, Microsoft Teams, CLI, MCP (Model Context Protocol). The last one is smart: MCP means Corelayer can be queried by other AI coding agents as a production-context tool, which positions them well as the "source of truth" layer in agentic engineering workflows.

Difficulty Score

| Category | Score | Why |
|---|---|---|
| ML / AI | 7/10 | Causal reasoning across production graphs is genuinely hard. LLM-based root cause analysis is still an unsolved problem at scale. The feedback loop tuning requires real ML craft. |
| Data | 8/10 | Ingesting, normalizing, and correlating telemetry from logs, traces, metrics, AND database contents at production scale is brutal. Schema diversity alone is a nightmare. |
| Backend | 7/10 | Multi-tenant, multi-cloud + on-prem, multi-agent orchestration with latency guarantees for incident response. Not a weekend project. |
| Frontend | 4/10 | Dashboard UI, timeline views, incident investigation UX. Table stakes for B2B SaaS. No technical moat here. |
| DevOps | 7/10 | On-prem deployment support, confidential compute, SOC 2 compliance, RBAC, SCIM, audit trails. Compliance infrastructure is expensive and slow to build. |

The Moat: What's Hard to Replicate

The data moat (growing): Every production environment Corelayer connects to teaches it more about what failure patterns look like across different tech stacks and industries. The more incidents they've seen, the better their causal reasoning models get. This compounds over time in a way that's genuinely hard to catch up to.

Regulated industry trust (slow to build): Getting a bank or an insurance company to connect you to production data, even on-prem, requires trust earned over months of security reviews, compliance calls, and proof-of-concept periods. First movers in regulated spaces get to renewals while newcomers are still in security reviews.

Engineer feedback loops (sticky): The system learns from your engineers specifically. After six months of Corelayer learning your team's production environment, switching costs are real, not because it's hard to migrate, but because you'd be throwing away a trained model of your specific system.

What's easy to replicate: The basic architecture (LLM + observability integrations + anomaly detection) is well-understood. Several companies are attacking adjacent problems (Incident.io, FireHydrant, Rootly, PagerDuty's AI features). None of them are doing the data-quality angle with the depth Corelayer is, but they have distribution advantages. The biggest risk isn't a clone; it's Datadog or Splunk shipping a "data quality AI" feature in their existing platform.

Replicability Score: 48 / 100

A strong backend team with observability domain knowledge could clone the surface layer (integrations, alert routing, basic anomaly detection) in a few months. The hard parts are the causal reasoning quality (which requires both ML research investment and training data), the compliance infrastructure for regulated industries, and the customer trust to even connect to production data in finance or healthcare. The feedback loop and context graph depth take years to mature. Not a 100x moat, but not a weekend wrapper either. A well-resourced team with $3-5M could build a credible v1; closing a Broadridge is a different problem entirely.

The Bull and Bear Cases

Bull: Production data quality is genuinely underserved by every existing observability tool. The regulated industry wedge is defensible. Goldman Sachs pedigree gets them in the door at financial institutions. The MCP integration positions them correctly for the agentic AI wave: when AI coding agents are shipping code to production 100x faster, you need an AI agent watching the results. They're early and right.

Bear: This is a heavily enterprise sales motion for a 3-person team. Deal cycles will be long. Datadog has 1,000 sales reps and can ship a competitive feature. The causal reasoning AI is genuinely hard to make reliable enough that enterprises trust it with production systems; the cost of a false negative (missed incident) or a false positive (alert fatigue) is high. And "AI SRE" is a crowded narrative even if the specific technical approach is differentiated.

The founders know what they're building. They've sat in the on-call rotation that Corelayer is trying to replace. That matters more than it sounds.

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.

Build This Startup with Claude Code

Complete replication guide: install as a slash command or rules file

# Build a Corelayer Clone with Claude Code

## Step 1: Define Your Data Model and Context Graph Schema

Design a PostgreSQL schema to model your production context graph. Core tables:

```sql
CREATE TABLE services (id UUID PRIMARY KEY, name TEXT, repo_url TEXT, deployment_platform TEXT);
CREATE TABLE service_dependencies (from_service UUID REFERENCES services(id), to_service UUID REFERENCES services(id), dependency_type TEXT);
CREATE TABLE deployments (id UUID PRIMARY KEY, service_id UUID REFERENCES services(id), deployed_at TIMESTAMPTZ, commit_sha TEXT, diff_summary TEXT);
CREATE TABLE incidents (id UUID PRIMARY KEY, detected_at TIMESTAMPTZ, resolved_at TIMESTAMPTZ, root_cause TEXT, affected_services UUID[], severity TEXT);
CREATE TABLE anomalies (id UUID PRIMARY KEY, service_id UUID REFERENCES services(id), detected_at TIMESTAMPTZ, anomaly_type TEXT, raw_evidence JSONB, status TEXT DEFAULT 'open');
CREATE TABLE engineer_feedback (id UUID PRIMARY KEY, anomaly_id UUID REFERENCES anomalies(id), verdict TEXT, notes TEXT, created_at TIMESTAMPTZ DEFAULT NOW());
```

Use pgvector for embedding anomaly descriptions to enable semantic similarity search across historical incidents.

## Step 2: Build the Telemetry Ingestion Pipeline

Create ingestion workers for three signal types:

Infrastructure signals: Poll Datadog, CloudWatch, or Prometheus APIs every 30 seconds. Normalize into a unified metrics table: (service_id, metric_name, value, timestamp).

Log streaming: Subscribe to log aggregators (Loki, CloudWatch Logs, Datadog Logs) via webhooks or polling. Parse structured JSON logs; for unstructured logs, use an LLM to extract error type, stack trace, and affected component.

Database snapshots: Connect with read-only credentials. Run periodic row-count and value-distribution queries on critical tables. Flag deviations from a rolling baseline (NULL rates, value range violations, duplicate keys).
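A minimal sketch of the snapshot check, using an in-memory SQLite database as a stand-in for a customer's read-only connection. The `column_profile` and `deviations` helpers, the `payments` table, and the tolerance value are hypothetical; a production version would target the customer's actual engine and sample large tables rather than scan them.

```python
import sqlite3


def column_profile(conn, table, column):
    """Row count, NULL rate, and duplicate count for one column.

    `table` and `column` are interpolated into SQL, so they must come
    from a trusted allowlist, never from user input.
    """
    total, nulls, distinct = conn.execute(
        f'SELECT COUNT(*), COUNT(*) - COUNT("{column}"), '
        f'COUNT(DISTINCT "{column}") FROM "{table}"'
    ).fetchone()
    non_null = total - nulls
    return {
        "rows": total,
        "null_rate": nulls / total if total else 0.0,
        "duplicates": non_null - distinct,
    }


def deviations(profile, baseline, null_tolerance=0.02):
    """Names of the checks where `profile` is worse than `baseline`."""
    flagged = []
    if profile["null_rate"] - baseline["null_rate"] > null_tolerance:
        flagged.append("null_rate")
    if profile["duplicates"] > baseline["duplicates"]:
        flagged.append("duplicates")
    return flagged


# Stand-in data: one NULL amount and one duplicated id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO payments VALUES (?, ?)",
    [(1, 10.0), (2, None), (3, 5.0), (3, 7.5)],
)
amount_profile = column_profile(conn, "payments", "amount")
id_profile = column_profile(conn, "payments", "id")
```

Only the aggregate profile leaves the query, which fits the data-minimization goal described later in the guide.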

Implement a Kafka consumer pattern for customers using Kafka: consume from their topics and apply a schema-aware anomaly detector per topic.

## Step 3: Implement the Anomaly Detection Layer

Build a two-stage detector:

Stage 1: Statistical baseline (fast, cheap). For each metric series, maintain a rolling 7-day EWMA and standard deviation. Flag anything beyond 3 sigma. For data quality, compute per-column statistics (null rate, unique rate, min/max/mean) and alert on threshold deviation.
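One way Stage 1 could look in code is sketched below: an exponentially weighted mean and variance per metric series, with a 3-sigma test against the baseline as it stood before the new point. The class name, the `alpha` and `warmup` defaults, and the choice to keep flagged points out of the baseline are assumptions; mapping `alpha` to a 7-day window depends on your sampling interval.

```python
import math


class EwmaDetector:
    """Streaming EWMA mean/variance with a k-sigma anomaly test."""

    def __init__(self, alpha=0.1, sigmas=3.0, warmup=10):
        self.alpha = alpha      # smoothing factor; smaller = longer memory
        self.sigmas = sigmas    # flag points beyond this many std devs
        self.warmup = warmup    # points to observe before flagging anything
        self.mean = None
        self.var = 0.0
        self.n = 0

    def update(self, x):
        """Return True if `x` is anomalous relative to the current baseline."""
        if self.mean is None:
            self.mean = x
            self.n = 1
            return False
        std = math.sqrt(self.var)
        anomalous = (
            self.n >= self.warmup
            and std > 0
            and abs(x - self.mean) > self.sigmas * std
        )
        if not anomalous:
            # Fold normal points into the baseline; flagged points are
            # skipped so a single spike does not widen the band.
            diff = x - self.mean
            incr = self.alpha * diff
            self.mean += incr
            self.var = (1 - self.alpha) * (self.var + diff * incr)
        self.n += 1
        return anomalous


detector = EwmaDetector()
flags = [detector.update(v) for v in [10.0, 10.2] * 10]
spike_flagged = detector.update(50.0)
```

Excluding flagged points keeps the band tight but means a genuine level shift will keep alerting until someone resets the baseline; that trade-off is worth deciding per metric.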

Stage 2: LLM triage (slower, expensive). For each flagged anomaly, construct a context prompt with the raw anomaly, recent deployment history (last 5 deploys), and top 3 similar historical incidents via pgvector similarity search. Use Claude (claude-sonnet-4-6) with prompt caching on the service context to keep costs manageable.
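The prompt-assembly half of Stage 2 can be sketched without any model call. The `build_triage_context` function and the dict keys it reads are hypothetical; the similar incidents are assumed to arrive already ranked from the pgvector query, and the resulting string would be passed to whatever LLM client you use.

```python
def build_triage_context(anomaly, deployments, similar_incidents,
                         max_deploys=5, max_incidents=3):
    """Assemble the triage prompt context as plain text.

    deployments: dicts with "deployed_at" (ISO 8601 string), "commit_sha",
    "diff_summary". similar_incidents: dicts with "root_cause", best first.
    """
    # ISO 8601 timestamps in one timezone sort correctly as strings.
    recent = sorted(deployments, key=lambda d: d["deployed_at"], reverse=True)
    lines = [
        "## Anomaly",
        f"{anomaly['anomaly_type']} on {anomaly['service']}: {anomaly['summary']}",
        "",
        "## Recent deployments",
    ]
    lines += [
        f"- {d['deployed_at']} {d['commit_sha']}: {d['diff_summary']}"
        for d in recent[:max_deploys]
    ]
    lines += ["", "## Similar historical incidents"]
    lines += [f"- {i['root_cause']}" for i in similar_incidents[:max_incidents]]
    return "\n".join(lines)


example_context = build_triage_context(
    {"anomaly_type": "null_rate", "service": "settlements",
     "summary": "amount NULL rate rose from 0% to 4%"},
    [{"deployed_at": f"2026-01-0{day}T00:00:00Z", "commit_sha": f"sha{day}",
      "diff_summary": "example change"} for day in range(1, 8)],
    [{"root_cause": f"cause{n}"} for n in range(4)],
)
```

Keeping the stable service description in a separate, cacheable prefix and appending this per-anomaly context after it is what makes prompt caching pay off.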

## Step 4: Build the Causal Reasoning Agent

When Stage 2 confidence exceeds 6, spawn a causal reasoning agent. This is the core product differentiation.

Agent MCP tools: get_recent_deployments(service_id, hours=48), query_database_sample(connection_id, table, column, where_clause), get_dependency_graph(service_id, depth=2), get_historical_incidents(embedding, limit=5), get_log_slice(service_id, start_time, end_time, error_pattern).

The agent runs a ReAct loop: observe anomaly, reason about likely cause, use a tool to gather evidence, update hypothesis, repeat until confident or max 8 steps. Output a structured root cause report with evidence citations.
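The control flow of that loop can be sketched independently of any model. Below, `decide` stands in for the LLM call and is scripted so the example runs offline; `run_agent`, the `Step` and `Report` types, the 0.8 confidence cutoff, and the stub tool are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    tool: str
    args: dict
    observation: object


@dataclass
class Report:
    hypothesis: str
    confidence: float
    evidence: list = field(default_factory=list)


def run_agent(anomaly, tools, decide, max_steps=8, confident_at=0.8):
    """Reason/act loop: ask `decide` for the next move until confident.

    decide(anomaly, history) returns a dict with "hypothesis" and
    "confidence", plus "tool" and "args" when it wants more evidence.
    """
    history = []
    decision = {"hypothesis": "unknown", "confidence": 0.0}
    for _ in range(max_steps):
        decision = decide(anomaly, history)
        if decision["confidence"] >= confident_at or "tool" not in decision:
            break
        args = decision.get("args", {})
        observation = tools[decision["tool"]](**args)
        history.append(Step(decision["tool"], args, observation))
    return Report(decision["hypothesis"], decision["confidence"], history)


def scripted_decide(anomaly, history):
    """Stand-in for the LLM: look at deploys once, then commit."""
    if not history:
        return {"hypothesis": "recent deploy", "confidence": 0.4,
                "tool": "get_recent_deployments",
                "args": {"service_id": anomaly["service_id"], "hours": 48}}
    sha = history[-1].observation[0]["commit_sha"]
    return {"hypothesis": f"deploy {sha} introduced the change", "confidence": 0.9}


TOOLS = {
    # Stub tool; a real one would query the deployments table.
    "get_recent_deployments": lambda service_id, hours=48: [
        {"service_id": service_id, "commit_sha": "abc123"}
    ],
}
report = run_agent({"service_id": "svc-1"}, TOOLS, scripted_decide)
```

Because every tool call lands in `history`, the final report carries its own evidence trail, which is what the citation requirement asks for.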

## Step 5: Build the Noise Filtering and Escalation Layer

Build a business-context evaluator that considers: anomaly description, affected field criticality tier, time of day, and recent false positive rate for the service. Feed engineer_feedback verdicts back as training signal to update per-service false positive rates and fine-tune a small classifier to pre-filter before hitting the expensive LLM triage.
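Before any classifier is trained, the feedback signal can start as a simple per-service counter. The sketch below assumes verdicts are the strings `"false_positive"` or anything else, and that field criticality comes in three tiers; the `FalsePositiveTracker` class, the tier weights, and the 0.4 threshold are hypothetical placeholders to tune.

```python
from collections import defaultdict


class FalsePositiveTracker:
    """Per-service false positive rate from engineer_feedback verdicts."""

    def __init__(self):
        self.counts = defaultdict(lambda: {"false_positive": 0, "total": 0})

    def record(self, service_id, verdict):
        entry = self.counts[service_id]
        entry["total"] += 1
        if verdict == "false_positive":
            entry["false_positive"] += 1

    def rate(self, service_id):
        entry = self.counts[service_id]
        # Add-one smoothing: a service with no feedback starts at 0.5
        # instead of looking perfectly reliable or perfectly noisy.
        return (entry["false_positive"] + 1) / (entry["total"] + 2)


# Hypothetical criticality tiers for the affected field.
TIER_WEIGHT = {"critical": 1.0, "important": 0.6, "optional": 0.2}


def should_escalate(tier, fp_rate, threshold=0.4):
    """Escalate when field criticality outweighs the service's noise."""
    return TIER_WEIGHT[tier] * (1.0 - fp_rate) >= threshold


tracker = FalsePositiveTracker()
for verdict in ["false_positive", "false_positive", "false_positive", "confirmed"]:
    tracker.record("svc-noisy", verdict)
```

Time of day and the anomaly description from the paragraph above are left out here; they would enter as extra features once the small classifier replaces this rule.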

## Step 6: Build the API, Dashboard, and Integrations

REST API (FastAPI or Express): POST /incidents, GET /incidents/{id}/analysis, POST /incidents/{id}/feedback, GET /services/{id}/context-graph, POST /webhooks/datadog.

MCP server: expose as an MCP tool so Claude Code and other AI coding agents can query production context: get_active_incidents(), get_service_health(service_name), get_recent_anomalies(service_name, hours=24).

Slack integration: post incident summaries with root cause hypothesis, confidence score, and one-click feedback button.

Dashboard (Next.js + shadcn/ui): incident timeline, context graph visualization with React Flow, anomaly feed, engineer feedback interface, service health heatmap.

## Step 7: Deploy with Compliance in Mind

On-prem packaging: containerize everything with Docker Compose or Helm. Your agent must run in the customer network. Ship a CLI installer for configuration and secret management.

Data minimization: never log raw production data. Work with statistical summaries and hashed identifiers. Raw data access must be ephemeral.
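A small sketch of what "statistical summaries and hashed identifiers" might mean in practice: identifiers are replaced with keyed HMAC-SHA256 digests, and values are reduced to aggregates before anything is stored. The `pseudonymize` and `summarize` helpers are hypothetical, and the key shown is a placeholder; a real key would live in the customer's secret manager.

```python
import hashlib
import hmac


def pseudonymize(value, key):
    """Keyed hash of an identifier: stable for joins, not reversible
    without the key (unlike a plain unsalted hash of a small ID space)."""
    return hmac.new(key, str(value).encode("utf-8"), hashlib.sha256).hexdigest()


def summarize(values):
    """Aggregate a column sample so raw values never reach the logs."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "null_rate": (len(values) - len(present)) / len(values) if values else 0.0,
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }


key = b"placeholder-key-load-from-secret-manager"
token = pseudonymize("account-42", key)
summary = summarize([10.0, None, 5.0, 0.0])
```

Using one key per customer keeps pseudonyms consistent within an environment while preventing correlation across environments.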

Audit logging: log every LLM call, database query, and escalation decision: timestamp, trigger, inputs, output, reviewing engineer. Append-only audit table.
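The append-only property can be enforced in the database rather than by convention. The sketch below uses SQLite triggers purely so it runs anywhere; the table layout, the `record` helper, and the column names are assumptions, and on PostgreSQL the equivalent would more likely be revoked UPDATE/DELETE privileges or a trigger function.

```python
import json
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE audit_log (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ts TEXT NOT NULL,
    trigger_event TEXT NOT NULL,
    inputs TEXT NOT NULL,
    output TEXT NOT NULL,
    reviewer TEXT
);
CREATE TRIGGER audit_no_update BEFORE UPDATE ON audit_log
BEGIN SELECT RAISE(ABORT, 'audit_log is append-only'); END;
CREATE TRIGGER audit_no_delete BEFORE DELETE ON audit_log
BEGIN SELECT RAISE(ABORT, 'audit_log is append-only'); END;
"""


def record(conn, trigger_event, inputs, output, reviewer=None):
    """Append one audit entry; inputs and output are stored as JSON."""
    conn.execute(
        "INSERT INTO audit_log (ts, trigger_event, inputs, output, reviewer) "
        "VALUES (?, ?, ?, ?, ?)",
        (
            datetime.now(timezone.utc).isoformat(),
            trigger_event,
            json.dumps(inputs),
            json.dumps(output),
            reviewer,
        ),
    )
    conn.commit()


audit_conn = sqlite3.connect(":memory:")
audit_conn.executescript(SCHEMA)
record(audit_conn, "llm_triage", {"anomaly_id": "a-1"}, {"confidence": 0.7})
```

Since rows cannot be edited, a later engineer review would be written as a new entry referencing the earlier one instead of filling in `reviewer` after the fact.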

SOC 2 prep: infrastructure-as-code, MFA enforcement, VPC flow logs, secrets rotation via Vault or AWS Secrets Manager. Budget 6-9 months and $15-30K for a Type II audit through Vanta or Drata.

For confidential compute: Azure Confidential Computing or AWS Nitro Enclaves let agents process sensitive data inside hardware-backed enclaves that even the vendor cannot access.