Claude's Corner: Velum Labs — The Data Quality OS Nobody Asked For (But Everyone Needs)

Velum Labs (YC W2026) builds a semantic control plane for data quality — learning what your metrics mean from query traffic, auto-generating data contracts, and tracing definition divergences to their root cause. Here's how it works and how hard it is to replicate.


“Why don’t these numbers match?” is the most expensive sentence in any data-driven company. Revenue is $4.2M on the finance dashboard and $3.9M in the product BI tool. Churn is 2.1% or 3.4% depending on which analyst you ask. Your ML model trained on one definition of “active user” is being evaluated against a slightly different one that someone quietly changed six weeks ago without a migration.

The data team spends half their week in Slack threads they didn’t start. Executives stop trusting dashboards. Everyone builds their own version of the truth in Google Sheets. Sound familiar?

Velum Labs — founded by two ex-quantum-computing researchers who appear to have decided that bad data is a harder problem than bad qubits — is building the infrastructure layer that prevents this. Their pitch: stop writing YAML data contracts by hand, and let the system learn what your data means from how your teams actually query it.

It’s the right problem. The question is whether they can execute on the hardest part of it.

What They Do

Velum describes itself as “the OS for data quality across any stack.” That’s marketing-speak for a semantic control plane: a layer that sits between your data and your teams, observes how data flows and gets queried, infers semantic definitions, and enforces them.

The target customer is a data engineer or data team lead at a company with a moderately complex stack — a cloud warehouse (Snowflake, BigQuery, Databricks), a transformation layer (dbt), a handful of BI tools (Looker, Tableau), and at least five teams all querying the same tables with slightly different assumptions baked into their SQL.

The business model is B2B SaaS. Early focus on regulated industries — fintech, healthcare — where “our data was wrong” is not just embarrassing but potentially a compliance violation. They’re already monitoring 200+ tables in production at financial institutions managing $1B+ in assets. For a two-person team in early access, that’s meaningful traction.

How It Works

Velum operates on a four-stage lifecycle: Detect → Diagnose → Fix → Prevent.

Detection is the entry point. Velum observes query traffic patterns, monitors value distributions, and tracks schema changes continuously. It flags anomalies: numeric drift beyond expected bounds, columns going null, silent row drops, duplicate key violations. This part isn’t technically exotic — it’s careful instrumentation and statistical monitoring. The difference is that it requires zero manual setup; Velum derives the baseline from observed production behavior rather than from spec sheets you wrote during onboarding and never updated.

Root Cause Analysis is where it gets interesting. Velum builds a live dependency graph from production query traffic. When “revenue” disagrees between two dashboards, the system traces the divergence upstream through the graph to pinpoint exactly where the definitions split — whether that’s a transform, a join condition, a filter predicate, or a metric definition in a BI tool. Traditional investigation means spending days manually tracing through pipelines and interrogating four different teams about what their SQL actually means. Velum does it in seconds.

Building this graph correctly is the hard part. Query traffic is noisy. The same logical table gets referenced by different aliases across different tools. CTEs nest five levels deep. JOIN conditions encode business logic that nobody documented. Extracting semantic meaning from raw SQL across fifteen-plus dialects is a parsing and inference problem of real complexity, and getting it right requires handling an enormous variety of real-world query patterns.

Fix and Deploy closes the loop. Once Velum identifies a definition divergence, it proposes a unified semantic definition, generates the appropriate dbt migration or SQL transformation, and opens a pull request into your existing CI/CD workflow. You review it before it ships. The system is opinionated about what the correct definition should be, but nothing deploys without human sign-off.

Prevention is the compounding piece. Every real production problem that gets fixed generates an enforceable data contract — a machine-readable rule that catches this same class of break automatically in the future. Contracts aren’t authored by hand from requirements docs that are six months out of date; they’re derived from the actual failures your stack has experienced. The system gets smarter the longer you run it.

The integration matrix is aggressive for a two-person team: Snowflake, BigQuery, Databricks, Redshift, and Postgres on the warehouse side; Kafka, Fivetran, and Airbyte for ingestion; Airflow, Dagster, and Prefect for orchestration; dbt, SQLMesh, and Spark for transforms; Looker, Tableau, Metabase, and Superset for BI. That’s fifteen-plus integrations to maintain. Ambitious or overextended — probably both, and which one it is depends entirely on how fast they can close enterprise deals.

The Founders

Benjamin Muñoz (CEO) has a physics and mathematics background from Stanford, followed by five years building reinforcement learning methods for quantum computing at Harvard and the Max Planck Institute. Alen Rubilar-Muñoz (CTO) is a mathematician with experience in geometric deep learning and analog computing for AI.

These are not typical data infrastructure founders. Most data quality tooling was built by database engineers or enterprise SaaS veterans who came up through Palantir or Databricks. Coming at the problem from ML and physics means thinking about semantic inference as a learning system problem rather than a rules engine problem. That distinction matters: a rules engine catches the breaks you anticipated; a learning system can catch the ones you didn’t.

Difficulty Score

| Dimension | Score | Why |
| --- | --- | --- |
| ML / AI | 6 / 10 | Semantic inference and anomaly detection use ML, but the core IP is the dependency graph — more algorithmic than learned |
| Data | 9 / 10 | This IS the data product. Integration breadth and lineage correctness are the entire game |
| Backend | 7 / 10 | Query interceptors, distributed graph storage, multi-tenant reliability, real-time processing at scale across cloud warehouses |
| Frontend | 4 / 10 | Lineage visualization and alerting dashboards matter for adoption, but they’re not the moat |
| DevOps | 7 / 10 | 15+ integrations, git-native deployment, multi-cloud, enterprise security and audit requirements |

The Moat

What’s easy to replicate: Basic data quality checks. Great Expectations has done null/schema/distribution validation for years. dbt tests do column-level assertions natively. Monte Carlo does statistical anomaly detection. You can wire together a reasonable data quality layer from existing open-source tools in a weekend and cover sixty percent of the obvious failure modes.

What’s hard to replicate: The semantic layer.

The genuinely novel piece here isn’t catching nulls — it’s learning that “active users” means different things to marketing, product, and the ML team, and then enforcing a single canonical definition automatically across every tool in the stack. That requires parsing SQL from fifteen-plus dialects and normalizing it into a semantic representation; inferring business logic from query patterns rather than schema shapes; detecting semantic drift when the definition changes, not just the data values; and generating migration code that is actually correct, safe to deploy, and idiomatic to the target tool.

No major competitor does this well. Monte Carlo raised $200M+ and still mostly does schema monitoring and statistical alerts. Soda Core is Great Expectations with better ergonomics. dbt tests require you to write them yourself — and to know what to test for. Velum’s bet is that nobody actually wants to write YAML contracts; they want contracts that write themselves from production reality.

The compounding advantage is the lineage graph. The longer Velum runs in your stack, the deeper and more accurate the dependency graph becomes. Switching costs compound as the graph deepens — an accurate lineage graph that took two years of production traffic to build is not something you can import from a competitor. This is the right kind of moat: one that gets better with time and usage, not just with additional engineering investment.

Regulated industry positioning is strategically smart. A fintech handling $1B in assets that relies on Velum for compliance-grade data accuracy does not swap it out for a twenty-percent-cheaper alternative. Compliance requirements create some of the strongest lock-in that exists in enterprise software.

Replicability Score: 52 / 100

The space is not empty. Monte Carlo, Soda, DataFold, Great Expectations, and dbt native tests are real products with real customers and real engineering teams. You could build the basic Velum feature set — schema monitoring, statistical alerts, lineage visualization — in three to six months with a competent team.

The semantic inference layer — learning intent from query patterns, not just data shapes — is the hard part. That takes production data to get right, iteration cycles to tune, and time to build the trust required for a financial institution to let you auto-generate SQL migrations. A weekend project cannot replicate that. A three-person team with real data access could get there in a year or two.

The integration matrix alone represents six to twelve months of engineering work. The live dependency graph built from production query traffic — not manual declaration — is genuinely novel in its approach. The founders’ ML background suggests the semantic layer will evolve in ways that traditional database-engineer-founded competitors are unlikely to anticipate.

What keeps this below 70: the core problem is well-understood, competitors are funded and entrenched, and enterprise data tooling has a long history of better-mousetrap startups that failed because existing integration in the data stack is genuinely difficult to displace, even when the new tool is objectively better.

Velum’s real shot: land early in greenfield AI infrastructure buildouts at regulated companies, compound the lineage graph before anyone else builds a comparable one, and make the semantic layer genuinely irreplaceable. That’s a plausible path. The problem gets worse every quarter as AI data complexity explodes. That tailwind is real, and it is theirs to ride.

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.

Build This Startup with Claude Code

Complete replication guide — install as a slash command or rules file

# Build Guide: Data Quality Semantic Control Plane (Velum Labs Clone)

A 7-step guide to building a production-grade data quality OS using Claude Code.

## Step 1: Query Traffic Interceptor

**Goal:** Capture all SQL queries flowing through your data stack without modifying existing pipelines.

**DB Schema:**
```sql
CREATE TABLE query_events (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  captured_at TIMESTAMPTZ DEFAULT NOW(),
  source_tool TEXT NOT NULL,
  database_name TEXT,
  schema_name TEXT,
  raw_sql TEXT NOT NULL,
  normalized_sql TEXT,
  tables_referenced TEXT[],
  columns_referenced TEXT[],
  team TEXT,
  user_email TEXT,
  execution_ms INTEGER,
  row_count BIGINT
);
```

**Implementation:**
- Snowflake: poll `INFORMATION_SCHEMA.QUERY_HISTORY` on a scheduled job
- Postgres: enable `pg_stat_statements` + `log_min_duration_statement = 0`
- BigQuery: subscribe to Cloud Audit Logs via Pub/Sub
- dbt: use `on-run-end` hook to capture compiled SQL + run metadata
- Use `sqlglot` to normalize SQL and extract table/column references across dialects

```python
import sqlglot

def extract_references(sql: str, dialect: str) -> dict:
    # Parse into an AST; raises sqlglot.errors.ParseError on invalid SQL,
    # so wrap calls in a try/except when processing raw production traffic
    ast = sqlglot.parse_one(sql, dialect=dialect)
    tables = [t.name for t in ast.find_all(sqlglot.expressions.Table)]
    columns = [c.name for c in ast.find_all(sqlglot.expressions.Column)]
    # Sort for deterministic storage in tables_referenced / columns_referenced
    return {"tables": sorted(set(tables)), "columns": sorted(set(columns))}
```
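Before parsing, it helps to collapse repeated parameterized queries into one row — the `normalized_sql` column exists for this. A minimal stdlib sketch of a fingerprint function (the regexes are illustrative; production normalization should go through `sqlglot` as above):

```python
import hashlib
import re

def query_fingerprint(raw_sql: str) -> str:
    """Collapse literals and whitespace so repeated parameterized
    queries map to one fingerprint for grouping in query_events."""
    sql = raw_sql.lower()
    sql = re.sub(r"'[^']*'", "?", sql)           # string literals -> ?
    sql = re.sub(r"\b\d+(\.\d+)?\b", "?", sql)   # numeric literals -> ?
    sql = re.sub(r"\s+", " ", sql).strip()       # collapse whitespace
    return hashlib.sha256(sql.encode()).hexdigest()[:16]
```

Two queries that differ only in literal values or whitespace then share a fingerprint, which keeps the downstream inference stages from seeing every dashboard refresh as a distinct query.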

## Step 2: Semantic Inference Engine

**Goal:** Learn what each metric means from how teams actually query it.

**DB Schema:**
```sql
CREATE TABLE semantic_definitions (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  entity_name TEXT NOT NULL,
  canonical_sql TEXT NOT NULL,
  variant_sqls JSONB DEFAULT '[]',
  teams_using TEXT[],
  confidence FLOAT DEFAULT 0.0,
  is_contested BOOLEAN DEFAULT FALSE,
  created_at TIMESTAMPTZ DEFAULT NOW(),
  updated_at TIMESTAMPTZ DEFAULT NOW()
);
```

**Key algorithm (definition clustering):**
```python
import sqlglot
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

def cluster_metric_definitions(sqls: list[str]) -> list[list[str]]:
    # Transpile to one dialect so cosmetic dialect differences don't split clusters
    normalized = [sqlglot.transpile(s, write="duckdb")[0].lower() for s in sqls]
    vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 5))
    X = vectorizer.fit_transform(normalized)
    labels = DBSCAN(eps=0.3, min_samples=2, metric='cosine').fit_predict(X)
    clusters: dict[int, list[str]] = {}
    for i, (sql, label) in enumerate(zip(sqls, labels)):
        # DBSCAN marks outliers as -1; keep each as its own singleton cluster
        # instead of lumping all noise points into one bucket
        key = label if label != -1 else -(i + 2)
        clusters.setdefault(key, []).append(sql)
    return list(clusters.values())
```
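Clusters alone don't fill `canonical_sql` or `confidence`. One simple policy — hypothetical, not necessarily Velum's — is to treat the most-queried variant as canonical, score confidence as its share of observed traffic, and flag the definition as contested below a threshold:

```python
from collections import Counter

def pick_canonical(variant_counts: dict[str, int]) -> dict:
    """variant_counts maps each variant SQL to how often it was observed."""
    total = sum(variant_counts.values())
    canonical, top = Counter(variant_counts).most_common(1)[0]
    share = top / total
    return {
        "canonical_sql": canonical,
        "confidence": share,                           # share of traffic using the winner
        "is_contested": len(variant_counts) > 1 and share < 0.8,  # 0.8 is an arbitrary cutoff
    }
```

Weighting by distinct teams rather than raw query count is a reasonable variation — one team's hourly dashboard refresh shouldn't outvote four teams' ad-hoc analysis.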

## Step 3: Live Dependency Graph

**Goal:** Build a real-time lineage graph mapping data flow from source to dashboard.

**DB Schema:**
```sql
CREATE TABLE lineage_nodes (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  node_type TEXT NOT NULL,
  fully_qualified_name TEXT NOT NULL UNIQUE,
  source_tool TEXT,
  schema_json JSONB DEFAULT '{}'
);

CREATE TABLE lineage_edges (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  from_node_id UUID REFERENCES lineage_nodes(id),
  to_node_id UUID REFERENCES lineage_nodes(id),
  edge_type TEXT,
  first_seen TIMESTAMPTZ DEFAULT NOW(),
  last_seen TIMESTAMPTZ DEFAULT NOW(),
  query_count INTEGER DEFAULT 1,
  UNIQUE(from_node_id, to_node_id, edge_type)
);
```
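Edges should be upserted, not inserted, so `query_count` accumulates and `last_seen` advances as traffic repeats. A sketch of the upsert using SQLite in place of Postgres (the `ON CONFLICT ... DO UPDATE` shape is the same; the demo table mirrors `lineage_edges` minus UUID and timestamp defaults):

```python
import sqlite3

def upsert_edge(conn, from_id: int, to_id: int, edge_type: str):
    # First sighting inserts the edge; repeats bump the counter in place
    conn.execute(
        """INSERT INTO lineage_edges (from_node_id, to_node_id, edge_type, query_count)
           VALUES (?, ?, ?, 1)
           ON CONFLICT(from_node_id, to_node_id, edge_type)
           DO UPDATE SET query_count = query_count + 1""",
        (from_id, to_id, edge_type),
    )

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE lineage_edges (
    from_node_id INTEGER, to_node_id INTEGER, edge_type TEXT,
    query_count INTEGER DEFAULT 1,
    UNIQUE(from_node_id, to_node_id, edge_type))""")
for _ in range(3):                      # same edge observed in three queries
    upsert_edge(conn, 1, 2, "reads_from")
count = conn.execute("SELECT query_count FROM lineage_edges").fetchone()[0]
```

In Postgres you would also `SET last_seen = NOW()` in the `DO UPDATE` clause; `query_count` doubles as an edge weight when ranking lineage paths later.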

**Graph traversal for RCA:**
```python
import networkx as nx

def find_divergence_root(graph: nx.DiGraph, metric_a: str, metric_b: str) -> list[str]:
    # Upstream closure of each metric in the lineage DAG
    ancestors_a = set(nx.ancestors(graph, metric_a))
    ancestors_b = set(nx.ancestors(graph, metric_b))
    shared = ancestors_a & ancestors_b
    divergence_points = []
    for node in shared:
        # Heuristic: a shared upstream node is a divergence candidate when its
        # downstream reach toward metric A differs from its reach toward metric B
        succ_a = set(nx.descendants(graph, node)) & ancestors_a
        succ_b = set(nx.descendants(graph, node)) & ancestors_b
        if succ_a != succ_b:
            divergence_points.append(node)
    return divergence_points
```

## Step 4: Anomaly Detection Engine

**Goal:** Continuously monitor data distributions and flag statistical anomalies without manual thresholds.

**DB Schema:**
```sql
CREATE TABLE column_profiles (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  table_fqn TEXT NOT NULL,
  column_name TEXT NOT NULL,
  profile_date DATE NOT NULL,
  null_rate FLOAT,
  distinct_count BIGINT,
  mean_val FLOAT,
  stddev FLOAT,
  row_count BIGINT,
  UNIQUE(table_fqn, column_name, profile_date)
);

CREATE TABLE anomalies (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  detected_at TIMESTAMPTZ DEFAULT NOW(),
  table_fqn TEXT NOT NULL,
  column_name TEXT,
  anomaly_type TEXT NOT NULL,
  severity TEXT DEFAULT 'warning',
  details JSONB DEFAULT '{}',
  resolved_at TIMESTAMPTZ,
  root_cause_node_id UUID REFERENCES lineage_nodes(id)
);
```
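The `column_profiles` rows have to come from somewhere: a scheduled profiling query per monitored column. A sketch of the generator — the SQL shape is generic and function names like `STDDEV` vary by dialect, so a real version would template per warehouse. Identifiers come from your own catalog, never from user input:

```python
def profile_query(table_fqn: str, column: str) -> str:
    """Render the daily profiling query that feeds column_profiles."""
    # NOTE: table_fqn/column must come from a trusted catalog (no injection risk)
    return (
        f"SELECT COUNT(*) AS row_count, "
        f"AVG(CASE WHEN {column} IS NULL THEN 1.0 ELSE 0.0 END) AS null_rate, "
        f"COUNT(DISTINCT {column}) AS distinct_count, "
        f"AVG({column}) AS mean_val, "
        f"STDDEV({column}) AS stddev "
        f"FROM {table_fqn}"
    )
```

Run it once per day per column, insert the result keyed on `(table_fqn, column_name, profile_date)`, and the drift detector below has its history.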

**Z-score drift detection:**
```python
def detect_drift(historical: list[float], current: float, threshold: float = 3.0) -> bool:
    # Require at least a week of profile points before judging drift
    if len(historical) < 7:
        return False
    mean = sum(historical) / len(historical)
    variance = sum((x - mean) ** 2 for x in historical) / len(historical)
    stddev = variance ** 0.5
    if stddev == 0:
        # Constant history: any change at all counts as drift
        return current != mean
    return abs(current - mean) / stddev > threshold
```

## Step 5: Root Cause Analysis

**Goal:** Given an anomaly, trace it to the upstream source using the dependency graph.

**API Design:**
```
POST /api/v1/rca
{
  "anomaly_id": "uuid",
  "trace_depth": 5
}

Response:
{
  "root_causes": [
    {
      "node": "dbt_model.staging.stg_transactions",
      "change_type": "filter_added",
      "confidence": 0.87,
      "diff": "- WHERE status IN ('paid', 'refunded')\n+ WHERE status = 'paid'"
    }
  ],
  "lineage_path": ["raw.transactions", "stg_transactions", "fct_revenue", "looker.revenue"]
}
```

Track dbt model versions by storing compiled SQL per git SHA. Compare adjacent deploys to detect definition changes automatically.
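The per-SHA comparison can start as a unified diff of compiled SQL with a crude change classifier on top. A stdlib sketch — the `filter_changed` heuristic (any changed line touching WHERE/HAVING) is illustrative, not a full classifier:

```python
import difflib

def diff_compiled_sql(old_sql: str, new_sql: str) -> dict:
    """Compare two compiled versions of the same dbt model."""
    diff = list(difflib.unified_diff(
        old_sql.splitlines(), new_sql.splitlines(), lineterm=""))
    # Keep only added/removed lines, dropping the +++/--- file headers
    changed = [l for l in diff if l.startswith(("+", "-"))
               and not l.startswith(("+++", "---"))]
    filter_changed = any(
        "where" in l.lower() or "having" in l.lower() for l in changed)
    return {"diff": "\n".join(changed), "filter_changed": filter_changed}
```

Feeding `changed` lines into the RCA response's `diff` field gives reviewers exactly the kind of `- WHERE status IN (...)` / `+ WHERE status = 'paid'` evidence shown above.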

## Step 6: Data Contract Generator

**Goal:** Auto-generate enforceable data contracts from observed production behavior and resolved incidents.

**DB Schema:**
```sql
CREATE TABLE data_contracts (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  table_fqn TEXT NOT NULL,
  column_name TEXT,
  contract_type TEXT NOT NULL,
  rule_sql TEXT,
  derived_from_anomaly_id UUID REFERENCES anomalies(id),
  status TEXT DEFAULT 'active',
  created_at TIMESTAMPTZ DEFAULT NOW()
);
```

**Contract generation:**
```python
def generate_contract_from_anomaly(anomaly: dict) -> dict:
    # Map each anomaly class to the contract type that would have caught it
    mapping = {
        'null_spike': 'not_null',
        'row_drop': 'row_count_floor',
        'distribution_drift': 'range',
        'schema_change': 'column_exists',
    }
    contract_type = mapping.get(anomaly['anomaly_type'])
    if contract_type is None:
        raise ValueError(f"no contract template for {anomaly['anomaly_type']}")
    # build_assertion_sql: your templating helper that renders the check as SQL
    rule_sql = build_assertion_sql(
        table=anomaly['table_fqn'],
        column=anomaly['column_name'],
        contract_type=contract_type,
        baseline_stats=anomaly['details'],
    )
    return {"table_fqn": anomaly['table_fqn'], "contract_type": contract_type, "rule_sql": rule_sql}
```

Export to dbt YAML, Great Expectations JSON, or Soda YAML for tool-native enforcement.
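A sketch of the dbt export for the simplest contract type (YAML rendered by hand to stay dependency-free; a real implementation would use a YAML library and cover `row_count_floor`, `range`, and the rest):

```python
def contracts_to_dbt_yaml(model: str, contracts: list[dict]) -> str:
    """Render not_null contracts as dbt schema.yml column tests."""
    lines = ["version: 2", "models:", f"  - name: {model}", "    columns:"]
    for c in contracts:
        if c["contract_type"] != "not_null":
            continue  # only the simplest mapping shown here
        lines += [f"      - name: {c['column_name']}",
                  "        tests:",
                  "          - not_null"]
    return "\n".join(lines)
```

Committing the generated file into the customer's dbt repo is what makes the contract enforceable: `dbt test` now fails the build when the same class of break recurs.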

## Step 7: Git / CI-CD Fix Deployer

**Goal:** Turn proposed fixes into pull requests that flow through normal code review.

**API Design:**
```
POST /api/v1/fixes/propose
{
  "anomaly_id": "uuid",
  "fix_type": "unify_definition",
  "target_tool": "dbt",
  "repo_url": "https://github.com/org/dbt-project",
  "base_branch": "main"
}
```

**Implementation:**
1. Clone repo to ephemeral container using PyGithub
2. Generate the diff with the Claude API; prompt along the lines of: “Given this dbt model and the definition conflict, output only the minimal SQL change to unify the definition”
3. Apply diff, run `dbt compile --select affected_model` to validate
4. Open PR via GitHub API with auto-generated description linking to the anomaly
5. Post PR URL back to Velum dashboard for human review
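Step 4's auto-generated description can be a plain template that links the PR back to its anomaly. A sketch — field names mirror the schemas above, and the dashboard URL format is hypothetical:

```python
def pr_description(anomaly: dict, root_cause: dict, dashboard_url: str) -> str:
    fence = "`" * 3  # avoids embedding a literal code fence in this snippet
    return "\n".join([
        f"## Automated fix: {anomaly['anomaly_type']} on {anomaly['table_fqn']}",
        "",
        f"Root cause: `{root_cause['node']}` ({root_cause['change_type']}, "
        f"confidence {root_cause['confidence']:.0%})",
        "",
        fence + "diff",
        root_cause["diff"],
        fence,
        "",
        f"Full lineage trace: {dashboard_url}/anomalies/{anomaly['id']}",
        "",
        "_Generated automatically; review before merging._",
    ])
```

Linking the anomaly ID in both directions (PR body and dashboard) is what keeps the human-review loop auditable for the regulated customers Velum targets.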

**Deployment stack:**
- Fly.io or Railway for containerized backends
- Postgres for all state (skip Redis until you actually need it)
- Query interceptors as sidecar agents in customer infra (Docker + cron)
- Next.js frontend with Supabase Auth for multi-tenant access control
- OpenTelemetry → Grafana Cloud for observability (free tier covers early customers)