Claude's Corner: Cajal — The Machine That Checks Its Own Math

Cajal deploys AI agents to discover and formally verify mathematical proofs at scale. Every result is machine-checked by Lean's type-checking kernel — the closest thing math has to a ground truth oracle. Here's why this matters, how Tau works, and whether you can actually replicate it.

TL;DR

Cajal is a YC W2026 startup building a multi-agent AI system that uses Lean 4 to formally verify mathematical proofs at industrial scale. Their business is selling verified proof datasets and reinforcement-learning environments to AI labs training the next generation of math-capable models. The core advantage is that formal verification gives a clean binary reward signal no learned reward model can fake; whether that amounts to a durable moat is a separate question, taken up in the moat section.

The Machine That Checks Its Own Math

There's a reason mathematicians still spend weeks arguing about whether a proof is valid. Math is hard to get right, and human review doesn't scale. Cajal's bet is simple: AI can now find proofs, and formal verification can guarantee they're correct. Put those two things together and you get something genuinely new — a machine that does mathematics and proves it isn't lying.

That's not a metaphor. Every result Cajal's system produces is checked by Lean's type-checking kernel, the closest thing mathematics has to a ground truth oracle. If Lean accepts a proof, the theorem as formally stated follows from its axioms and definitions. Peer review of the logic is no longer required; what remains is confirming that the formal statement says what the mathematician meant.

This is what Cajal (YC W2026) is building. Two founders, one shot at becoming the infrastructure layer for provably correct AI reasoning. It's one of the most technically serious W2026 bets, and it's worth understanding why.

What They're Building

Cajal's core product is Tau, a multi-agent system that discovers and formally verifies mathematical proofs at scale. Tau isn't just generating LaTeX that looks plausible — it's writing proofs in Lean 4, a formal proof assistant and programming language where the type checker is the judge. Bad proof? Won't compile. End of story.
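
To make "won't compile" concrete, here is a minimal Lean 4 illustration. It is not taken from Cajal's system, and the theorem names are made up for the example. The first theorem is accepted; the second, left commented out, would be rejected because the supplied term proves a different statement.

```lean
-- Accepted: `Nat.add_zero n` is a proof of `n + 0 = n`.
theorem add_zero_ok (n : Nat) : n + 0 = n := Nat.add_zero n

-- Rejected if uncommented: the term proves `n + 0 = n`, not `0 + n = n`,
-- so Lean reports a type mismatch instead of accepting the theorem.
-- theorem zero_add_bad (n : Nat) : 0 + n = n := Nat.add_zero n
```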

That alone would be interesting. But Cajal is also selling the outputs of that system to the people who need them most: frontier AI labs.

The product line has three legs:

  • Verified training datasets — formally verified math corpora in Lean 4, Coq, and Isabelle. This is the kind of data that doesn't exist at scale anywhere. Labs training the next generation of reasoning models need it badly.
  • Evaluations and benchmarks — rigorous Pass@k metrics against verified problem sets. Not vibes, not leaderboards, actual formal verification of whether the model got it right.
  • RL environments — native proof assistant bindings with sub-millisecond latency, purpose-built for reinforcement learning against formal math. This is the plumbing that lets labs train models the way DeepMind trained AlphaProof, but faster and without building it from scratch.

Beyond the AI labs, Cajal is targeting verticals where mathematical correctness is load-bearing: quantum computing, quantitative finance, cryptography, aerospace, robotics, biology. These are industries where a wrong proof isn't an embarrassment — it's a liability.

The business model is B2B all the way down. Partnerships with frontier AI labs on the data and infrastructure side, enterprise contracts with domain-specific organizations that need verified mathematics as part of their core workflow. There's no consumer play here, no freemium, no virality. Just technical credibility sold to people who can evaluate it.

How Tau Actually Works

Theorem proving is a search problem. Given a statement you want to prove, you need to find a sequence of valid logical steps that gets you from your axioms to that statement. The search space is enormous — combinatorially explosive in any non-trivial domain. This is why it's hard, and this is where the interesting engineering lives.

Tau is a multi-agent system, which in this context means something more precise than the usual marketing noise around the term. Different agents handle different parts of the search: some propose high-level proof strategies, some generate specific tactic sequences in Lean, some evaluate partial proofs and backtrack, some specialize in particular mathematical domains. They collaborate, they disagree, they check each other's work.

The verification step is what separates this from most other "AI does math" systems. When an agent proposes a proof, it's not evaluated by another language model or by a human. It's checked by Lean's kernel, a small, trusted piece of software that implements the rules of dependent type theory. If the proof is valid, the kernel accepts it. If it isn't, the kernel rejects it and the system tries again. This is a hard feedback signal in a domain that has historically been starved for hard feedback signals.

That feedback loop is also what makes Tau useful as an RL environment. You have a reward signal that is both instant and incorruptible: the proof either checks out or it doesn't. No learned reward model that can be gamed, no human rater who gets tired. The proof checker is the reward function, and it doesn't lie.

The team behind this has the credentials to pull it off. Luke Johnston brings ML and neuroscience from Oxford, Cambridge, and UCL. Pedro Nobre has formal verification and AI expertise — exactly the combination you need when your product sits at the intersection of dependent type theory and modern deep learning. This is not a team that watched a YouTube video about Lean. They understand the underlying mathematics.

The company name is itself a signal about how they're thinking. Santiago Ramón y Cajal was the neuroscientist who first drew neural circuits — hand-illustrated maps of the brain that shaped a century of neuroscience. The founders are nodding at that timescale of scientific importance. Whether that ambition is warranted remains to be seen, but it's not nothing that they're thinking in those terms from day one.

Difficulty Scores

How hard is each dimension of this build? Here's an honest assessment:

  • ML/AI: 9/10. This is cutting-edge multi-agent theorem proving. Training on formal proof corpora, reinforcement learning against proof checkers, handling the combinatorial search problem at scale: this is not a fine-tune-GPT-4 situation. The research frontier and the product are the same thing.
  • Data: 8/10. Verified math corpora are among the scarcest, most expensive-to-produce datasets in existence. Building that corpus is Cajal's primary moat and also their primary engineering challenge. You can't scrape your way to this.
  • Backend: 7/10. Deep Lean integration, proof search algorithms, sub-millisecond RL environment latency. Real systems engineering with very little room for slop. Correctness is non-negotiable when your product is correctness.
  • Frontend: 2/10. It's early-stage B2B selling to AI researchers and enterprise engineers. The UI is probably a dashboard, an API key, and a Slack channel. Nobody is buying Cajal for the UX.
  • DevOps: 4/10. Standard cloud infrastructure. Proof checking is computationally intensive but not architecturally exotic. Nothing here that a senior SRE hasn't seen before.

The Moat: What's Real and What Isn't

The cynical read on Cajal is that Lean is open source, AlphaProof proved the approach works, and a well-funded team could replicate the architecture. That's true. The architecture is not secret.

But the moat isn't the architecture. It's three things that are genuinely hard to clone:

The data corpus. Years of curated, formally verified mathematics across multiple proof assistants. Every theorem in that corpus was either written by hand by someone who knows what they're doing, or generated and verified by a system that has already been trained. You can't buy this data. You can't generate it without already having it. This is the most defensible asset Cajal has, and it compounds over time.

The expertise overlap. You need people who are simultaneously strong in formal methods (dependent type theory, Lean's metaprogramming, tactic engines) and modern ML (multi-agent systems, RL from formal feedback, fine-tuning on proof corpora). That Venn diagram is tiny. Axiom raised $200M and is still hiring. DeepMind has an entire research team on this. The talent constraint is real.

First-mover in RL environments. If Cajal gets their RL environment product embedded in a frontier lab's training infrastructure, that's a switching cost. Training pipelines don't get replaced casually. The team that solves the integration problem first has a durable advantage.

What's not a moat: the model architecture, the general approach, the use of MCTS in proof search. These are known. Well-resourced competitors — and they exist, both at big labs and at funded startups — will get there.

The honest summary is that Cajal's moat is time-based, not structural. They need to get far enough ahead that catching up becomes economically irrational. That's a race, and they're in it.

Replicability Score: 72/100

This is a 72, not a 55, because the underlying approach is proven and the tools are open source. A strong ML team with formal methods expertise and serious funding could reproduce the core system. AlphaProof showed the world the recipe. Lean and Coq aren't proprietary. The architecture of Tau (multi-agent proof search with formal verification as the reward signal) is the kind of thing that gets written up in papers.

It's not an 85, because the data moat is real and the expertise requirement is brutal. You're not building this with generalist engineers. The formal verification side alone requires people who have spent years thinking about type theory and proof assistants. The ML side requires people who understand why training on formal proofs is different from training on natural language. The combination is rare enough that it functions as a real barrier.

The 28 points of difficulty that keep this from being fully replicable are concentrated in two places: the corpus of verified mathematics they're building, and the relationships they're establishing with frontier AI labs right now. Both of those compound. Both of those are hard to fast-follow.

Cajal is playing in a space where the largest AI labs in the world are paying attention. That's both validation and threat. The question is whether a two-person team can move fast enough and build deep enough customer relationships to become infrastructure rather than competition. YC's Diana Hu backing them suggests at least one smart person thinks the answer is yes.

Watch this one.

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.

Build This Startup with Claude Code

Complete replication guide — install as a slash command or rules file

# How to Build a Formal Verification AI Platform (Cajal Clone)

A step-by-step technical guide for building a system that discovers and formally verifies mathematical proofs using multi-agent AI. Each step is scoped for a developer working with Claude Code and modern tooling.

---

## Step 1: Set Up the Proof Assistant Environment

**Goal:** Get Lean 4 and mathlib running, expose them programmatically, and establish your baseline proof-checking infrastructure.

Install Lean 4 via `elan` (the Lean version manager):

```bash
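curl https://raw.githubusercontent.com/leanprover/elan/master/elan-init.sh -sSf | sh
source "$HOME/.elan/env"       # put lake and lean on PATH in the current shell
lake new proof_env math        # the "math" template declares Mathlib as a dependency
cd proof_env && lake update
lake exe cache get             # fetch prebuilt Mathlib files instead of compiling them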
```

Mathlib is the Lean community's massive mathematics library — over 150,000 theorems. This is your ground truth corpus and your starting vocabulary. You'll need it.

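Build a thin Python wrapper around a Lean REPL (Read-Eval-Print Loop), for example the community `repl` project (`leanprover-community/repl`), which exchanges JSON over stdin and stdout, or spawn Lean processes directly. Note that `lean --server` speaks the Language Server Protocol rather than this line-oriented JSON format, so the sketch here targets the REPL:

```python
# lean_env.py
import json
import subprocess
from typing import Optional

class LeanEnvironment:
    # Assumes the community REPL is runnable as `lake exe repl` from the
    # project directory; pass a different repl_cmd if your setup differs.
    def __init__(self, repl_cmd=("lake", "exe", "repl")):
        self.proc = subprocess.Popen(
            list(repl_cmd),
            stdin=subprocess.PIPE, stdout=subprocess.PIPE,
            text=True, encoding="utf-8"
        )

    def check_proof(self, code: str, env: Optional[int] = None) -> dict:
        # Omit "env" to start from a fresh environment; pass the "env" id
        # from an earlier reply to build on that command's declarations.
        payload = {"cmd": code}
        if env is not None:
            payload["env"] = env
        # The REPL expects each JSON command to be followed by a blank line.
        self.proc.stdin.write(json.dumps(payload) + "\n\n")
        self.proc.stdin.flush()
        # The reply is JSON that may span several lines, ended by a blank line.
        lines = []
        while (line := self.proc.stdout.readline()).strip():
            lines.append(line)
        return json.loads("".join(lines))
```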

**Key metric:** Proof check latency should be under 50ms for simple tactics. Optimize this aggressively — it's your inner loop for everything that follows.
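
One way to track this metric is a small timing harness. The sketch below is an assumption about how you might measure it, not part of any Lean tooling: `measure_latency_ms` is a hypothetical helper that times any callable (for example `LeanEnvironment.check_proof`) and reports the median and a nearest-rank p99.

```python
import math
import statistics
import time
from typing import Callable, Iterable

def measure_latency_ms(check: Callable[[str], object], snippets: Iterable[str]) -> dict:
    """Time each call to `check` and summarise the latencies in milliseconds."""
    samples = []
    for snippet in snippets:
        start = time.perf_counter()
        check(snippet)
        samples.append((time.perf_counter() - start) * 1000.0)
    if not samples:
        raise ValueError("need at least one snippet to measure")
    samples.sort()
    # Nearest-rank p99: the smallest sample that is >= 99% of all samples.
    p99_index = max(0, math.ceil(0.99 * len(samples)) - 1)
    return {
        "count": len(samples),
        "median_ms": statistics.median(samples),
        "p99_ms": samples[p99_index],
    }
```

Run it against a fixed set of simple tactics after every change to the wrapper, and store the result in `check_latency_ms` style columns so regressions are visible.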

Database schema for tracking proof states:

```sql
-- theorems is created first so that proof_attempts can reference it
CREATE TABLE theorems (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    statement TEXT NOT NULL,
    lean_statement TEXT NOT NULL,
    domain TEXT,  -- 'algebra', 'topology', 'number_theory', etc.
    difficulty_estimate FLOAT,
    source TEXT,
    verified_proof TEXT,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE proof_attempts (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    theorem_id UUID NOT NULL REFERENCES theorems(id),
    tactic_sequence JSONB NOT NULL,
    lean_output TEXT,
    verified BOOLEAN DEFAULT FALSE,
    error_msg TEXT,
    check_latency_ms INTEGER,
    created_at TIMESTAMPTZ DEFAULT NOW()
);
```

---

## Step 2: Build the Proof Search Engine

**Goal:** Implement Monte Carlo Tree Search (MCTS) over the tactic space, using Lean as the state evaluator.

Proof search is a tree problem. Each node is a proof state (a set of goals remaining), each edge is a tactic applied, and success is a leaf with zero remaining goals. MCTS is a strong baseline because it balances exploration (trying novel tactics) with exploitation (following paths that have worked before).

```python
# mcts.py
import math
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ProofNode:
    state: str           # Lean tactic state as string
    tactic: Optional[str] = None
    parent: Optional['ProofNode'] = None
    children: list = field(default_factory=list)
    visits: int = 0
    value: float = 0.0
    is_terminal: bool = False
    is_proved: bool = False

    def ucb_score(self, exploration_weight=1.4) -> float:
        if self.visits == 0:
            return float('inf')
        exploitation = self.value / self.visits
        exploration = exploration_weight * math.sqrt(
            math.log(self.parent.visits) / self.visits
        )
        return exploitation + exploration

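class MCTSProofSearch:
    # lean_env.apply_tactic and llm_client.suggest_tactics are assumed
    # interfaces that you supply; this guide does not define them.
    def __init__(self, lean_env, llm_client, num_simulations=500):
        self.lean = lean_env
        self.llm = llm_client
        self.num_simulations = num_simulations

    def search(self, theorem: str) -> Optional[list[str]]:
        root = ProofNode(state=theorem)
        for _ in range(self.num_simulations):
            node = self._select(root)
            proved_child = self._expand_and_simulate(node)
            reward = 1.0 if proved_child is not None else 0.0
            self._backpropagate(node, reward)
            if proved_child is not None:
                return self._extract_proof_path(proved_child)
        return None

    def _select(self, node: ProofNode) -> ProofNode:
        # Descend by UCB score until reaching a leaf
        while node.children and not node.is_terminal:
            node = max(node.children, key=lambda n: n.ucb_score())
        return node

    def _expand_and_simulate(self, node: ProofNode) -> Optional[ProofNode]:
        if node.is_terminal:
            return None  # dead end: nothing to expand
        # Ask LLM for candidate tactics given current proof state
        tactics = self.llm.suggest_tactics(node.state, n=8)
        for tactic in tactics:
            result = self.lean.apply_tactic(node.state, tactic)
            child = ProofNode(
                state=result['new_state'],
                tactic=tactic,
                parent=node,
                is_terminal=result['is_terminal'],
                is_proved=result['is_proved']
            )
            node.children.append(child)
            if child.is_proved:
                return child  # return the proved child, not its parent
        return None

    def _backpropagate(self, node: Optional[ProofNode], reward: float) -> None:
        # Credit every node on the path back to the root
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent

    def _extract_proof_path(self, node: ProofNode) -> list[str]:
        # Walk from the proved node up to the root, then reverse
        tactics = []
        while node.parent is not None:
            tactics.append(node.tactic)
            node = node.parent
        return tactics[::-1]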

Add beam search as a complementary strategy for simpler theorems where MCTS overhead isn't worth it. Switch between strategies based on estimated theorem difficulty.
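
A minimal beam search sketch, under stated assumptions: `expand`, `score`, and `is_proved` are hypothetical callbacks standing in for the LLM proposal step, a heuristic value estimate, and the Lean check. None of these names come from Lean or from Cajal.

```python
from typing import Callable, Optional

def beam_search(
    start: str,
    expand: Callable[[str], list[tuple[str, str]]],  # state -> [(tactic, next_state)]
    score: Callable[[str], float],                   # higher means more promising
    is_proved: Callable[[str], bool],
    beam_width: int = 4,
    max_depth: int = 16,
) -> Optional[list[str]]:
    if is_proved(start):
        return []
    beam = [(start, [])]  # (state, tactics applied so far)
    for _ in range(max_depth):
        candidates = []
        for state, path in beam:
            for tactic, next_state in expand(state):
                new_path = path + [tactic]
                if is_proved(next_state):
                    return new_path
                candidates.append((next_state, new_path))
        if not candidates:
            return None  # every branch is stuck
        # Keep only the beam_width most promising states for the next level.
        candidates.sort(key=lambda item: score(item[0]), reverse=True)
        beam = candidates[:beam_width]
    return None
```

Unlike MCTS, this keeps no visit statistics and never revisits a pruned branch, which is why it is cheaper but more likely to miss proofs that need an unpromising-looking intermediate step.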

---

## Step 3: Train a Proof-Generation Model

**Goal:** Fine-tune a language model specifically on formal proof corpora so it generates valid Lean tactics rather than plausible-looking nonsense.

Start with a strong base model (Qwen2.5-Math or DeepSeek-Prover are solid open-source options). Fine-tune on Lean 4 proof data using next-token prediction on tactic sequences.

Data format for fine-tuning (pretty-printed here for readability; in the actual JSONL file each record occupies a single line):

```jsonl
{"messages": [
  {"role": "system", "content": "You are a Lean 4 proof assistant. Given a theorem statement and current proof state, suggest the next tactic."},
  {"role": "user", "content": "Theorem: ∀ n : ℕ, n + 0 = n\nCurrent state: ⊢ ∀ n : ℕ, n + 0 = n"},
  {"role": "assistant", "content": "intro n\nsimp [Nat.add_zero]"}
]}
```
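
Records in this shape can be produced with the standard `json` module. The sketch below assumes you already have (theorem, proof state, next tactic) triples extracted from verified proofs; `to_jsonl_line` is an illustrative helper name.

```python
import json

SYSTEM_PROMPT = (
    "You are a Lean 4 proof assistant. Given a theorem statement and "
    "current proof state, suggest the next tactic."
)

def to_jsonl_line(theorem: str, state: str, tactic: str) -> str:
    record = {"messages": [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Theorem: {theorem}\nCurrent state: {state}"},
        {"role": "assistant", "content": tactic},
    ]}
    # ensure_ascii=False keeps symbols such as ∀ and ℕ readable; json.dumps
    # escapes embedded newlines, so each record stays on one physical line.
    return json.dumps(record, ensure_ascii=False)
```

When writing the file, open it with `encoding="utf-8"` and append one line per record.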

After supervised fine-tuning, apply GRPO (Group Relative Policy Optimization) with the Lean kernel as the reward function:

```python
def compute_reward(proof_attempt: list[str], theorem: str, lean_env) -> float:
    result = lean_env.check_full_proof(theorem, proof_attempt)
    if result['verified']:
        return 1.0
    # Partial credit for making progress (fewer goals remaining)
    total_goals = max(result.get('total_goals', 1), 1)  # guard against division by zero
    progress = result.get('goals_closed', 0) / total_goals
    return progress * 0.3
```

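The reward signal is clean and binary at the top level: the proof either checks out or it doesn't. This is what makes formal verification so useful for RL: there is no learned reward model to drift or be gamed. The one loophole to close is the checker configuration itself. Lean accepts `sorry` with only a warning, so the reward function must reject any proof that contains `sorry` or introduces new axioms.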
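
GRPO scores each sampled proof relative to the other samples drawn for the same theorem, rather than against a learned value function. A minimal sketch of that normalisation step, assuming the rewards come from a function like `compute_reward`; the helper name and the use of the population standard deviation are illustrative assumptions.

```python
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalise rewards within one group of samples for the same theorem."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every sample got the same reward,
    # in which case all advantages are zero and the group gives no gradient.
    return [(r - mean) / (std + eps) for r in rewards]
```

A group in which every attempt fails (or every attempt succeeds) carries no learning signal, so theorem difficulty should be matched to the current model.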

---

## Step 4: Build the Tau Multi-Agent Orchestration System

**Goal:** Coordinate multiple specialized agents to collaborate on proof discovery.

Different agents handle different parts of the search. Implement a supervisor that routes tasks and aggregates results:

```python
# orchestrator.py
from enum import Enum
from typing import Protocol

class AgentRole(Enum):
    STRATEGIST = "strategist"      # High-level proof plan
    TACTICIAN = "tactician"        # Low-level tactic generation
    CRITIC = "critic"              # Evaluates partial proofs
    SPECIALIST = "specialist"      # Domain expert (algebra, analysis, etc.)
    VERIFIER = "verifier"          # Calls Lean kernel

class ProofAgent(Protocol):
    role: AgentRole
    async def act(self, state: dict) -> dict: ...

class TauOrchestrator:
    def __init__(self, agents: list[ProofAgent], lean_env, max_rounds=50):
        self.agents = {a.role: a for a in agents}
        self.lean = lean_env
        self.max_rounds = max_rounds

    async def prove(self, theorem: str) -> dict:
        state = {
            "theorem": theorem,
            "proof_steps": [],
            "current_goals": [theorem],
            "failed_tactics": [],
            "round": 0
        }

        while state["round"] < self.max_rounds and state["current_goals"]:
            # Strategist sets the plan
            strategy = await self.agents[AgentRole.STRATEGIST].act(state)
            state["strategy"] = strategy["plan"]

            # Tactician generates concrete steps
            tactics = await self.agents[AgentRole.TACTICIAN].act(state)

            # Critic filters bad moves before wasting Lean calls
            filtered = await self.agents[AgentRole.CRITIC].act({
                **state, "proposed_tactics": tactics["tactics"]
            })

            # Apply surviving tactics, verify with Lean
            for tactic in filtered["approved_tactics"]:
                result = self.lean.apply_tactic(
                    state["current_goals"][0], tactic
                )
                if result["success"]:
                    state["proof_steps"].append(tactic)
                    state["current_goals"] = result["remaining_goals"]
                    break
                else:
                    state["failed_tactics"].append(tactic)

            state["round"] += 1

        # Final end-to-end check; uses the same 'verified' key as compute_reward
        result = self.lean.check_full_proof(theorem, state["proof_steps"])
        return {"proof": state["proof_steps"], "verified": result["verified"]}
```

Use a message queue (Redis Streams or RabbitMQ) for agent coordination in production. Each agent is a separate service; the orchestrator is the control plane.
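
Before introducing a broker, the same request and reply pattern can be prototyped in-process. The sketch below uses `asyncio.Queue` as a stand-in for Redis Streams or RabbitMQ; `agent_worker`, `run_demo`, and the message fields are hypothetical names chosen for the example.

```python
import asyncio
from typing import Callable

async def agent_worker(name: str, inbox: asyncio.Queue, outbox: asyncio.Queue,
                       handle: Callable[[dict], object]) -> None:
    """Consume tasks from inbox until a None sentinel arrives."""
    while True:
        task = await inbox.get()
        if task is None:  # sentinel: shut down cleanly
            break
        await outbox.put({"agent": name, "task_id": task["id"], "result": handle(task)})

async def run_demo() -> dict:
    inbox, outbox = asyncio.Queue(), asyncio.Queue()
    # Toy handler: a real tactician would call a model here.
    worker = asyncio.create_task(
        agent_worker("tactician", inbox, outbox, lambda t: f"intro for {t['goal']}")
    )
    await inbox.put({"id": 1, "goal": "n + 0 = n"})
    reply = await outbox.get()
    await inbox.put(None)
    await worker
    return reply
```

Swapping the queues for a real broker changes the transport but not the shape of the messages, which makes this a cheap way to settle the message schema first.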

---

## Step 5: Build the Dataset Pipeline

**Goal:** Curate, formalize, and verify mathematical corpora at scale for sale to AI labs.

This is your primary business asset. Build it like it matters — because it does.

Pipeline stages:

1. **Ingest** — Scrape arXiv math papers, ProofWiki, existing Lean/Coq/Isabelle libraries. Parse LaTeX with `latexml` or `plasTeX`.
2. **Formalize** — Use your proof-generation model to translate informal math into Lean 4 statements.
3. **Verify** — Every statement gets checked by the Lean kernel. Failed verifications go to a human review queue or back to the model.
4. **Grade** — Assign difficulty scores, domain tags, and proof complexity metrics.
5. **Deduplicate** — Embedding-based dedup to remove near-identical theorems.
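
The deduplication stage can be sketched as a greedy cosine-similarity filter. This assumes each theorem already has an embedding vector from a model of your choice; the 0.95 threshold and the function names are illustrative, and the quadratic scan would need an approximate nearest-neighbour index at real corpus sizes.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def deduplicate(items: list[tuple[str, list[float]]], threshold: float = 0.95) -> list[str]:
    """Keep a theorem only if it is not too similar to any theorem kept so far."""
    kept: list[tuple[str, list[float]]] = []
    for theorem_id, vector in items:
        if all(cosine(vector, kept_vector) < threshold for _, kept_vector in kept):
            kept.append((theorem_id, vector))
    return [theorem_id for theorem_id, _ in kept]
```

Because the filter is order-dependent, sort the input so that the preferred version of each theorem (for example the human-reviewed one) comes first.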

```sql
-- Dataset versioning schema
CREATE TABLE dataset_versions (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    version_tag TEXT UNIQUE NOT NULL,  -- 'v1.2.0'
    proof_assistant TEXT NOT NULL,     -- 'lean4', 'coq', 'isabelle'
    theorem_count INTEGER,
    verified_count INTEGER,
    domain_breakdown JSONB,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    s3_path TEXT NOT NULL
);

CREATE TABLE theorem_provenance (
    theorem_id UUID REFERENCES theorems(id),
    dataset_version_id UUID REFERENCES dataset_versions(id),
    source_url TEXT,
    formalization_model TEXT,
    human_reviewed BOOLEAN DEFAULT FALSE,
    PRIMARY KEY (theorem_id, dataset_version_id)
);
```

Export in multiple formats: raw Lean files, JSONL for training, Parquet for analytics. Automate nightly builds and publish checksums.
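
Publishing checksums can be as simple as a manifest of SHA-256 digests per exported file. A minimal sketch using only the standard library; `build_manifest` and the directory layout are assumptions for illustration.

```python
import hashlib
from pathlib import Path

def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large exports never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(export_dir: Path) -> dict[str, str]:
    """Map each file's path (relative to export_dir) to its SHA-256 digest."""
    return {
        path.relative_to(export_dir).as_posix(): sha256_file(path)
        for path in sorted(export_dir.rglob("*"))
        if path.is_file()
    }
```

Storing the manifest next to the `s3_path` of each dataset version lets customers verify that a download matches the build they were sold.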

---

## Step 6: Build the API Layer

**Goal:** Expose your RL environments, datasets, and evaluation endpoints to paying customers.

Three distinct API surfaces, each with different latency and throughput requirements.

**RL Environment API** (latency-critical, sub-millisecond target):

```python
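# FastAPI with async Lean pool
from fastapi import Depends, FastAPI
from pydantic import BaseModel

app = FastAPI()

# Request models inferred from the handlers; field types are assumptions.
class StepRequest(BaseModel):
    env_id: str
    tactic: str

class ResetRequest(BaseModel):
    theorem: str

# lean_pool, verify_key, and APIKey are assumed to be defined elsewhere
# in the service (process pool, auth dependency, and key model).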
@app.post("/v1/env/step")
async def env_step(request: StepRequest, api_key: APIKey = Depends(verify_key)):
    env = await lean_pool.acquire(request.env_id)
    result = await env.apply_tactic_async(request.tactic)
    return {
        "observation": result.new_state,
        "reward": 1.0 if result.proved else 0.0,
        "done": result.is_terminal,
        "info": {"goals_remaining": result.goal_count}
    }

@app.post("/v1/env/reset")
async def env_reset(request: ResetRequest, api_key: APIKey = Depends(verify_key)):
    env_id = await lean_pool.spawn(request.theorem)
    return {"env_id": env_id, "observation": request.theorem}
```

Maintain a warm pool of pre-initialized Lean processes. Cold-starting Lean is slow (200–500ms); warm instances check tactics in under 5ms.
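
A warm pool can be sketched with an `asyncio.Queue` of idle workers. This is a simplified stand-in for the `lean_pool` object used in the handlers, with a different interface: `WarmPool`, `lease`, and the `start_worker` coroutine are hypothetical names, and a real pool would also need health checks and replacement of crashed processes.

```python
import asyncio
from contextlib import asynccontextmanager

class WarmPool:
    """Keeps `size` pre-started workers ready for immediate use."""

    def __init__(self, start_worker, size: int = 4):
        self._start_worker = start_worker  # coroutine function returning a worker handle
        self._size = size
        self._idle: asyncio.Queue = asyncio.Queue()

    async def warm_up(self) -> None:
        # Pay the cold-start cost once, concurrently, before serving traffic.
        workers = await asyncio.gather(*(self._start_worker() for _ in range(self._size)))
        for worker in workers:
            self._idle.put_nowait(worker)

    @asynccontextmanager
    async def lease(self):
        worker = await self._idle.get()  # waits if every worker is busy
        try:
            yield worker
        finally:
            self._idle.put_nowait(worker)  # always return the worker to the pool
```

Create the pool inside the running event loop (for example in an application startup hook) and wrap each request in `async with pool.lease() as worker:`.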

**Dataset API** (throughput-optimized):

```python
@app.get("/v1/datasets/{version}/theorems")
async def get_theorems(
    version: str,
    domain: Optional[str] = None,
    min_difficulty: float = 0.0,
    limit: int = 1000,
    offset: int = 0,
    api_key: APIKey = Depends(verify_key)
):
    # Stream from S3 or serve from read replica
    ...
```

**Eval API:**

```python
@app.post("/v1/eval/run")
async def run_evaluation(request: EvalRequest, api_key: APIKey = Depends(verify_key)):
    job_id = await eval_queue.enqueue({
        "model_endpoint": request.model_endpoint,
        "benchmark_id": request.benchmark_id,
        "pass_at_k": request.k,
        "timeout_per_problem": request.timeout_s
    })
    return {"job_id": job_id, "status": "queued"}
```
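
For the `pass_at_k` field, a common choice is the unbiased estimator popularised by code-generation benchmarks: sample n proofs per problem, count the c that verify, and compute 1 - C(n-c, k) / C(n, k). The source does not say which estimator Cajal uses, so treat this as one reasonable option; the function name is illustrative.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without replacement
    from n attempts of which c verified, is a verified proof."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few failures to fill k draws, so a success is certain
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark score is the mean of this value over all problems. Here "verified" means accepted by the proof checker, which is what distinguishes this from pass@k computed against unit tests or string matching.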

---

## Step 7: Deploy and Productize

**Goal:** Ship to production, onboard customers, and build the billing/usage infrastructure.

**Infrastructure:**

- API layer: Kubernetes on GKE or EKS, autoscaled on request latency
- Lean pool: Stateful pods, pre-warmed, drained gracefully before termination
- Database: Postgres (RDS or Cloud SQL) with read replicas for dataset queries
- Queue: Redis for RL environment session state, RabbitMQ for eval jobs
- Storage: S3 for dataset artifacts, versioned with lifecycle policies

**Billing schema:**

```sql
CREATE TABLE usage_events (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    org_id UUID REFERENCES organizations(id),
    event_type TEXT NOT NULL,  -- 'env_step', 'dataset_download', 'eval_run'
    quantity INTEGER DEFAULT 1,
    metadata JSONB,
    billed_at TIMESTAMPTZ,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE subscriptions (
    org_id UUID REFERENCES organizations(id) PRIMARY KEY,
    plan TEXT NOT NULL,          -- 'research', 'enterprise', 'lab'
    env_steps_quota BIGINT,      -- monthly RL environment steps
    dataset_gb_quota INTEGER,
    eval_runs_quota INTEGER,
    overage_rate_usd NUMERIC(10,4),
    stripe_subscription_id TEXT
);
```
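
How the quota and overage columns combine into a charge is not specified, so the following is only one plausible reading: usage above the monthly quota is billed per unit at `overage_rate_usd`. The function name and rounding to cents are assumptions.

```python
from decimal import Decimal

def overage_charge(used: int, quota: int, rate_usd: Decimal) -> Decimal:
    """Charge for usage above quota, rounded to whole cents."""
    extra_units = max(0, used - quota)
    # Decimal avoids binary floating-point error in money arithmetic.
    return (rate_usd * extra_units).quantize(Decimal("0.01"))
```

For example, 1,200,000 environment steps against a 1,000,000-step quota at a rate of 0.0001 USD per step yields a 20.00 USD overage.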

**Customer onboarding checklist:**
- Provision org + API key via internal admin panel
- Send Lean environment quickstart (Python SDK + example RL training loop)
- Slack connect for enterprise customers
- Weekly usage report email

**Monitoring:** Track `env_step_p99_latency`, `proof_verification_error_rate`, `dataset_download_throughput`. Page on p99 > 10ms for RL endpoints — your customers are training on this in real time.

The hardest part of this build is not the code. It's accumulating enough verified theorems that your dataset is worth paying for, and getting your RL environment trusted enough that a lab plugs it into a live training run. Both of those are slow, trust-based processes. Start building both on day one.