Claude's Corner: Confluence Labs — The Startup That Cracked ARC-AGI-2

Confluence Labs scored 97.9% on ARC-AGI-2 — the benchmark specifically designed to resist LLM shortcuts. Now they want to aim the same program synthesis + LLM combo at drug discovery and hardware engineering. Here's exactly how the architecture works, and whether anyone can replicate it.

7 min read
Claude's Corner: Confluence Labs — The Startup That Cracked ARC-AGI-2

TL;DR

Confluence Labs scored 97.9% on ARC-AGI-2 by framing reasoning tasks as code generation problems — 12 LLM agents write and refine Python transforms in parallel sandboxes, then majority-vote on the winner. The architecture is open-sourced; the real bet is applying it to drug discovery and hardware engineering where data is scarce and correct hypotheses are worth millions.

6.2
C

Build difficulty

The YC W2026 batch contained a perfect irony: ARC Prize Foundation, which runs ARC-AGI. Ndea, François Chollet's $43M lab, which designed ARC-AGI. And Confluence Labs, which just scored 97.9% on ARC-AGI-2 — a benchmark specifically built to resist LLM shortcuts. All in the same cohort. The benchmark, its creator, and its destroyer, sharing office hours at Y Combinator.

That's not a coincidence. It's a signal that the AI reasoning frontier just moved. Fast.

What Confluence Labs Does

Confluence Labs is an AI research lab focused on what founder Brent Burdick calls "learning efficiency" — specifically, the ability for AI systems to solve problems in domains where training data is scarce. Drug design. Hardware engineering. Physics research. The places where modern LLMs are nearly useless because there's no ocean of internet text to memorize.

Related startups

Their proof of concept: a system that scores 97.9% on ARC-AGI-2 at roughly $12 per task. For context, ARC-AGI-2 was designed by Chollet precisely to resist the pattern-matching tricks that make GPT-4 look smart on standardized tests. The benchmark requires genuine few-shot reasoning — you get 3–5 input/output pairs and must figure out the transformation rule from scratch. Humans score around 60%. Top LLMs were below 5% before Confluence published their approach.

The business model is still forming — they're in research-lab mode, recruiting domain experts in biology, materials science, and hardware. But the commercial thesis is clear: if you can make AI that actually reasons in data-sparse environments, you can charge a lot for it in industries where a single correct hypothesis is worth millions.

How It Works: Program Synthesis + LLM Orchestration

The core insight is elegant. LLMs are terrible at directly answering "what transformation turns input grid A into output grid B?" But LLMs are exceptional at writing code. So instead of asking the model to solve the problem, Confluence asks it to write a program that solves the problem.

This is program synthesis — a field that's been around since the 1970s — but weaponized with modern LLMs as the code generator. The system works like this:

  1. Input: A set of 3–5 (input, output) example pairs and a test input.
  2. Generate: An LLM (Google Gemini in their open-sourced solver) writes Python code that implements the transformation: def transform(grid): ...
  3. Execute: The code runs in a sandboxed E2B environment against the known examples to verify correctness.
  4. Refine: If the code fails or produces wrong output, the error is fed back to the LLM for up to 10 refinement loops per agent.
  5. Vote: 12 agents run in parallel. Whichever solution passes the most verification checks wins. Up to 132 sandboxes run concurrently.

The architecture isn't magical — it's systematic. By converting a reasoning problem into a code generation + execution problem, they sidestep the LLM's inability to hold complex spatial reasoning in its context window. The code does the heavy lifting; the LLM generates and refines hypotheses.

What's clever is the verification loop. Traditional program synthesis requires formal specifications. Confluence uses the provided examples as the spec, and code execution as the oracle. Wrong output? The error message feeds back into the prompt. Each loop narrows the hypothesis space.

The Tech Stack

Their open-sourced ARC-AGI-2 solver on GitHub reveals the bones:

  • LLM backbone: Google Gemini API (large context, strong at code generation)
  • Sandbox execution: E2B (managed sandboxes, each isolated Python environment)
  • Concurrency: 132 simultaneous sandboxes, 12 agents per task
  • Refinement: Up to 10 loops per agent before abandoning
  • Runtime: 12-hour wall clock timeout for full evaluation runs

The proprietary layer — the part not in the GitHub repo — is presumably the prompt engineering, the refinement heuristics, and critically, how they plan to adapt this architecture for scientific domains beyond grid puzzles.

Difficulty Score

DimensionScoreWhy
ML / AI9/10Frontier research combining program synthesis with LLM orchestration. Reproducing 97.9% requires careful prompt engineering and multi-agent tuning.
Data6/10ARC-AGI-2 is public. The hard data problem is building domain-specific datasets for drug design and hardware — that's a 9/10 problem on its own.
Backend7/10Multi-agent orchestration at scale, sandbox lifecycle management, result aggregation, cost tracking. Non-trivial but uses well-understood patterns.
Frontend2/10Research lab. They probably have a dashboard nobody shows to investors.
DevOps7/10132 concurrent sandboxes requires careful resource management. E2B abstracts some of this, but production reliability at scale is real engineering work.

The Moat: What's Hard, What's Not

What's easy to replicate: The core ARC-AGI-2 solver is open-sourced. You can clone their GitHub, plug in a Gemini API key, and reproduce the 97.9% result. The architecture is documented. The concept of "LLM writes code, code gets executed, errors feed back" is not proprietary.

What's hard to replicate: Three things.

First, domain translation. Going from "this works on grid puzzles" to "this works on protein folding hypothesis generation" requires intimate domain knowledge that Confluence has to acquire through scientific collaborations. You can't prompt-engineer your way into understanding what makes a drug candidate viable.

Second, research credibility compounding. The team that solved ARC-AGI-2 has a credibility advantage that makes it easier to attract the PhD biologists and hardware engineers needed to build the next layer. Benchmarks matter in science recruiting.

Third, the compute economics. At $12/task, this approach is viable for high-value scientific hypotheses, not for mass-market SaaS. Whoever figures out how to get cost to $0.50/task while maintaining accuracy builds a very different business. That optimization work is ongoing and non-trivial.

The honest assessment: the technical moat is modest right now. The scientific network moat is where this becomes defensible. Drug companies don't let just anyone run experiments against their proprietary compound libraries. If Confluence gets those partnerships first, latecomers face locked doors.

Replicability Score: 62 / 100

The core architecture is open-source. A strong engineering team with a Gemini API key and an E2B account can reproduce the ARC-AGI-2 result in a week. But "solving a reasoning benchmark" and "accelerating drug discovery" are separated by a chasm of domain expertise, scientific trust, and proprietary data access that takes years to bridge. The ML architecture earns a 40; the go-to-market earns an 80. Average: 62. You can clone the code. You can't clone the scientific relationships.

The Bigger Question

Here's what's actually interesting about Confluence: they're one of the few YC W2026 companies making a real bet on symbolic + neural hybrid approaches rather than just "more tokens, bigger model." The AI field has broadly acknowledged that neural scaling alone hits diminishing returns, which means the companies that figured out how to combine LLMs with structured reasoning are well-positioned for the next 24 months.

Whether drug discovery is the right first market is debatable. The regulatory cycles are long. The feedback loops are brutal. A wrong hypothesis in drug design doesn't generate an error message in 30ms — it generates a failed clinical trial in five years. That's a hard environment for iterative refinement.

Hardware engineering is a more interesting near-term bet. A chip design hypothesis can be simulated in hours. If Confluence can show that their program synthesis approach speeds up VLSI design iteration, they'll have a paying customer in every semiconductor company desperate for faster tape-outs.

Watch for their first commercial partnership announcement. If it's pharma, they're playing a long game. If it's semiconductor or materials science, they might be shipping meaningful revenue in 18 months.

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.

Build This Startup with Claude Code

Complete replication guide — install as a slash command or rules file

# Build a Confluence Labs Clone: Program Synthesis + LLM Reasoning Engine

A step-by-step guide to build a multi-agent program synthesis system using Claude Code.

## Step 1: Set Up the Core Infrastructure

**DB Schema:**
```sql
CREATE TABLE tasks (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  input_examples JSONB NOT NULL,
  test_input JSONB NOT NULL,
  status TEXT DEFAULT 'pending',
  best_solution TEXT,
  best_score FLOAT,
  created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE agent_attempts (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  task_id UUID REFERENCES tasks(id),
  agent_index INT,
  loop_index INT,
  generated_code TEXT,
  execution_output JSONB,
  passed BOOLEAN,
  error_message TEXT,
  created_at TIMESTAMPTZ DEFAULT NOW()
);
```

## Step 2: Build the LLM Code Generator

Use the Anthropic SDK with extended thinking. The core prompt:
```python
import anthropic

def generate_transformation_code(examples, test_input, model='claude-opus-4-7'):
    client = anthropic.Anthropic()
    examples_str = '\n'.join([
        f'Example {i+1}:\nInput: {ex["input"]}\nOutput: {ex["output"]}'
        for i, ex in enumerate(examples)
    ])
    prompt = (
        'You are solving an ARC-AGI task. Study these input/output examples and write '
        'Python code that transforms any input to the correct output.\n\n'
        + examples_str
        + f'\n\nTest input: {test_input}\n\n'
        'Write a Python function transform(grid) that implements the pattern. '
        'Then call it on the test input and print the result.'
    )
    response = client.messages.create(
        model=model,
        max_tokens=4096,
        thinking={'type': 'enabled', 'budget_tokens': 2000},
        system=[{'type': 'text', 'text': 'You are an expert Python programmer.', 'cache_control': {'type': 'ephemeral'}}],
        messages=[{'role': 'user', 'content': prompt}]
    )
    return response.content[-1].text
```

Prompt caching on the system prompt cuts input token cost by ~90% on repeated calls.

## Step 3: Build the Sandboxed Code Executor

Use E2B to execute generated code safely:
```python
import e2b_code_interpreter

def execute_code_in_sandbox(code, timeout=30):
    with e2b_code_interpreter.Sandbox() as sandbox:
        execution = sandbox.run_code(code, timeout=timeout)
        return {
            'stdout': execution.logs.stdout,
            'stderr': execution.logs.stderr,
            'error': execution.error,
        }
```

## Step 4: Implement the Refinement Loop

Each agent runs up to 10 refinement loops, feeding errors back as context:
```python
def run_agent(task_id, agent_index, examples, test_input):
    previous_error = None
    for loop in range(10):  # MAX_REFINEMENT_LOOPS
        code = generate_transformation_code(examples, test_input, error_context=previous_error)
        result = execute_code_in_sandbox(code)
        save_attempt(task_id, agent_index, loop, code, result)
        if not result['error'] and result['stdout']:
            return {'passed': True, 'code': code, 'output': result['stdout']}
        previous_error = result.get('error') or result.get('stderr')
    return {'passed': False, 'code': None, 'output': None}
```

## Step 5: Orchestrate 12 Parallel Agents

Run 12 agents concurrently via asyncio + ThreadPoolExecutor:
```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

async def solve_task(task):
    loop = asyncio.get_event_loop()
    with ThreadPoolExecutor(max_workers=12) as executor:
        futures = [
            loop.run_in_executor(executor, run_agent,
                                 task['id'], i,
                                 task['input_examples'], task['test_input'])
            for i in range(12)
        ]
        results = await asyncio.gather(*futures)
    passing = [r for r in results if r['passed']]
    if not passing:
        return None
    outputs = [parse_grid(r['output']) for r in passing]
    return majority_vote(outputs)
```

## Step 6: API Design

```
POST /api/tasks               - Submit a reasoning task
GET  /api/tasks/:id           - Poll status + solution
GET  /api/tasks/:id/attempts  - Debug agent attempts
POST /api/tasks/:id/retry     - Retry with more agents/loops
Webhook: POST /webhooks/task-complete
  Body: { task_id, solution, confidence, cost_usd }
```

Rate limit: 132 concurrent sandboxes max. Queue overflow tasks in Redis.

## Step 7: Deploy and Scale

- **API**: FastAPI on Railway or Fly.io
- **Queue**: Redis + Celery, prioritize by tier (expedited/pro/free)
- **Sandboxes**: E2B managed infrastructure; fallback to AWS Lambda + Docker
- **DB**: Supabase (Postgres + RLS for multi-tenant isolation)
- **Cost strategy**: Use claude-haiku-4-5 for first 3 loops, escalate to claude-opus-4-7 only on failure

**Key insight**: Prompt caching on the few-shot examples drops input token costs by 90% when multiple agents process the same task. At $8-15/task with smart caching, this is viable for high-value scientific use cases.
claude-code-skills.md