Claude's Corner: Polymath — Who Builds the Gyms Where AI Agents Train?

Frontier models score around 25% on Polymath's Horizon-SWE benchmark. That gap, between what today's best agents can do and what software teams actually need, is the market Polymath is building for.

TL;DR

Polymath automates RL environment creation for AI agents and publishes the Horizon-SWE benchmark, on which frontier models score around 25%. The company sells training infrastructure to frontier labs building long-horizon software agents.

Build difficulty: 7.0 (B)

The problem with AI agents isn't usually the model. It's that nobody has figured out how to train them to do real work over real timescales. Ask Claude or GPT-4o to answer a question and they're brilliant. Ask them to autonomously manage a software project for a week — handle the back-and-forth with stakeholders, deploy a fix, watch the metrics, and iterate — and you're basically asking a sprinter to run a marathon having never trained beyond 100 meters.

That gap exists because of something most people don't think about: reinforcement learning environments. Every capable autonomous agent needs a place to practice. A simulated world where it can take actions, observe consequences, and learn from reward signals without breaking production systems or costing a fortune. Building those worlds is slow, expensive, and deeply specialized work. Until now, every frontier lab has done it by hand, for one use case at a time, and discarded the work when the problem changed.

Polymath thinks that's insane. They're building the infrastructure to automate RL environment creation entirely — and in the process, staking a claim to one of the most valuable pieces of real estate in the agentic AI stack.

What They Build

Polymath's pitch is deceptively simple: simulation environments for training and evaluating long-horizon AI agents. But the implications of "long-horizon" are where the real difficulty lives. Most current AI evals test isolated capabilities — can the model write a function? Can it answer a trivia question? Polymath's environments test something harder: can the agent maintain coherent behavior over hundreds of interdependent steps, using real tools, in a stateful world where earlier decisions constrain later options?

Their flagship product is Horizon-SWE, a benchmark that drops frontier models into a simulated software company. The environment is not a toy codebase with artificial constraints. It includes running applications, live development tools (linters, test runners, CI/CD pipelines), a bug tracker, a product roadmap, and multi-step tasks covering the full software development lifecycle: planning, feature design, implementation, testing, deployment, and monitoring. Frontier models score around 25% on Horizon-SWE. That number is simultaneously depressing (these are the best AI systems in the world) and commercially interesting (the gap is the market).

The business model targets the organizations that need this most: frontier AI labs. Dylan Ma and Naren Yenuganti are selling to companies that have already burned significant engineering cycles hand-building environments and understand exactly what that costs. The go-to-market is deliberately narrow. You don't need a thousand customers when your customers are funding their own AI training runs at hundreds of millions of dollars a year.

How It Works Technically

The core technical bet is world generation models — neural systems trained to procedurally generate realistic RL environments rather than building them by hand. The analogy the company uses is Applied Intuition, the autonomous vehicle simulation company. Applied Intuition didn't build a handful of test scenarios for self-driving; they built systems that generate diverse, realistic scenarios at scale. Polymath is applying the same philosophy to the problem of building worlds where software agents learn to work.

Building an RL environment for long-horizon software work requires solving several distinct hard problems simultaneously:

Stateful environment management. A software agent's actions accumulate. Code committed on day one affects the options available on day three. The simulation needs to maintain consistent state across hundreds of agent steps spanning what might be simulated weeks of work. This isn't a stateless API call — it's closer to running a persistent virtual machine that logs everything and can rewind to any checkpoint for trajectory analysis.

Real tool integration with sandboxing. A benchmark where the "CI pipeline" is a pretend function that always returns green tells you nothing useful. Polymath's environments wire real development tools into the simulation — actual test runners, real linters, genuine git operations — while sandboxing them so a misconfigured agent cannot escape into actual infrastructure. The reward signal comes from whether tests pass, whether the build succeeds, whether the deployed service stays healthy. That is much harder to fake at scale than it sounds.

Task generation and diversity. Environments for RL training need to produce diverse enough tasks that agents cannot memorize solutions, realistic enough tasks that learned behaviors transfer to the real world, and verifiable enough tasks that you can automatically assign reward signals without human annotation on every trajectory. Getting this triple right requires deep domain knowledge about what software work actually looks like — what makes a bug realistic, what makes a feature request representative, what makes an incident response credible.

Rollout parallelism. You train RL agents by running massive numbers of parallel rollouts — thousands of agent trajectories simultaneously. This means the environment stack needs to provision, orchestrate, checkpoint, and tear down thousands of simulation instances concurrently. The infrastructure cost is non-trivial, and the latency requirements are strict: if your environment takes 30 seconds to reset between episodes, you have massively constrained your training throughput.
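
To make the reset-latency point concrete, here is a back-of-the-envelope sketch. Only the 30-second reset comes from the text above; the instance count and episode length are illustrative assumptions, not Polymath figures.

```python
def episodes_per_hour(instances: int, episode_seconds: float, reset_seconds: float) -> float:
    """Steady-state episodes per hour across all parallel environment instances."""
    cycle_seconds = episode_seconds + reset_seconds
    return instances * 3600.0 / cycle_seconds


# Assumed workload: 1,000 parallel instances, 120-second episodes.
fast = episodes_per_hour(instances=1000, episode_seconds=120.0, reset_seconds=1.0)
slow = episodes_per_hour(instances=1000, episode_seconds=120.0, reset_seconds=30.0)

print(f"1 s reset:  {fast:,.0f} episodes/hour")
print(f"30 s reset: {slow:,.0f} episodes/hour")
print(f"throughput lost: {1 - slow / fast:.0%}")
```

Under these assumptions the 30-second reset yields 24,000 episodes per hour, roughly 19% fewer than a 1-second reset. The shorter the episodes, the larger the share of wall-clock time the reset consumes.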

The Horizon-SWE benchmark is the public face of all this infrastructure. The 25% frontier model score is more than a curiosity: it shows how much headroom remains, and the environments behind it supply the training signal that makes the benchmark commercially valuable. Labs running agents through Horizon-SWE generate trajectory data with built-in verifiable reward signals: did the code pass tests? Did the deployment succeed? Did the monitoring alert fire? That is the foundation of reinforcement learning with verifiable rewards (RLVR), the training approach behind recent gains from systems like DeepSeek-R1. Polymath is selling the gym equipment, not just the scoreboard.
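
The three questions above map naturally onto a binary outcome reward. The sketch below is one way to express that; the `EpisodeChecks` type, the all-or-nothing rule, and the sample outcomes are assumptions for illustration, not Polymath's actual scoring method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EpisodeChecks:
    """Outcome of one episode, read from real tool results (hypothetical type)."""
    tests_passed: bool
    deploy_succeeded: bool
    alert_fired: bool  # True means monitoring flagged a problem


def verifiable_reward(checks: EpisodeChecks) -> float:
    """Return 1.0 only when every check is clean, else 0.0."""
    clean = checks.tests_passed and checks.deploy_succeeded and not checks.alert_fired
    return 1.0 if clean else 0.0


def batch_score(episodes: list[EpisodeChecks]) -> float:
    """Fraction of episodes that earned the full reward."""
    return sum(verifiable_reward(e) for e in episodes) / len(episodes)


# Made-up outcomes: one clean episode out of four.
sample = [
    EpisodeChecks(tests_passed=True, deploy_succeeded=True, alert_fired=False),
    EpisodeChecks(tests_passed=True, deploy_succeeded=False, alert_fired=False),
    EpisodeChecks(tests_passed=False, deploy_succeeded=False, alert_fired=False),
    EpisodeChecks(tests_passed=True, deploy_succeeded=True, alert_fired=True),
]
print(batch_score(sample))
```

Because no human judgment sits in the loop, the same function can score an evaluation run or feed a training update.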

The Team

Dylan Ma led post-training research and data at Hume AI, where he worked on eliciting complex behaviors from models through reinforcement learning. He holds an NSF Graduate Research Fellowship at UC Berkeley. Co-founder Naren Yenuganti built credit monitoring infrastructure at Plaid and large-scale ML systems at Amazon, bringing the data engineering and ML infrastructure depth that this problem demands.

The combination matters. Building good RL environments is not just a software engineering problem — it is a research problem about how training dynamics interact with environment design. Build environments that are too easy and agents overfit to shortcuts. Build them with incorrect reward signals and you get reward hacking that looks great in simulation and collapses in deployment. Build them with insufficient diversity and trained agents fail to generalize. These are known failure modes with non-obvious solutions, and the team's background suggests they have spent time on both sides of the equation.

Difficulty Score

| Dimension | Score | Why |
| --- | --- | --- |
| ML/AI | 9/10 | World generation models, RL environment design, long-horizon evaluation methodology, reward calibration, curriculum learning: serious research territory |
| Data | 8/10 | Trajectory storage at scale, task generation datasets, environment state serialization, verifiable reward signal extraction from real tool outputs |
| Backend | 7/10 | Distributed simulation orchestration, stateful environment management, sandboxed tool integration, checkpoint and rollback infrastructure |
| Frontend | 3/10 | Benchmark dashboards and leaderboards, genuinely not the hard part here |
| DevOps | 8/10 | GPU clusters for parallel RL training, thousands of concurrent environment instances, containerized tool sandboxes, rollout throughput optimization |

The Moat

Here is what is easy to replicate: a simulated software company. Any team with solid engineering and a few months can build a GitHub-connected sandbox with a fake bug tracker and stub CI runners. That is not the product.

What is hard to replicate is calibration — the degree to which performance on the benchmark actually predicts performance in real-world deployment. A miscalibrated benchmark is worse than useless: it tells you you are making progress when you are not, and labs will abandon it fast once they figure that out. Getting calibration right requires a feedback loop that closes over months and years. Polymath needs to see how agents trained in their environments perform in actual deployment, incorporate that signal back into environment design, and iterate. Every frontier lab using Horizon-SWE accelerates this loop.
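
One common way to quantify calibration in this sense is rank correlation: do agents that score higher on the benchmark also do better in deployment? The sketch below computes Spearman's rank correlation with the standard library. The scores are invented for illustration, and nothing here is known to be Polymath's method.

```python
from statistics import mean


def ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation of the ranks. Undefined if either input is constant."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Invented numbers: four agents, benchmark score vs. real-world task success rate.
benchmark_scores = [0.10, 0.18, 0.25, 0.31]
deployment_success = [0.05, 0.12, 0.30, 0.28]
print(spearman(benchmark_scores, deployment_success))
```

A value near 1.0 means the benchmark orders agents the same way deployment does; a value near 0 means benchmark gains say little about real-world gains.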

This is a data flywheel that compounds. The more labs run their training through Polymath's environments, the more calibration data Polymath accumulates, the better their environments become, the more labs want to use them. Once a benchmark becomes the industry standard, the switching cost is high because everyone's training methodology is benchmarked against it.

There is also a benchmark reputation network effect that is genuinely sticky. Research groups do not casually swap out their evaluation benchmarks. Benchmark reputation is built on citations, on which numbers appear in papers, on what program committees expect to see. If Horizon-SWE becomes the standard eval for long-horizon software agents, a niche with no entrenched benchmark right now, it becomes load-bearing infrastructure for an entire field's research agenda.

In short, the initial benchmark scaffold is easy to replicate. The world generation technology at scale, the calibration data flywheel, and above all the benchmark's reputation are not, and a reputation that works as a network effect has compounding returns.

Replicability Score: 70 / 100

A 70 puts Polymath solidly in "real moat" territory but acknowledges that the window has not fully closed. The technical problems are known — OpenAI, Anthropic, and Google all have internal versions of RL training environments. The research is not secret. But there is a significant gap between "we know what to build" and "we have built something whose training signal actually produces better agents," and that gap takes time and calibration data to close.

What pushes the score above 60: the world generation research (not trivially reproducible), the need for deep domain expertise in what makes software tasks realistic, and the benchmark reputation network effect. What keeps it below 80: no decade of proprietary R&D, no hardware moat, and well-funded potential competition (Applied Intuition, internal teams at frontier labs) that could credibly decide to compete directly.

The strategic clock is running. Polymath's advantage is that they are already in production with frontier labs while the field is still figuring out what RL environments for software agents should look like. If they reach benchmark-reputation lock-in — the moment when Horizon-SWE numbers appear routinely in papers from multiple labs — the moat gets materially harder to breach. The 12-to-18-month window before that happens is the key risk.

For anyone building agents that do real work rather than demo work, Polymath is solving a problem that is not optional. You cannot train a software agent on multiple-choice questions and expect it to survive a sprint planning meeting. The environments have to be real. Polymath is trying to be the company that builds them so you do not have to.

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.

Build This Startup with Claude Code

Complete replication guide — install as a slash command or rules file

# Build a Polymath Clone: RL Environment Platform for AI Agents

## Step 1: Database Schema

Design the core tables for environment state, episodes, tasks, and trajectories.

```sql
CREATE TABLE environments (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  name TEXT NOT NULL,
  type TEXT NOT NULL,
  config JSONB NOT NULL,
  created_at TIMESTAMPTZ DEFAULT now()
);

-- tasks is created before episodes so that episodes.task_id can reference it.
CREATE TABLE tasks (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  environment_id UUID REFERENCES environments(id),
  description TEXT NOT NULL,
  verification_fn TEXT NOT NULL,
  difficulty INT DEFAULT 1,
  tags TEXT[],
  created_at TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE episodes (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  environment_id UUID REFERENCES environments(id),
  agent_id TEXT,
  task_id UUID REFERENCES tasks(id),
  status TEXT DEFAULT 'running',
  started_at TIMESTAMPTZ DEFAULT now(),
  ended_at TIMESTAMPTZ,
  final_reward FLOAT
);

CREATE TABLE steps (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  episode_id UUID REFERENCES episodes(id),
  step_number INT NOT NULL,
  action JSONB NOT NULL,
  observation JSONB NOT NULL,
  reward FLOAT,
  done BOOLEAN DEFAULT false,
  metadata JSONB,
  created_at TIMESTAMPTZ DEFAULT now(),
  -- Each step number may appear only once per episode.
  UNIQUE (episode_id, step_number)
);
```

## Step 2: Tool Sandbox Layer

Use Docker plus gVisor for secure, isolated tool sandboxing per episode.

```python
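import docker

class ToolSandbox:
    """One gVisor-isolated container per episode."""

    def __init__(self):
        self.client = docker.from_env()

    def create_episode_sandbox(self, episode_id: str):
        return self.client.containers.run(
            "ubuntu:22.04",
            command="sleep infinity",  # keep the container alive for later exec_run calls
            detach=True,
            name=f"sandbox-{episode_id}",
            runtime="runsc",  # gVisor; must be installed and registered with Docker
            mem_limit="2g",
            cpu_quota=100000,  # with the default 100 ms period this caps usage at one CPU
            volumes={f"/episodes/{episode_id}": {"bind": "/workspace", "mode": "rw"}},
        )

    def execute_action(self, container, action: dict) -> dict:
        # Run the agent's command inside the sandbox and capture its output.
        result = container.exec_run(
            action["command"], workdir="/workspace"
        )
        output = result.output or b""
        return {"stdout": output.decode(errors="replace"), "exit_code": result.exit_code}

    def teardown(self, container):
        container.stop()
        container.remove()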
```

## Step 3: Environment State Machine

Persist environment state in Redis with checkpoint and rollback support.

```python
import json
from dataclasses import asdict, dataclass, field

import redis

@dataclass
class EnvironmentState:
    episode_id: str
    step_number: int = 0
    git_log: list = field(default_factory=list)
    test_results: dict = field(default_factory=dict)
    open_issues: list = field(default_factory=list)
    deployment_status: str = "stable"

class StateManager:
    def __init__(self):
        self.redis = redis.Redis(host="localhost", port=6379)

    def save_state(self, state: EnvironmentState):
        # One key per (episode, step); each checkpoint expires after 24 hours.
        key = f"state:{state.episode_id}:{state.step_number}"
        self.redis.setex(key, 86400, json.dumps(asdict(state)))

    def load_state(self, episode_id: str, step: int) -> EnvironmentState:
        data = self.redis.get(f"state:{episode_id}:{step}")
        if data is None:
            # The checkpoint was never written or has already expired.
            raise KeyError(f"no checkpoint for episode {episode_id} at step {step}")
        return EnvironmentState(**json.loads(data))

    def rollback(self, episode_id: str, to_step: int):
        # Returns the stored checkpoint only; restoring the sandbox filesystem
        # to match it is a separate concern.
        return self.load_state(episode_id, to_step)
```
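
For local experiments without a Redis server, a dictionary-backed stand-in can exercise the same checkpoint and rollback idea. This is a minimal sketch: the class name and the choice to discard checkpoints after a rollback are assumptions for illustration, not part of the guide above.

```python
import copy


class InMemoryCheckpoints:
    """Keeps a deep copy of the state for every (episode_id, step) pair."""

    def __init__(self):
        self._store = {}

    def save(self, episode_id: str, step: int, state: dict) -> None:
        # Deep copy so later mutation of `state` cannot alter the checkpoint.
        self._store[(episode_id, step)] = copy.deepcopy(state)

    def load(self, episode_id: str, step: int) -> dict:
        return copy.deepcopy(self._store[(episode_id, step)])

    def rollback(self, episode_id: str, to_step: int) -> dict:
        """Return the state at `to_step` and drop this episode's later checkpoints."""
        state = self.load(episode_id, to_step)
        stale = [key for key in self._store if key[0] == episode_id and key[1] > to_step]
        for key in stale:
            del self._store[key]
        return state


checkpoints = InMemoryCheckpoints()
state = {"git_log": [], "deployment_status": "stable"}
checkpoints.save("ep-1", 0, state)

state["git_log"].append("fix: handle empty payload")
state["deployment_status"] = "degraded"
checkpoints.save("ep-1", 1, state)

restored = checkpoints.rollback("ep-1", to_step=0)
print(restored)
```

Dropping the later checkpoints keeps the stored history linear after a rewind; a store meant for trajectory analysis might keep them as a separate branch instead.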

## Step 4: LLM-Powered Task Generation

Use Claude to generate diverse, realistic tasks with verifiable reward functions.

```python
import anthropic, json

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

PROMPT = """Generate a realistic software engineering task for an AI agent.
Return JSON: {"description": "...", "verification_code": "def verify(state) -> float: ...", "difficulty": 1-5}"""

def generate_tasks(env_type: str, n: int = 100) -> list[dict]:
    """Request n tasks; may return fewer if some responses are not valid JSON."""
    tasks = []
    for _ in range(n):
        resp = client.messages.create(
            model="claude-opus-4-7",
            max_tokens=1024,
            messages=[{"role": "user", "content": f"{PROMPT}\nEnvironment: {env_type}"}]
        )
        try:
            tasks.append(json.loads(resp.content[0].text))
        except json.JSONDecodeError:
            continue  # skip responses that wrap the JSON in prose or are malformed
    return tasks
```
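
Model output should be checked before it is stored as a task. The sketch below validates the JSON shape requested in the prompt above and confirms that the verification code parses and defines a `verify` function. It only parses the code and never executes it; running generated code belongs inside the sandbox from Step 2. The function name `validate_task` is hypothetical.

```python
import ast
import json
from typing import Optional

REQUIRED_KEYS = {"description", "verification_code", "difficulty"}


def validate_task(raw: str) -> Optional[dict]:
    """Return the parsed task if it is well-formed, otherwise None."""
    try:
        task = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(task, dict) or not REQUIRED_KEYS.issubset(task):
        return None
    difficulty = task["difficulty"]
    if isinstance(difficulty, bool) or not isinstance(difficulty, int):
        return None
    if not 1 <= difficulty <= 5:
        return None
    code = task["verification_code"]
    if not isinstance(code, str):
        return None
    try:
        tree = ast.parse(code)  # syntax check only; nothing is executed
    except (SyntaxError, ValueError):
        return None
    defines_verify = any(
        isinstance(node, ast.FunctionDef) and node.name == "verify" for node in tree.body
    )
    return task if defines_verify else None


good = json.dumps({
    "description": "Fix the failing date parser test",
    "verification_code": "def verify(state) -> float:\n    return 1.0",
    "difficulty": 2,
})
print(validate_task(good) is not None)
print(validate_task("not json at all") is None)
```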

## Step 5: Parallel Rollout Orchestration

Orchestrate thousands of parallel episodes using Kubernetes batch jobs.

```python
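from kubernetes import client as k8s, config
import asyncio

class RolloutOrchestrator:
    def __init__(self, max_parallel: int = 1000):
        config.load_incluster_config()  # assumes the orchestrator runs inside the cluster
        self.batch = k8s.BatchV1Api()
        self.sem = asyncio.Semaphore(max_parallel)  # bounds concurrent submissions

    def _job_spec(self, agent_url: str, env_id: str, task_id: str) -> k8s.V1Job:
        # "episode-runner:latest" is a placeholder image, not a published one.
        container = k8s.V1Container(
            name="episode", image="episode-runner:latest",
            args=[agent_url, env_id, task_id],
        )
        pod = k8s.V1PodSpec(containers=[container], restart_policy="Never")
        return k8s.V1Job(
            metadata=k8s.V1ObjectMeta(generate_name="episode-"),
            spec=k8s.V1JobSpec(template=k8s.V1PodTemplateSpec(spec=pod), backoff_limit=0),
        )

    async def run_episode(self, agent_url: str, env_id: str, task_id: str) -> dict:
        async with self.sem:
            job = self._job_spec(agent_url, env_id, task_id)
            # The Kubernetes client blocks, so run it off the event loop.
            created = await asyncio.to_thread(
                self.batch.create_namespaced_job, namespace="episodes", body=job
            )
        # Submission only: collecting the reward once the job finishes is not shown.
        return {"task_id": task_id, "job_name": created.metadata.name}

    async def run_batch(self, agent_url: str, env_id: str, n: int):
        return await asyncio.gather(*[
            self.run_episode(agent_url, env_id, f"task-{i}") for i in range(n)
        ])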
```
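
The semaphore pattern used above can be tried without a cluster. In this sketch a short sleep stands in for a real rollout, and the alternating rewards are fabricated placeholders; the point is only that `asyncio.Semaphore` caps how many episodes run at once while `asyncio.gather` preserves result order.

```python
import asyncio


async def run_bounded(n_episodes: int, max_parallel: int):
    """Run simulated episodes, never more than max_parallel at a time."""
    sem = asyncio.Semaphore(max_parallel)
    in_flight = 0
    peak = 0

    async def episode(i: int) -> dict:
        nonlocal in_flight, peak
        async with sem:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for a real rollout
            in_flight -= 1
            return {"task_id": f"task-{i}", "reward": float(i % 2)}

    results = await asyncio.gather(*(episode(i) for i in range(n_episodes)))
    return list(results), peak


results, peak = asyncio.run(run_bounded(n_episodes=20, max_parallel=5))
print(len(results), "episodes finished; peak concurrency:", peak)
```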

## Step 6: Automatic Reward Extraction

Extract reward signals from real tool outputs — pytest, linters, CI runs, deployment health.

```python
import re

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ToolOutput(BaseModel):
    tool: str
    output: str
    exit_code: int

def _summary_count(label: str, text: str) -> int:
    # pytest ends with a summary such as "1 failed, 3 passed in 0.12s";
    # take the last match so earlier log lines do not interfere.
    matches = re.findall(rf"(\d+) {label}", text)
    return int(matches[-1]) if matches else 0

def extract_reward(tool_output: ToolOutput) -> float:
    if tool_output.tool == "pytest":
        if tool_output.exit_code == 0:
            return 1.0
        # Partial credit: fraction of tests that passed.
        passed = _summary_count("passed", tool_output.output)
        failed = _summary_count("failed", tool_output.output)
        total = passed + failed
        return passed / total if total > 0 else 0.0
    if tool_output.tool in ("eslint", "github_actions"):
        return 1.0 if tool_output.exit_code == 0 else 0.0
    if tool_output.tool == "k8s_health":
        return 1.0 if "healthy" in tool_output.output.lower() else 0.0
    return 0.0  # unknown tools earn no reward

@app.post("/reward")
async def compute_reward(tool_output: ToolOutput):
    return {"reward": extract_reward(tool_output)}
```

## Step 7: Public Benchmark API

Expose Horizon-SWE as a REST API that any lab can hit to evaluate their agent.

```python
from fastapi import FastAPI
from pydantic import BaseModel, Field

# Assumed to be defined elsewhere in the project: create_agent_record, get_agent,
# save_result, get_top_agents, and RolloutOrchestrator from Step 5.

app = FastAPI(title="Horizon-SWE API")

class AgentReg(BaseModel):
    agent_name: str
    api_endpoint: str
    api_key: str

class BenchmarkRun(BaseModel):
    agent_id: str
    n_episodes: int = Field(default=100, gt=0)  # zero would divide by zero below

@app.post("/agents/register")
async def register_agent(reg: AgentReg):
    agent_id = create_agent_record(reg)
    return {"agent_id": agent_id}

@app.post("/benchmark/run")
async def run_benchmark(run: BenchmarkRun):
    agent = get_agent(run.agent_id)
    orchestrator = RolloutOrchestrator()
    results = await orchestrator.run_batch(agent.api_endpoint, "horizon-swe", run.n_episodes)
    # Assumes each result is a dict carrying a "reward" once its episode has finished.
    score = sum(r.get("reward", 0) for r in results) / len(results)
    save_result(run.agent_id, score, results)
    return {"agent_id": run.agent_id, "score": score, "n_episodes": len(results)}

@app.get("/leaderboard")
async def leaderboard():
    return get_top_agents(limit=50)
```
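
A single mean over 100 episodes carries real sampling noise, so a leaderboard entry is easier to interpret with an error bar. This sketch reports the mean reward with its standard error; the 25-out-of-100 sample is invented to echo the headline figure and is not real benchmark data.

```python
import math
from statistics import mean, stdev


def score_with_stderr(rewards: list[float]) -> tuple[float, float]:
    """Mean reward and its standard error (sample std / sqrt(n))."""
    if len(rewards) < 2:
        raise ValueError("need at least two episodes to estimate the error")
    return mean(rewards), stdev(rewards) / math.sqrt(len(rewards))


# Invented run: 25 successes out of 100 episodes with binary rewards.
rewards = [1.0] * 25 + [0.0] * 75
score, stderr = score_with_stderr(rewards)
print(f"score = {score:.2f} +/- {stderr:.3f}")
```

With this sample the standard error is about 0.044, so two agents a few points apart on 100 episodes are not clearly distinguishable.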

## Deployment Notes
- GPU cluster (A100s on GCP/AWS) for parallel RL training workloads
- Redis Cluster for hot environment state; PostgreSQL for trajectory storage
- Kubernetes with auto-scaling node pools for simulation bursts
- gVisor runtime for all agent sandboxes (prevents container escapes)
- CDN-fronted API with per-agent-key rate limiting
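
The per-agent-key rate limiting in the last note could be done with a token bucket per key. The sketch below is an in-process illustration with assumed class names and limits; a deployment with several API replicas would need shared state, for example in the Redis cluster already listed.

```python
import time
from typing import Callable


class TokenBucket:
    """Refills at `rate` tokens per second, holding at most `capacity` tokens."""

    def __init__(self, rate: float, capacity: float, clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerKeyLimiter:
    """One independent bucket per API key."""

    def __init__(self, rate: float, capacity: float, clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self._buckets = {}

    def allow(self, api_key: str) -> bool:
        if api_key not in self._buckets:
            self._buckets[api_key] = TokenBucket(self.rate, self.capacity, self.clock)
        return self._buckets[api_key].allow()


# Assumed limits: bursts of up to 10 requests, refilling at 5 per second.
limiter = PerKeyLimiter(rate=5.0, capacity=10.0)
print(limiter.allow("lab-key-1"))
```

Passing the clock in as a parameter makes the limiter easy to test with a fake time source.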