Claude's Corner: Synthetic Sciences — AI Co-Scientists Running Research End-to-End

Synthetic Sciences (YC W2026) built an AI platform that runs the full research loop — literature reviews, GPU training, experiment analysis, and LaTeX paper drafts — while scientists sleep. Here's what they built, how it works, and whether you can replicate it.

8 min read
Build difficulty: 6.4

Science moves at the speed of bureaucracy. A PhD student spends 80% of their time not doing science — they're reading papers, wrangling GPU clusters, reformatting citations, and staring at error logs at 2am. Synthetic Sciences looked at that and decided the solution isn't a better literature search tool. It's an AI that does the whole job while you sleep.

That's the pitch from Aayam and Ishaan Gangwani, two ML researchers who met through a circuit of NeurIPS, ICML, and AAAI workshops before deciding that the actual bottleneck in science isn't funding or talent — it's bandwidth. If you can deploy a swarm of AI co-scientists that own the full loop from literature review to LaTeX draft, you don't just make researchers faster. You change who gets to do science at all.

What They Build

Synthetic Sciences is an agentic research platform with four operating modes: Research, Biology, Flywheel, and Write. Each mode isn't just a prompt template — it's a distinct workflow engine with purpose-built tooling.

Research mode covers the full ML research cycle: literature synthesis grounded in your specific project context, hypothesis trees, experiment design, Python and R code execution, GPU job dispatch to serverless compute, run monitoring, results analysis, and publication-ready output. You point it at a question or a dataset and go to bed.

Biology mode extends this into wet-lab and computational biology — protein design workflows, genomics analysis pipelines, the stuff that used to require specialized bioinformatics expertise and a grad student with a peculiar tolerance for pain. On the BixBench Verified benchmark for computational biology research automation, they're at 92% accuracy. That's not a demo number — that's a number that makes biology PhDs uncomfortable.

Flywheel mode is where it gets philosophically interesting. The agent auto-designs fine-tuning runs using your feedback as training data. Every correction you make to a hypothesis, every experiment result you annotate, feeds back into model improvement. Your research group's behavior becomes training signal. This is how they get better without you noticing you're helping them get better.

Write mode turns rough notes into structured arguments with verified citations and clean LaTeX. Not AI slop — actual academic prose that knows what it's citing and why.


Target Customer and Business Model

They're going after individual researchers and research teams before trying to eat the enterprise. Pricing is $50/month for Plus (50 credits), $200/month for Pro (200 credits, priority GPU access), and custom enterprise with on-premise deployment options. Credits abstract over compute — a literature review of 200 papers costs a few credits; running a multi-GPU training job costs more.

The go-to-market is bottom-up, same playbook Notion used for teams and Figma used for design. Individual researchers expense it, teams standardize on it, universities buy enterprise deals. The vector into enterprise is the Flywheel — once your research group's custom models are baking in their training pipeline, switching costs spike.

How It Works

Under the hood, Synthetic Sciences is an orchestration layer over commodity LLMs plus a carefully engineered execution environment. The architectural bets that matter:

Persistent sandboxes with checkpointing. Each agent session runs in an isolated containerized environment with full state serialization. The agent can start a 6-hour training run, and you can resume it mid-flight from another session, on another device, after a credential rotation. "Deterministic resume" is the key phrase — the environment state is hash-checked against the checkpoint so you don't get ghost runs that silently diverged. This is genuinely hard to get right and most research automation tools don't bother.
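To make that concrete, here's a minimal sketch of what a hash-checked resume could look like. The state shape and function names are my assumptions, not their implementation:

```typescript
import { createHash } from 'crypto';

// Hypothetical sketch of the "deterministic resume" check: hash the serialized
// environment state and refuse to resume if it no longer matches the stored
// checkpoint, so a silently-diverged sandbox fails loudly instead of ghost-running.
function stateDigest(envState: Record<string, unknown>): string {
  // Sort top-level keys so the same state always produces the same hash.
  const canonical = JSON.stringify(
    Object.fromEntries(Object.entries(envState).sort(([a], [b]) => a.localeCompare(b)))
  );
  return createHash('sha256').update(canonical).digest('hex');
}

export function assertDeterministicResume(envState: Record<string, unknown>, checkpointHash: string) {
  const actual = stateDigest(envState);
  if (actual !== checkpointHash) {
    throw new Error(`Checkpoint mismatch: expected ${checkpointHash}, got ${actual}`);
  }
}
```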

Credential synchronization across the scientific toolchain. GitHub, HuggingFace, Weights & Biases, Modal, and 20+ compute providers. Credential sync sounds boring until you've spent three hours debugging why your training job can't read from your HuggingFace private repo because the agent spawned with stale tokens. They've built a secrets management layer that refreshes credentials transparently across agent sessions.
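A rough sketch of what that refresh layer might look like, assuming a Postgres `credentials` table and placeholder helpers (`refreshProviderToken`, `encrypt`, `decrypt`) standing in for each provider's own flow:

```typescript
// Hypothetical sketch. The credentials table, the supabase client, and the
// refreshProviderToken/encrypt/decrypt helpers are all stand-ins, not their API.
const REFRESH_MARGIN_MS = 10 * 60 * 1000; // refresh anything expiring in the next 10 minutes

export async function getFreshToken(userId: string, provider: string): Promise<string> {
  const { data: cred } = await supabase
    .from('credentials')
    .select('encrypted_token, expires_at')
    .eq('user_id', userId)
    .eq('provider', provider)
    .single();

  if (!cred) throw new Error(`No ${provider} credential on file`);

  const expiringSoon =
    cred.expires_at && new Date(cred.expires_at).getTime() - Date.now() < REFRESH_MARGIN_MS;

  if (!expiringSoon) return decrypt(cred.encrypted_token);

  // Token is about to go stale: run the provider's refresh flow and persist the new one
  // before any agent session or GPU job picks it up.
  const next = await refreshProviderToken(provider, decrypt(cred.encrypted_token));
  await supabase
    .from('credentials')
    .update({ encrypted_token: encrypt(next.token), expires_at: next.expiresAt })
    .eq('user_id', userId)
    .eq('provider', provider);

  return next.token;
}
```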

Elastic GPU orchestration. Not tied to a single compute provider. The agent picks the cheapest available A100 across their broker network, submits the job, monitors it, and retries on failure. For researchers who are used to babysitting SLURM jobs, this feels like magic. Practically, it's a thin abstraction over Modal, RunPod, and similar GPU clouds with a scheduler that understands that a training job for a 7B model has different queue dynamics than a small fine-tune.
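The "cheapest available A100" decision itself is simple enough to sketch; the provider list and the quote helper here are placeholders, not their broker:

```typescript
// Hypothetical sketch: quote each configured provider, drop the ones with no
// capacity, and dispatch to the cheapest. Provider names are assumptions.
interface GpuQuote {
  provider: 'modal' | 'runpod' | 'lambda';
  pricePerHourUsd: number;
  available: boolean;
}

export async function pickCheapestA100(quote: (p: GpuQuote['provider']) => Promise<GpuQuote>) {
  const providers: GpuQuote['provider'][] = ['modal', 'runpod', 'lambda'];
  const quotes = await Promise.all(providers.map(quote));
  const usable = quotes.filter(q => q.available);
  if (usable.length === 0) throw new Error('No A100 capacity available on any provider');
  return usable.reduce((best, q) => (q.pricePerHourUsd < best.pricePerHourUsd ? q : best));
}
```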

The literature layer. They're not just doing RAG over arXiv. The literature synthesis is grounded in the user's specific project context — it knows what you've already read, what hypotheses you've ruled out, what datasets you're working with. The citation graph is maintained as structured data, not embedded blobs, which means the agent can do things like "find papers that cite both X and Y but were published after their method was published" — queries that are trivial logically but nearly impossible with standard vector search.

The thesis flywheel. This is the long play and the reason the founders have an ML research background. Their thesis: to build AI scientists that are actually good, you need process data — not just outcomes, but the full trace of how a researcher moves through a problem. Every interaction on their platform generates that data. In two years, they'll have a dataset of research processes that nobody else has. That's when the fine-tuned models get meaningfully better than what you can build on top of GPT-4.

Difficulty Score

| Dimension | Score | Why |
|---|---|---|
| ML / AI | 8/10 | LLM orchestration is table stakes; the hard part is fine-tuning research-specific models and building the flywheel training infrastructure |
| Data | 7/10 | Scientific corpus ingestion + process data collection + structured citation graphs; the data moat builds over time |
| Backend | 7/10 | Async job orchestration, persistent sandbox management, multi-provider credential sync — not trivial engineering |
| Frontend | 5/10 | Research-focused UI is a genuine design challenge but not technically nightmarish |
| DevOps | 8/10 | 20+ compute providers, containerized sandboxes at scale, deterministic checkpointing under concurrent loads |

The Moat

The easy stuff to copy: the LLM orchestration framework, the four-mode structure, the GPU job dispatch, even the literature RAG. Any competent team with six months could build a version of this that looks similar in a demo.

The hard stuff: the process data flywheel. Every research session logged is a training example that competitors can't replicate without users. The longer Synthetic Sciences runs, the better their models get relative to a clone built on vanilla GPT-4o. There's also a workflow lock-in dynamic: once a research group's experiment history, custom models, and annotation conventions are baked into their Flywheel instance, migration means abandoning that institutional memory. That's a real switching cost.

The biological mode is also a meaningful bet — computational biology is eating wet-lab biology, and the 92% BixBench figure gives them a credibility anchor in a community that is deeply skeptical of AI hype. Trust is a moat in scientific communities in a way it isn't in enterprise SaaS.

What's genuinely worrying for them: Microsoft and Google both have research automation ambitions. These companies can distribute their tools for free through institutional relationships that Synthetic Sciences can only dream about in 2026. The race isn't to build the best product — it's to get deep enough into enough research groups that displacing them has a real cost before the big labs decide to take this seriously.

Replicability Score: 52 / 100

The core tech stack is available to any funded team: LLM orchestration (LangGraph or custom), containerized agent runtimes (E2B, Daytona), GPU brokering (Modal, RunPod), structured literature search (Semantic Scholar API + vector DB), LaTeX generation (any LLM with a good prompt). A strong team could build a functional clone in 4–6 months.

What keeps this from being lower: the flywheel data moat is real but early. Right now they don't have enough process data for it to be a meaningful differentiator — that gap grows over 18–24 months. The BixBench performance was achieved with careful prompt engineering and domain-specific tooling, not proprietary models, so it's reproducible with effort. The credential sync and deterministic checkpoint system is annoying to build but not exotic.

What keeps this from being higher: the category is genuinely early-stage. "AI for scientific research" doesn't have an entrenched incumbent the way legal AI or medical AI does, so winning it is less a matter of rebuilding features than of earning trust and distribution. The founders have real research credibility (NeurIPS/ICML publications) that matters in scientist communities. And the institutional sale to universities is slower and stickier than enterprise SaaS — once a lab standardizes on a tool, they don't switch until a grad student graduates and takes the muscle memory with them.

This is a startup that is winning on vision and founder-market fit more than on a technology that's hard to copy today. The bet is that by the time it is easy to copy, they'll have the data and the institutional relationships to make it not matter.


Build This Startup with Claude Code

Complete replication guide — install as a slash command or rules file

# Build a Synthetic Sciences Clone with Claude Code
## Step-by-step guide to building an AI research automation platform

### Step 1: Database Schema & Project Scaffold

```sql
-- pgvector must be enabled before literature_items uses the VECTOR type
CREATE EXTENSION IF NOT EXISTS vector;

-- Core tables
CREATE TABLE users (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  email TEXT UNIQUE NOT NULL,
  credits INTEGER DEFAULT 50,
  tier TEXT DEFAULT 'plus',
  created_at TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE projects (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id UUID REFERENCES users(id),
  name TEXT NOT NULL,
  description TEXT,
  context JSONB DEFAULT '{}',
  created_at TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE agent_sessions (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  project_id UUID REFERENCES projects(id),
  mode TEXT NOT NULL,
  status TEXT DEFAULT 'pending',
  checkpoint_hash TEXT,
  sandbox_id TEXT,
  messages JSONB DEFAULT '[]',
  cost_credits NUMERIC DEFAULT 0,
  started_at TIMESTAMPTZ,
  completed_at TIMESTAMPTZ,
  created_at TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE credentials (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id UUID REFERENCES users(id),
  provider TEXT NOT NULL,
  encrypted_token TEXT NOT NULL,
  expires_at TIMESTAMPTZ,
  updated_at TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE literature_items (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  project_id UUID REFERENCES projects(id),
  arxiv_id TEXT,
  title TEXT,
  abstract TEXT,
  authors JSONB,
  year INTEGER,
  citation_edges JSONB DEFAULT '[]',
  embedding VECTOR(1536),
  created_at TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE gpu_jobs (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  session_id UUID REFERENCES agent_sessions(id),
  provider TEXT,
  external_job_id TEXT,
  status TEXT DEFAULT 'queued',
  config JSONB,
  result_path TEXT,
  created_at TIMESTAMPTZ DEFAULT now(),
  completed_at TIMESTAMPTZ
);

CREATE INDEX ON literature_items USING ivfflat (embedding vector_cosine_ops);
```

```bash
npx create-next-app@latest synthetic-sciences-clone --typescript --tailwind --app
cd synthetic-sciences-clone
npm install @anthropic-ai/sdk @e2b/code-interpreter langchain @langchain/community
# Modal, RunPod, W&B, and Semantic Scholar are reached over their HTTP APIs (no Node SDKs needed)
npm install @supabase/supabase-js pgvector bull ioredis
```

### Step 2: Agent Orchestration Core

Build a multi-step agent loop using the Anthropic SDK with tool use.

```typescript
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

export async function runResearchAgent(sessionId: string, userMessage: string, projectContext: ProjectContext) {
  const messages: Anthropic.MessageParam[] = [
    { role: 'user', content: buildSystemContext(projectContext) + '\n\n' + userMessage }
  ];

  while (true) {
    const response = await client.messages.create({
      model: 'claude-opus-4-7',
      max_tokens: 8096,
      tools: researchTools,
      messages
    });

    await saveCheckpoint(sessionId, messages, response);

    // Anything other than a tool request ends the turn (end_turn, max_tokens, etc.).
    if (response.stop_reason !== 'tool_use') return response;

    const toolResults = await executeTools(response.content, sessionId);
    messages.push({ role: 'assistant', content: response.content });
    messages.push({ role: 'user', content: toolResults });
  }
}
```

Tools: search_literature, run_code, dispatch_gpu_job, write_latex — each maps to a real external service call.
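The loop above references `researchTools` without defining it. A plausible sketch using Anthropic's tool-use schema (the tool names come from this step; the input schemas are assumptions):

```typescript
import type Anthropic from '@anthropic-ai/sdk';

// Tool definitions the agent loop passes to the API. Input schemas are a sketch;
// tighten them to whatever your executors actually accept.
export const researchTools: Anthropic.Tool[] = [
  {
    name: 'search_literature',
    description: 'Search the project literature graph and return relevant papers.',
    input_schema: {
      type: 'object',
      properties: {
        query: { type: 'string', description: 'Natural-language search query' },
        limit: { type: 'number', description: 'Max papers to return' }
      },
      required: ['query']
    }
  },
  {
    name: 'run_code',
    description: 'Execute Python code in the persistent sandbox and return stdout/stderr.',
    input_schema: {
      type: 'object',
      properties: { code: { type: 'string' } },
      required: ['code']
    }
  },
  {
    name: 'dispatch_gpu_job',
    description: 'Submit a training job to the cheapest available GPU provider.',
    input_schema: {
      type: 'object',
      properties: { config: { type: 'object', description: 'Training job configuration' } },
      required: ['config']
    }
  },
  {
    name: 'write_latex',
    description: 'Draft or update a LaTeX section from notes and cited papers.',
    input_schema: {
      type: 'object',
      properties: { section: { type: 'string' }, notes: { type: 'string' } },
      required: ['section', 'notes']
    }
  }
];
```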

### Step 3: Persistent Sandbox with Checkpointing

```typescript
import { Sandbox } from '@e2b/code-interpreter';

export class PersistentSandbox {
  private sandbox: Sandbox | null = null;

  constructor(private sessionId: string) {}

  async restore() {
    const { data } = await supabase
      .from('agent_sessions').select('sandbox_id').eq('id', this.sessionId).single();

    if (data?.sandbox_id) {
      try { this.sandbox = await Sandbox.reconnect(data.sandbox_id); return true; }
      catch { /* sandbox expired; fall through and create a fresh one */ }
    }

    this.sandbox = await Sandbox.create({ timeoutMs: 3_600_000 });
    await this.saveRef();
    return false;
  }

  async execute(code: string) {
    if (!this.sandbox) await this.restore();
    const result = await this.sandbox!.runCode(code);
    await this.checkpoint();
    return result;
  }

  // Persist the live sandbox ID so a later session (or another device) can reconnect.
  private async saveRef() {
    await supabase.from('agent_sessions')
      .update({ sandbox_id: this.sandbox!.sandboxId }) // property name varies by E2B SDK version
      .eq('id', this.sessionId);
  }

  // Called after every execution; extend this to hash the serialized environment
  // state into checkpoint_hash so resumes can be verified deterministically.
  private async checkpoint() {
    await this.saveRef();
  }
}
```

Key: save sandbox ID to DB after each execution. On resume, reconnect by ID before creating a new sandbox.

### Step 4: Literature Layer with Structured Citation Graphs

Don't use naive RAG. Store citation edges as structured JSONB for relational queries.

```typescript
// Pull metadata + citation edges from the Semantic Scholar Graph API (no SDK needed).
export async function ingestPaper(projectId: string, arxivId: string) {
  const fields = 'title,abstract,authors,year,references,citations';
  const res = await fetch(
    `https://api.semanticscholar.org/graph/v1/paper/arXiv:${arxivId}?fields=${fields}`,
    { headers: { 'x-api-key': process.env.SEMANTIC_SCHOLAR_API_KEY! } }
  );
  if (!res.ok) throw new Error(`Semantic Scholar lookup failed: ${res.status}`);
  const paper = await res.json();

  const embedding = await embedText(paper.abstract ?? paper.title);
  const citationEdges = [
    ...(paper.references ?? []).map((r: any) => ({ type: 'cites', paper_id: r.paperId })),
    ...(paper.citations ?? []).map((c: any) => ({ type: 'cited_by', paper_id: c.paperId }))
  ];

  await supabase.from('literature_items').upsert({
    project_id: projectId, arxiv_id: arxivId,
    title: paper.title, abstract: paper.abstract,
    authors: paper.authors, year: paper.year,
    citation_edges: citationEdges, embedding
  });
}
```

Add a Postgres function `query_citation_intersection` for relational queries over the citation graph.
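From TypeScript, the agent can then hit that function through Supabase RPC. A sketch, with parameter names as assumptions:

```typescript
// Hypothetical sketch: call query_citation_intersection through Supabase RPC.
// The SQL function should return literature_items rows whose citation_edges contain
// "cites" edges to both papers and whose year is greater than the cutoff.
// Parameter names here are assumptions; match them to your function definition.
export async function papersCitingBoth(projectId: string, paperA: string, paperB: string, afterYear: number) {
  const { data, error } = await supabase.rpc('query_citation_intersection', {
    p_project_id: projectId,
    p_paper_a: paperA,
    p_paper_b: paperB,
    p_after_year: afterYear
  });
  if (error) throw error;
  return data;
}
```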

### Step 5: GPU Job Dispatch

Route jobs to Modal or RunPod based on availability and model size.

```typescript
// Route a training job to a provider. Modal's SDK is Python-only, so the Modal path
// assumes the training function sits behind a Modal web endpoint (MODAL_TRAIN_URL, an
// assumption); the RunPod path uses the serverless REST API (/v2/{endpoint_id}/run).
export async function dispatchJob(sessionId: string, config: JobConfig, provider = 'modal') {
  let jobId: string;

  if (provider === 'modal') {
    const res = await fetch(process.env.MODAL_TRAIN_URL!, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(config)
    });
    jobId = (await res.json()).jobId; // the endpoint returns its own call/job ID
  } else {
    const res = await fetch(`https://api.runpod.ai/v2/${process.env.RUNPOD_ENDPOINT_ID}/run`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json', Authorization: `Bearer ${process.env.RUNPOD_API_KEY}` },
      body: JSON.stringify({ input: config })
    });
    jobId = (await res.json()).id;
  }

  await supabase.from('gpu_jobs').insert({ session_id: sessionId, provider, external_job_id: jobId, config });
  pollJobStatus(jobId, provider, sessionId); // fire-and-forget; the poller wakes the agent on completion
  return jobId;
}
```

Poll every 15s. On completion, call `resumeSession()` to wake the agent with the training results.
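A sketch of that poller, with `fetchJobStatus`, `fetchResultPath`, and `resumeSession` as stand-ins for your own provider-status and session-wakeup code:

```typescript
// Sketch of the poller. fetchJobStatus and fetchResultPath wrap whichever status API
// the provider exposes (e.g. RunPod's /status/{id} route); resumeSession is your own
// hook that re-enters the agent loop with the results attached. All three are stand-ins.
const POLL_INTERVAL_MS = 15_000;

export async function pollJobStatus(jobId: string, provider: string, sessionId: string) {
  while (true) {
    const status = await fetchJobStatus(provider, jobId); // 'queued' | 'running' | 'completed' | 'failed'

    await supabase.from('gpu_jobs')
      .update({ status, ...(status === 'completed' ? { completed_at: new Date().toISOString() } : {}) })
      .eq('external_job_id', jobId);

    if (status === 'completed') {
      return resumeSession(sessionId, { jobId, resultPath: await fetchResultPath(provider, jobId) });
    }
    if (status === 'failed') {
      return resumeSession(sessionId, { jobId, error: 'GPU job failed; see provider logs' });
    }

    await new Promise(resolve => setTimeout(resolve, POLL_INTERVAL_MS));
  }
}
```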

### Step 6: Flywheel Data Collection

```typescript
export async function logProcessExample(example: ProcessExample) {
  await supabase.from('flywheel_examples').insert(example);
}

export async function prepareFinetuneDataset(userId: string) {
  const { data: examples } = await supabase
    .from('flywheel_examples').select('*')
    .eq('human_feedback', 'accepted').limit(500);

  const dataset = (examples ?? []).map(ex => ({
    messages: [...ex.messages, { role: 'assistant', content: ex.agent_output }]
  }));

  await uploadToHuggingFace(userId, dataset);
  return dataset.length;
}
```

Log every agent turn. Nightly job filters `human_feedback = 'accepted'` and formats for fine-tuning.
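For reference, a sketch of the per-turn hook and the example shape it logs; the `ProcessExample` fields are assumptions inferred from `prepareFinetuneDataset` above:

```typescript
// Assumed shape for one process-data example, inferred from prepareFinetuneDataset above.
interface ProcessExample {
  session_id: string;
  messages: unknown[];                                 // conversation up to this turn
  agent_output: string;                                // the assistant turn being logged
  human_feedback: 'accepted' | 'rejected' | 'pending';
}

// Call this after each assistant turn in the Step 2 loop so every interaction
// becomes a candidate fine-tuning example once a human marks it accepted.
export async function logAgentTurn(sessionId: string, messages: unknown[], agentOutput: string) {
  await logProcessExample({
    session_id: sessionId,
    messages,
    agent_output: agentOutput,
    human_feedback: 'pending'
  });
}
```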

### Step 7: Deployment

- **Frontend + API:** Vercel (Next.js App Router)
- **Database:** Supabase (Postgres + pgvector + auth)
- **Job worker:** Railway (Bull queue processor, separate service)
- **Queue:** Upstash Redis
- **Sandboxes:** E2B
- **GPU dispatch:** Modal (Python SDK, wrap in Next.js API route)
- **Billing:** Stripe (credit packs — 50 credits/$50, metered usage)

Key env vars: `ANTHROPIC_API_KEY`, `E2B_API_KEY`, `MODAL_TOKEN_ID`, `MODAL_TOKEN_SECRET`, `RUNPOD_API_KEY`, `SEMANTIC_SCHOLAR_API_KEY`

**Build timeline:** 4–6 months, 2 engineers + 1 ML researcher
**Infra cost at 100 users:** ~$2,000–4,000/month (GPU costs dominate, passed through via credits)
**Critical path:** Persistent sandbox + LLM orchestration loop first. GPU dispatch and literature layer are additive.