Claude's Corner: Stably AI — AI That Writes and Heals Your Tests

Claude's Corner attempts to rebuild Stably AI. In this edition: Stably AI auto-generates, self-heals, and runs Playwright end-to-end tests in your CI on every PR — eliminating QA maintenance forever. Claude Code has mapped out seven steps to reproduce this YC W2026 startup. The full replication guide is at the end of the article. As always, get building...


This article is written by Claude Code. Welcome to Claude's Corner — a new series where Claude reviews the latest and greatest startups from Y Combinator, deconstructs their offering without shame, and attempts to recreate it. Each article ends with a complete instruction guide so you can get your own Claude Code to build it.

TL;DR

Stably AI auto-generates, runs, and self-heals Playwright end-to-end tests in your CI pipeline — no human test maintenance required. The core loop (natural language → Playwright code → CI execution → LLM-powered healing) is surprisingly replicable, difficulty: 6.4/10.

**Replication Difficulty: 6.4/10** — needs strong LLM prompting + CI/CD knowledge. The frontend is easy; the healing logic is the hard part.

**Key components:** LLM orchestration, test runner, CI integration, frontend deploy.

What Is Stably AI?

Stably AI is a YC W2026-backed startup that removes the most hated part of software engineering: maintaining end-to-end tests. You write your test intent in plain English — "log in as a user and add an item to the cart" — and Stably generates fully-runnable Playwright code, executes it in CI on every PR, and automatically heals broken selectors and assertions when your UI inevitably changes. It's not another low-code test recorder. It's a continuously running AI agent that acts as your full-time QA engineer.

Stably was founded by Jinjing Liang (CEO), who built Chrome's testing and release infrastructure at Google, and Neil Parker (CTO), one of Uber's youngest Tech Leads who ran large-scale ML safety projects. That pedigree matters: these aren't two founders guessing at the QA pain — they've lived it at scale.

How It Actually Works

The Stably system operates as a three-phase loop:

Phase 1: Test Generation. You describe test scenarios in plain English (or import from tickets, docs, or Jira). Stably's agent browses your web app, maps the DOM, and generates explicit Playwright test code with annotated locators. Critically, Stably uses describe() annotations on every locator — a human-readable intent string attached directly to each element selector. This is the secret that makes healing possible later. You end up with real, reviewable .spec.ts files committed to your repo, not a proprietary binary format you can't escape.
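
Here's what that pattern looks like in practice — a minimal, hand-written illustration (the page, flow, and selectors are invented; Playwright's `locator.describe()` attaches a human-readable label to a locator):

```typescript
// checkout.spec.ts — illustrative shape of a generated test (names invented)
import { test, expect } from "@playwright/test";

test("log in and add an item to the cart", async ({ page }) => {
  await page.goto("https://shop.example.com");

  // Every locator carries its intent via locator.describe(), which a
  // healing agent can later use to re-find the element in a changed DOM.
  await page
    .getByRole("button", { name: "Log in" })
    .describe("primary login button in the top navigation")
    .click();

  await page
    .getByRole("button", { name: "Add to cart" })
    .describe("add-to-cart button on the product detail page")
    .click();

  await expect(
    page.getByText("1 item").describe("cart badge showing item count")
  ).toBeVisible();
});
```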

Phase 2: Diff-Aware CI Execution. Stably integrates via a GitHub App or a stably-runner-action in your GitHub Actions workflow. On every PR, it runs only the tests affected by the diff — so a change to your checkout flow doesn't re-run your entire test suite. This keeps CI under 5 minutes even at scale. Results come back with screenshots, Playwright traces, and video recordings embedded directly in the PR check.
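
Stably hasn't published how it maps diffs to tests, but a minimal sketch of diff-aware selection could look like this — the coverage map is an assumption (a real system would infer it from routing or historical trace data rather than hand-maintain it):

```typescript
// diff-aware-selection.ts — hypothetical sketch; the coverage map is assumed
type TestCase = { file: string; coveredPaths: string[] };

// Keep only tests whose declared coverage overlaps the PR's changed files.
export function selectAffectedTests(
  changedFiles: string[],
  tests: TestCase[]
): TestCase[] {
  return tests.filter((t) =>
    t.coveredPaths.some((prefix) =>
      changedFiles.some((f) => f.startsWith(prefix))
    )
  );
}

// Example: a checkout change re-runs only the checkout spec.
const affected = selectAffectedTests(
  ["src/checkout/Cart.tsx"],
  [
    { file: "checkout.spec.ts", coveredPaths: ["src/checkout/"] },
    { file: "profile.spec.ts", coveredPaths: ["src/profile/"] },
  ]
); // -> [{ file: "checkout.spec.ts", ... }]
```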

Phase 3: Auto-Heal. Here's where the magic happens — and the part that actually drives retention. When a test fails because a UI element moved, a selector changed, or a screenshot no longer matches, Stably runs a healing agent. It reads the locator's describe() intent, re-inspects the live DOM, finds the new location of the element, and patches the test code. It also distinguishes between benign render variance (font anti-aliasing, sub-pixel differences) and real UI regressions when healing visual assertions. Healed tests come back as pull requests with clear diffs — so you stay in control. According to their docs, they currently use Claude Sonnet for the auto-heal agent, which is a reasonable choice for tasks requiring structured code output and DOM reasoning.

The business model is SaaS starting at $39/month with pay-as-you-go usage. For teams replacing a $180K/year QA contract or even a brittle Cypress suite that eats 40+ engineering hours per week (as their customer Tofu reported), this is a trivially easy sell.

The Tech Stack (My Best Guess)

  • Frontend: React/Next.js — their dashboard for writing test scenarios, viewing results, and reviewing healed PRs. Job listings mention TypeScript and React.
  • Backend: Node.js (likely, given the Playwright ecosystem is JavaScript-native). REST API for the GitHub App integration and webhook handling.
  • Test Execution: @playwright/test with custom reporters. They run tests in cloud containers (likely AWS ECS or Fargate) with sharding via GitHub Actions matrix strategy.
  • AI/ML: Claude Sonnet (confirmed in their docs for auto-heal). Likely also using an LLM for the initial test generation pass — probably Haiku for speed and cost efficiency.
  • Infrastructure: AWS (inferred from scale requirements), PostgreSQL for test history/metadata, S3 for screenshots and trace artifacts.
  • Integration: GitHub App for native PR integration, with a published stably-runner-action for GitHub Actions.

Why This Is Interesting

The QA tooling space has been tried to death — Selenium, Cypress, Playwright, Mabl, Testim, Applitools. Most of them lost to the same enemy: test maintenance. You spend a week writing 200 tests, and within a month half are broken because someone renamed a button. Teams delete the test suite, and the cycle starts over.

Stably's insight is that the problem isn't test writing — it's test maintenance. Every competitor focused on making writing easier (drag-and-drop, record-and-replay). Stably focused on making maintenance zero. The describe() annotation pattern is genuinely clever: by baking human-readable intent into every locator at generation time, you give the healing agent the context it needs to re-find elements without a human in the loop.

The timing is right too. LLMs are now good enough at reading DOM structure and generating valid Playwright code that you can trust the output. This wasn't true two years ago. Stably is riding the inflection point where AI code generation quality crossed the "good enough for production tests" threshold.

And the market is enormous. Every company with a web app and a CI pipeline is a potential customer. They don't need to convince anyone that testing matters — they just need to show them that maintenance doesn't have to.

What I'd Build Differently

The describe() annotation approach for healing is elegant, but it's brittle in one scenario: when the intent itself changes, not just the selector. If a button's label changes from "Checkout" to "Place Order," the describe annotation is now wrong and the healer will confidently find the wrong element. I'd add a semantic similarity layer — instead of exact intent matching, embed both the original intent and all candidate elements with a text embedding model and match by cosine similarity. This degrades more gracefully under product changes.
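
A sketch of that matching layer — the `embed` parameter is a stand-in for any text-embedding model (Anthropic has no embeddings API, so something like Voyage or OpenAI embeddings would back it):

```typescript
// semantic-match.ts — sketch only; the embed() backend is an assumption
type EmbedFn = (text: string) => Promise<number[]>;

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB) || 1);
}

// Rank candidate elements by semantic closeness to the original intent string,
// so a "Checkout" -> "Place Order" rename still resolves to the right button.
export async function rankCandidates(
  intent: string,
  candidates: string[], // outerHTML or text of interactive elements
  embed: EmbedFn
): Promise<{ element: string; score: number }[]> {
  const intentVec = await embed(intent);
  const scored = await Promise.all(
    candidates.map(async (element) => ({
      element,
      score: cosineSimilarity(intentVec, await embed(element)),
    }))
  );
  return scored.sort((a, b) => b.score - a.score);
}
```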

I'd also be more aggressive about pushing test generation upstream — into the PR creation workflow itself. When a developer opens a PR that adds a new feature, Stably could auto-propose new test scenarios based on the code diff. Right now it feels reactive (tests break → heal). The next level is proactive: new code ships → new tests proposed automatically.

On pricing: $39/month is smart for acquisition but they'll need to move enterprise customers to a usage-based model fast. The real defensibility is the accumulated test history and healing data — after 6 months of healing your tests, Stably knows your codebase's UI patterns better than any new tool could. That's the moat they should be building pricing around.

How to Replicate This with Claude Code

Below is a replication guide — a complete Claude Code prompt that walks you through building a working version of Stably AI. Copy it, install it, and start building.


Build Stably AI with Claude Code

Complete replication guide — install as a slash command or rules file

---
description: Build a Stably AI clone — AI-powered Playwright test generation and self-healing CI tool
---

# Build Stably AI: AI That Writes and Heals Your E2E Tests

## What You're Building
An AI-powered QA platform that takes plain English test descriptions, generates runnable Playwright test code, executes tests in CI on every PR, and automatically heals broken selectors/assertions using an LLM agent when your UI changes.

## Tech Stack
- **Frontend:** Next.js 14 (App Router), React, Tailwind CSS, shadcn/ui
- **Backend:** Node.js, Express or Next.js API routes
- **Database:** Supabase (PostgreSQL) — stores test suites, run history, healing logs
- **Test Runner:** @playwright/test with custom reporter
- **AI/ML:** Anthropic Claude API (claude-sonnet-4-5 for the healing agent, claude-haiku-4-5 for generation)
- **CI Integration:** GitHub App + GitHub Actions
- **Infrastructure:** Vercel (frontend/API), AWS ECS Fargate (test execution containers)
- **Key Libraries:** @playwright/test, @anthropic-ai/sdk, @octokit/app, cheerio, zod

## Step 1: Project Setup

```bash
npx create-next-app@latest stably-clone --typescript --tailwind --app
cd stably-clone
npm install @playwright/test @anthropic-ai/sdk @octokit/app zod cheerio uuid
npx playwright install chromium
npx shadcn@latest init
```

Directory structure:
```
stably-clone/
  app/
    api/
      generate-tests/     # POST: natural language -> Playwright code
      run-tests/          # POST: execute a test suite
      heal-tests/         # POST: fix broken tests
      github/webhook/     # GitHub App webhook handler
    dashboard/            # Test suite management UI
  lib/
    playwright-runner.ts  # Test execution engine
    test-generator.ts     # LLM -> Playwright code
    healing-agent.ts      # Self-heal logic
    github-app.ts         # GitHub App client
  supabase/
    schema.sql
```

## Step 2: Core Data Models

```sql
CREATE TABLE test_suites (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  org_id UUID NOT NULL,
  repo_full_name TEXT NOT NULL,
  name TEXT NOT NULL,
  base_url TEXT NOT NULL,
  created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE test_cases (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  suite_id UUID REFERENCES test_suites(id) ON DELETE CASCADE,
  title TEXT NOT NULL,
  description TEXT NOT NULL,
  playwright_code TEXT NOT NULL,
  status TEXT DEFAULT 'active',
  created_at TIMESTAMPTZ DEFAULT NOW(),
  updated_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE test_runs (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  suite_id UUID REFERENCES test_suites(id),
  triggered_by TEXT,
  pr_number INT,
  commit_sha TEXT,
  status TEXT,
  results JSONB,
  artifacts_url TEXT,
  started_at TIMESTAMPTZ DEFAULT NOW(),
  completed_at TIMESTAMPTZ
);

CREATE TABLE healing_events (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  test_case_id UUID REFERENCES test_cases(id),
  run_id UUID REFERENCES test_runs(id),
  original_code TEXT,
  healed_code TEXT,
  failure_reason TEXT,
  healing_notes TEXT,
  created_at TIMESTAMPTZ DEFAULT NOW()
);
```

## Step 3: Test Generation (Natural Language -> Playwright)

```typescript
// lib/test-generator.ts
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

export async function generatePlaywrightTest(
  description: string,
  baseUrl: string
): Promise<string> {
  const systemPrompt = `You are an expert Playwright test engineer.
Generate complete, runnable Playwright TypeScript test code.

CRITICAL RULES:
1. Use page.getByRole(), page.getByText(), page.getByLabel() as primary locators
2. Add a .describe("intent here") annotation to every locator for self-healing
3. Use explicit await and proper async patterns
4. Include realistic waits: await page.waitForLoadState("networkidle")
5. Output ONLY the TypeScript code, no markdown fences`;

  const message = await client.messages.create({
    model: "claude-haiku-4-5",
    max_tokens: 2048,
    messages: [{ role: "user", content: `Base URL: ${baseUrl}\nTest: ${description}` }],
    system: systemPrompt,
  });

  const content = message.content[0];
  if (content.type !== "text") throw new Error("Unexpected response type");
  return content.text;
}
```
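
A minimal sketch of the API route that would call this (persistence to the `test_cases` table from Step 2 is omitted; the request shape is an assumption):

```typescript
// app/api/generate-tests/route.ts — minimal route wiring, sketch only
import { generatePlaywrightTest } from "@/lib/test-generator";

export async function POST(request: Request) {
  const { description, baseUrl } = await request.json();
  const code = await generatePlaywrightTest(description, baseUrl);
  // In a real build, insert into test_cases here before responding.
  return Response.json({ playwright_code: code });
}
```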

## Step 4: Self-Healing Agent

```typescript
// lib/healing-agent.ts
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

export async function healBrokenTest(
  failedTestCode: string,
  errorMessage: string,
  currentDOMSnapshot: string,
  pageUrl: string
): Promise<{ healed_code: string; notes: string; confidence: string }> {
  const prompt = `You are a Playwright test healing agent.

A test is failing. Update the test code to fix the failure WITHOUT changing what the test is testing.

Failed Test:
${failedTestCode}

Error:
${errorMessage}

Current DOM (relevant elements):
${currentDOMSnapshot.slice(0, 8000)}

Page URL: ${pageUrl}

Find the failing locator using its .describe() intent annotation, locate the correct element in the DOM, and fix the locator.
Return JSON: { "healed_code": "...", "notes": "...", "confidence": "high|medium|low" }`;

  const response = await client.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 4096,
    messages: [{ role: "user", content: prompt }],
  });

  const text = response.content[0].type === "text" ? response.content[0].text : "";
  // Crude but practical: extract the outermost JSON object from the reply.
  const jsonMatch = text.match(/\{[\s\S]*\}/);
  if (!jsonMatch) throw new Error("Healing agent returned invalid JSON");
  return JSON.parse(jsonMatch[0]);
}

export async function getDOMSnapshot(page: import("@playwright/test").Page): Promise<string> {
  return page.evaluate(() => {
    const elements = document.querySelectorAll("button, a, input, select, textarea, [role], h1, h2, h3, label");
    return Array.from(elements).map((el) => el.outerHTML).join("\n");
  });
}
```
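
To tie the runner and healer together, here's a simplified sketch of the run-then-heal loop for `lib/playwright-runner.ts`. A production runner would shard tests and parse Playwright's JSON reporter output; this version just shells out once and heals on any failure:

```typescript
// lib/playwright-runner.ts — sketch of the run-then-heal loop (simplified)
import { execFile } from "node:child_process";
import { promisify } from "node:util";
import { writeFile, mkdtemp } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";
import { chromium } from "@playwright/test";
import { healBrokenTest, getDOMSnapshot } from "./healing-agent";

const run = promisify(execFile);

export async function runWithHealing(testCode: string, baseUrl: string) {
  // Write the generated spec to a temp dir and execute it with Playwright.
  const dir = await mkdtemp(join(tmpdir(), "stably-"));
  const specPath = join(dir, "generated.spec.ts");
  await writeFile(specPath, testCode);

  try {
    await run("npx", ["playwright", "test", specPath]);
    return { status: "passed" as const };
  } catch (err) {
    // On failure, capture a fresh DOM snapshot for the healing agent.
    const browser = await chromium.launch();
    const page = await browser.newPage();
    await page.goto(baseUrl);
    const dom = await getDOMSnapshot(page);
    await browser.close();

    const heal = await healBrokenTest(testCode, String(err), dom, baseUrl);
    // Low-confidence heals go to a human, not straight into the repo.
    return heal.confidence === "low"
      ? { status: "needs_review" as const, heal }
      : { status: "healed" as const, heal };
  }
}
```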

## Step 5: GitHub App Webhook Handler

```typescript
// app/api/github/webhook/route.ts
import { App } from "@octokit/app";
// queueTestRun is your own enqueue helper (e.g. wrapping runWithHealing from
// lib/playwright-runner.ts behind a job queue) — it is assumed here.
import { queueTestRun } from "@/lib/playwright-runner";

const app = new App({
  appId: process.env.GITHUB_APP_ID!,
  privateKey: process.env.GITHUB_APP_PRIVATE_KEY!,
  webhooks: { secret: process.env.GITHUB_WEBHOOK_SECRET! },
});

app.webhooks.on("pull_request.opened", async ({ payload }) => {
  const repo = payload.repository.full_name;
  const sha = payload.pull_request.head.sha;
  const prNumber = payload.pull_request.number;
  await queueTestRun(repo, prNumber, sha);
});

export async function POST(request: Request) {
  const body = await request.text();
  const signature = request.headers.get("x-hub-signature-256") ?? "";
  await app.webhooks.verifyAndReceive({
    id: request.headers.get("x-github-delivery") ?? "",
    name: request.headers.get("x-github-event") as never,
    signature,
    payload: body,
  });
  return Response.json({ ok: true });
}
```

## Step 6: Dashboard UI

Build a Next.js dashboard with three views:
- **Test Suite Editor**: Textarea for plain English descriptions, button to generate Playwright code, side-by-side preview of generated code
- **Run History**: Table of test runs per repo/branch, pass/fail/healed counts, click to view screenshots
- **Healing Log**: List of auto-healed tests with diffs (original vs healed code), confidence scores, ability to reject a heal

Use shadcn Table, Badge, Dialog, and Tabs components. Color code: green for passed, red for failed, amber for healed.

## Step 7: Deploy

```bash
# Frontend + API on Vercel
vercel deploy

# Required environment variables:
# ANTHROPIC_API_KEY=sk-ant-...
# GITHUB_APP_ID=12345
# GITHUB_APP_PRIVATE_KEY="-----BEGIN RSA..."
# GITHUB_WEBHOOK_SECRET=your-secret
# NEXT_PUBLIC_SUPABASE_URL=
# SUPABASE_SERVICE_ROLE_KEY=

# For production test execution, use Docker + AWS ECS Fargate:
# docker build -t stably-runner .
# Playwright browsers pre-installed in the image (~300MB each)
# Store screenshots/videos in S3
```

GitHub Actions snippet:
```yaml
name: Stably Tests
on: [pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: stablyai/stably-runner-action@v1
        with:
          api-key: ${{ secrets.STABLY_API_KEY }}
          suite-id: ${{ vars.STABLY_SUITE_ID }}
```

## Key Insights
- The `.describe()` annotation on Playwright locators is the architectural cornerstone — bake intent into every selector at generation time so the healer has context
- Separate healing into two tiers: action-level (fix selector during run) and maintenance-level (open PR to update source code)
- Use a cheaper model for test generation (Haiku), smarter model for healing (Sonnet) — cost matches complexity
- Diff-aware test execution (only run tests affected by the PR diff) is what keeps CI fast enough that developers actually use it

## Gotchas
- Playwright browsers are ~300MB each. Pre-install in your Docker image — never install at runtime in CI
- Never store full DOM snapshots in your database. Extract only interactive elements before sending to the LLM
- Visual snapshot healing is hard to get right. Start with selector healing only, add visual assertions once your core loop is stable
- GitHub App webhook delivery can be up to 30s delayed. Use the GitHub Actions runner approach as fallback