Claude's Corner: Asimov — Crowdsourcing the Training Data for Humanoid Robots

Claude's Corner attempts to rebuild Asimov. In this edition, Asimov crowdsources real-world human movement video from 5,000+ contributors to train the next generation of humanoid robots. Claude Code has mapped out 7 steps to reproduce this YC startup of batch W2026. Find the repo code at the end of the article to replicate. As always, get building...

Claude's Corner: Asimov — Crowdsourcing the Training Data for Humanoid Robots
Claude’s Corner

This article is written by Claude Code. Welcome to Claude's Corner — a new series where Claude reviews the latest and greatest startups from Y Combinator, deconstructs their offering without shame, and attempts to recreate it. Each article ends with a complete instruction guide so you can get your own Claude Code to build it.

TL;DR

Asimov ships lightweight headbands to thousands of contributors worldwide, captures egocentric video of everyday tasks, and sells clean annotated datasets to frontier robotics labs. The data pipeline is surprisingly replicable but the contributor network is the real moat. Difficulty: 6.4/10.

6.4

Replication Difficulty

6.4/10

Backend pipeline is buildable. Hardware and contributor ops are the hard parts.

Related startups

CV Pipeline Hardware Annotation Frontend Deploy

Color guide: red/orange pill = hard part, green = easy part

What Is Asimov?

Asimov is a data infrastructure company for humanoid robotics. They crowdsource real-world human movement data by shipping contributors a lightweight headband with a phone mount, letting them go about their daily routines (cooking, cleaning, working, running errands), and collecting thousands of hours of first-person egocentric video per day. That raw footage gets processed through their proprietary annotation pipeline into clean datasets with 3D body pose, depth maps, semantic labels, and activity segmentation, then sold to frontier robotics labs.

The core insight: most robot training data comes from teleoperation, where a human remotely controls a robot while it records. That approach produces limited, sterile data from controlled lab settings. Asimov flips the model. Instead of bringing humans to robots, they bring the data collection to where humans already are, capturing the full messiness and diversity of real environments.

How It Actually Works

The system has three layers, and each one is doing real work.

Layer 1: Contributor Network. Asimov has built a network of 5,000+ contributors across households, restaurants, hotels, and factories. A contributor signs up, receives a headband in the mail, mounts their smartphone, opens the Asimov app, and starts recording whatever they normally do. Pay ranges from $5-15/hr base, scaling up to $30/hr after the first 5 hours collected. No audio is captured, faces are auto-blurred, and PII is stripped. This is the Scale AI playbook applied to physical-world data, and it is clever.

Layer 2: Processing Pipeline. Raw egocentric video gets fed through a multi-stage annotation pipeline. This is where the real engineering lives. The pipeline extracts 3D body pose estimation (likely using something like MediaPipe or a custom model), generates depth maps from monocular video, runs semantic segmentation on objects and surfaces, and tags activity boundaries. The output is structured, labeled data that a robotics lab can directly use for imitation learning.

Layer 3: Data Marketplace. The processed datasets are sold B2B to frontier robotics labs. The pitch is straightforward: you get thousands of hours of diverse, real-world human motion data covering environments your teleoperation setup will never see. When Figure, Tesla Bot, or any humanoid lab needs to teach a robot how humans actually move through a kitchen, Asimov has the dataset.

The Tech Stack (My Best Guess)

  • Mobile App: React Native or Flutter for cross-platform data collection. Handles camera capture, local compression, and upload scheduling. Likely uses background upload queues to handle large video files on spotty connections.
  • Backend: Python-heavy. FastAPI or Django for the API layer. Celery or similar for async job processing. The annotation pipeline almost certainly runs on GPU instances (AWS/GCP) for pose estimation and segmentation.
  • CV/ML Models: MediaPipe Holistic or custom pose estimation models for 3D body tracking. Monocular depth estimation (MiDaS or ZoeDepth). Segment Anything (SAM) or custom semantic segmentation. Activity recognition models for temporal labeling.
  • Storage: S3 or GCS for raw video. Likely petabytes of storage at their scale. Structured metadata in Postgres. Possibly a feature store for processed annotations.
  • Infrastructure: Kubernetes for pipeline orchestration. GPU clusters for batch processing. CDN for contributor app distribution.

Why This Is Interesting

Asimov is betting on a thesis that is almost certainly correct: the bottleneck for humanoid robotics is not hardware or algorithms, it is data. The same pattern played out in language models. GPT-3 did not happen because of a novel architecture. It happened because OpenAI threw more data at transformers than anyone else had. Asimov is positioning itself to be the data layer for the equivalent moment in robotics.

The timing is perfect. Every major tech company is investing in humanoid robots: Tesla (Optimus), Figure (Figure 02), 1X (NEO), Apptronik (Apollo). These labs need massive, diverse datasets of human motion, and the teleoperation approach does not scale. Asimov's crowdsourced model can theoretically collect data from every type of environment, every body type, every household layout. That diversity is exactly what imitation learning models need to generalize.

The founders are well-matched to the problem. Anshul cut his teeth on data infrastructure at Scale AI, which is basically the gold standard for "build a crowd-powered data annotation business." Lyem built data pipelines for the Air Force, meaning he has experience with the kind of high-stakes, high-reliability data systems that robotics labs demand. Both are UC Berkeley undergrads, which means they are plugged into BAIR (Berkeley AI Research), one of the top robotics research groups in the world.

What I'd Build Differently

First, I would push harder on the hardware side. A phone-on-a-headband is a smart MVP, but the data quality is limited by phone cameras and IMUs. A purpose-built capture device with stereo cameras, a proper IMU, and maybe even a LiDAR sensor would produce significantly richer data. The iPhone already has LiDAR, so maybe they are already using it, but a dedicated device could capture hand-object interactions at much higher fidelity.

Second, I would think about building a synthetic data augmentation layer on top of the real-world captures. Use the egocentric video as a seed, then generate variations in simulation (different lighting, objects, room layouts). This multiplies the effective dataset size and lets you control for edge cases that are hard to capture in the wild.

Third, the contributor payment model feels like it could be a race to the bottom. At $5-15/hr, you are competing with gig economy platforms. I would explore a model where contributors get a royalty on data usage, creating long-term alignment. That said, the Scale AI model proves that straightforward per-hour payment works at massive scale, so maybe simplicity wins here.

How to Replicate This with Claude Code

Below is a replication guide: a complete Claude Code prompt that walks you through building a working version of Asimov. Copy it, install it, and start building. The core data collection and annotation pipeline is very buildable. The hard part is the contributor network and hardware, which require real-world ops, not just code.

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.

Build Asimov with Claude Code

Complete replication guide — install as a slash command or rules file

---
description: Build an Asimov clone — crowdsourced egocentric video data platform for training humanoid robots
---

# Build Asimov: Crowdsourced Human Motion Data for Robots

## What You're Building
A platform that lets contributors record first-person video of daily tasks using a smartphone, processes that video through a CV/ML annotation pipeline (pose estimation, depth maps, semantic segmentation), and serves the resulting datasets to robotics labs via an API. Think "Scale AI for physical-world robot training data."

## Tech Stack
- **Frontend:** Next.js 14 (dashboard), React Native (mobile data collection app)
- **Backend:** Python (FastAPI), Celery for async processing
- **Database:** PostgreSQL (metadata), S3 (raw video + processed data)
- **AI/ML:** MediaPipe Holistic, MiDaS depth estimation, Segment Anything (SAM), custom activity classifier
- **Key Libraries:** OpenCV, PyTorch, FFmpeg, boto3, React Native Camera

## Step 1: Project Setup
```bash
mkdir asimov-clone && cd asimov-clone
mkdir -p backend/app backend/workers backend/models frontend mobile

# Backend
cd backend
python -m venv venv && source venv/bin/activate
pip install fastapi uvicorn celery[redis] boto3 opencv-python mediapipe torch torchvision sqlalchemy psycopg2-binary

# Frontend dashboard
cd ../frontend
npx create-next-app@latest . --typescript --tailwind --app

# Mobile app
cd ../mobile
npx react-native init AsimovCollector --template react-native-template-typescript
```

## Step 2: Core Data Models
```sql
CREATE TABLE contributors (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  email TEXT UNIQUE NOT NULL,
  name TEXT,
  status TEXT DEFAULT 'pending', -- pending, active, suspended
  hours_collected NUMERIC DEFAULT 0,
  hourly_rate NUMERIC DEFAULT 5.00,
  shipping_address JSONB,
  created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE recording_sessions (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  contributor_id UUID REFERENCES contributors(id),
  started_at TIMESTAMPTZ,
  ended_at TIMESTAMPTZ,
  duration_seconds INTEGER,
  environment_type TEXT, -- kitchen, office, warehouse, etc.
  raw_video_url TEXT,
  status TEXT DEFAULT 'uploading', -- uploading, uploaded, processing, processed, failed
  file_size_bytes BIGINT,
  created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE annotations (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  session_id UUID REFERENCES recording_sessions(id),
  frame_number INTEGER,
  timestamp_ms INTEGER,
  body_pose_3d JSONB,       -- 33 landmarks with x,y,z coords
  depth_map_url TEXT,         -- S3 path to depth map image
  semantic_labels JSONB,      -- detected objects + bounding boxes
  activity_label TEXT,        -- cooking, cleaning, walking, etc.
  hand_objects JSONB,         -- what the hands are interacting with
  created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE datasets (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  name TEXT,
  description TEXT,
  session_ids UUID[],
  total_hours NUMERIC,
  total_frames BIGINT,
  environments TEXT[],
  activities TEXT[],
  created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE api_keys (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  customer_name TEXT,
  key_hash TEXT UNIQUE,
  permissions TEXT[] DEFAULT '{read}',
  rate_limit INTEGER DEFAULT 1000,
  created_at TIMESTAMPTZ DEFAULT NOW()
);
```

## Step 3: Mobile Data Collection App
Build a React Native app with these core features:

```typescript
// Core recording flow
// 1. Camera capture at 30fps, 1080p, front-facing for egocentric view
// 2. Background upload queue (chunk upload for large files)
// 3. Session management (start/stop/pause)
// 4. Earnings tracker

import { Camera, useCameraDevice } from 'react-native-vision-camera';
import BackgroundUpload from 'react-native-background-upload';

const RecordingScreen = () => {
  const device = useCameraDevice('back'); // mounted on headband facing forward
  const [isRecording, setIsRecording] = useState(false);

  const startRecording = async () => {
    // Record in 5-minute chunks for reliable upload
    await camera.current?.startRecording({
      fileType: 'mp4',
      videoBitRate: 8_000_000, // 8 Mbps for quality
      onRecordingFinished: (video) => {
        // Queue for background upload
        BackgroundUpload.startUpload({
          url: `${API_URL}/api/sessions/${sessionId}/upload`,
          path: video.path,
          method: 'POST',
          type: 'multipart',
        });
      },
    });
  };
};
```

## Step 4: Video Processing Pipeline
This is the core ML pipeline. Build as Celery workers:

```python
# backend/workers/process_video.py
import cv2
import mediapipe as mp
import torch
from celery import Celery

app = Celery('asimov', broker='redis://localhost:6379')

@app.task(bind=True, max_retries=3)
def process_session(self, session_id: str):
    """Full annotation pipeline for a recording session."""
    session = get_session(session_id)
    video_path = download_from_s3(session.raw_video_url)
    
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frame_num = 0
    
    # Initialize models
    pose_model = mp.solutions.holistic.Holistic(
        static_image_mode=False,
        model_complexity=2,
        min_detection_confidence=0.5
    )
    depth_model = load_midas_model()  # MiDaS for monocular depth
    sam_model = load_sam_model()       # SAM for segmentation
    
    annotations = []
    
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        
        # Process every 3rd frame (10fps effective)
        if frame_num % 3 == 0:
            # 1. Face blurring for privacy
            frame = blur_faces(frame)
            
            # 2. 3D Pose estimation
            pose_results = pose_model.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            body_pose = extract_3d_landmarks(pose_results)
            
            # 3. Depth estimation
            depth_map = depth_model(frame)
            depth_url = upload_depth_map(depth_map, session_id, frame_num)
            
            # 4. Semantic segmentation
            masks = sam_model.generate(frame)
            semantic_labels = classify_segments(masks)
            
            # 5. Hand-object interaction
            hand_objects = detect_hand_objects(pose_results, semantic_labels)
            
            annotations.append({
                'frame_number': frame_num,
                'timestamp_ms': int(frame_num / fps * 1000),
                'body_pose_3d': body_pose,
                'depth_map_url': depth_url,
                'semantic_labels': semantic_labels,
                'hand_objects': hand_objects,
            })
        
        frame_num += 1
    
    # 6. Activity classification (temporal)
    activities = classify_activities(annotations)
    for ann, activity in zip(annotations, activities):
        ann['activity_label'] = activity
    
    # Batch insert annotations
    bulk_insert_annotations(session_id, annotations)
    update_session_status(session_id, 'processed')


def blur_faces(frame):
    """Detect and blur faces for privacy."""
    face_cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + 'haarcascade_frontalface_default.xml'
    )
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, 1.1, 4)
    for (x, y, w, h) in faces:
        frame[y:y+h, x:x+w] = cv2.GaussianBlur(
            frame[y:y+h, x:x+w], (99, 99), 30
        )
    return frame
```

## Step 5: Data API for Customers
```python
# backend/app/api/datasets.py
from fastapi import APIRouter, Depends, HTTPException
from typing import Optional

router = APIRouter(prefix="/api/v1")

@router.get("/datasets")
async def list_datasets(
    environment: Optional[str] = None,
    activity: Optional[str] = None,
    min_hours: Optional[float] = None,
    api_key: str = Depends(verify_api_key)
):
    """List available datasets with filtering."""
    query = select(datasets)
    if environment:
        query = query.where(datasets.c.environments.contains([environment]))
    if activity:
        query = query.where(datasets.c.activities.contains([activity]))
    if min_hours:
        query = query.where(datasets.c.total_hours >= min_hours)
    return await database.fetch_all(query)

@router.get("/datasets/{dataset_id}/annotations")
async def get_annotations(
    dataset_id: str,
    offset: int = 0,
    limit: int = 1000,
    include_depth: bool = False,
    api_key: str = Depends(verify_api_key)
):
    """Stream annotations for a dataset."""
    # Return paginated annotations with signed S3 URLs
    annotations = await fetch_annotations(
        dataset_id, offset, limit, include_depth
    )
    return {
        "annotations": annotations,
        "next_offset": offset + limit,
        "total": await count_annotations(dataset_id)
    }

@router.get("/datasets/{dataset_id}/download")
async def download_dataset(dataset_id: str, format: str = "hdf5"):
    """Generate a signed download URL for the full dataset."""
    url = generate_presigned_url(dataset_id, format, expiry=3600)
    return {"download_url": url, "expires_in": 3600}
```

## Step 6: Contributor Dashboard (Next.js)
```typescript
// frontend/app/dashboard/page.tsx
export default async function ContributorDashboard() {
  return (
    <div className="max-w-4xl mx-auto p-6">
      <h1 className="text-2xl font-bold">Your Earnings</h1>
      
      {/* Earnings summary card */}
      <div className="grid grid-cols-3 gap-4 mt-6">
        <StatCard title="Hours Collected" value={stats.hours} />
        <StatCard title="Total Earned" value={`$${stats.earnings}`} />
        <StatCard title="Current Rate" value={`$${stats.rate}/hr`} />
      </div>
      
      {/* Recent sessions */}
      <h2 className="text-xl font-semibold mt-8">Recent Sessions</h2>
      <SessionsTable sessions={sessions} />
      
      {/* Quality score */}
      <QualityMetrics score={stats.qualityScore} />
    </div>
  );
}
```

## Step 7: Deploy
```bash
# Backend: Deploy to Railway or Render
# 1. Dockerize the FastAPI app + Celery workers
docker build -t asimov-api -f Dockerfile.api .
docker build -t asimov-worker -f Dockerfile.worker .

# 2. GPU workers: Deploy to Modal or RunPod for ML processing
# modal deploy backend/workers/process_video.py

# 3. Frontend: Deploy to Vercel
cd frontend && vercel deploy --prod

# 4. Mobile: Build and distribute via TestFlight/Play Store
cd mobile
npx react-native run-ios --configuration Release
```

## Key Insights
- The real product is not the software, it is the contributor network and data quality. Code is maybe 30% of the value.
- Chunk video uploads into 5-minute segments for reliability on mobile networks.
- Process at 10fps (every 3rd frame at 30fps) to balance annotation quality with compute cost.
- Face blurring and PII removal must happen BEFORE any data leaves the contributor's device or immediately on upload. Privacy is non-negotiable.
- Store raw video separately from annotations. Labs want to re-process with their own models.

## Gotchas
- **Storage costs explode fast.** 1 hour of 1080p video is ~5GB. 1,000 hours/day = 5TB/day. Use aggressive compression and lifecycle policies.
- **Pose estimation on egocentric video is harder than third-person.** The camera wearer's body is partially visible. You need models trained for this perspective.
- **Activity boundary detection is an unsolved problem.** When does "cooking" end and "cleaning" begin? Use overlapping windows and let customers filter.
- **Contributor quality varies wildly.** Build automated quality scoring (blur detection, camera stability, minimum object diversity) and gate payouts on quality.
- **Do not skip the privacy pipeline.** One leaked face or identifiable location in your dataset and your business is over.
build-asimov-clone.md