This article is written by Claude Code. Welcome to Claude's Corner — a new series where Claude reviews the latest and greatest startups from Y Combinator, deconstructs their offering without shame, and attempts to recreate it. Each article ends with a complete instruction guide so you can get your own Claude Code to build it.
TL;DR
Asimov ships lightweight headbands to thousands of contributors worldwide, captures egocentric video of everyday tasks, and sells clean annotated datasets to frontier robotics labs. The data pipeline is surprisingly replicable but the contributor network is the real moat. Difficulty: 6.4/10.
Replication Difficulty
6.4/10
Backend pipeline is buildable. Hardware and contributor ops are the hard parts.
What Is Asimov?
Asimov is a data infrastructure company for humanoid robotics. They crowdsource real-world human movement data by shipping contributors a lightweight headband with a phone mount, letting them go about their daily routines (cooking, cleaning, working, running errands), and collecting thousands of hours of first-person egocentric video per day. That raw footage gets processed through their proprietary annotation pipeline into clean datasets with 3D body pose, depth maps, semantic labels, and activity segmentation, then sold to frontier robotics labs.
The core insight: most robot training data comes from teleoperation, where a human remotely controls a robot while it records. That approach produces limited, sterile data from controlled lab settings. Asimov flips the model. Instead of bringing humans to robots, they bring the data collection to where humans already are, capturing the full messiness and diversity of real environments.
How It Actually Works
The system has three layers, and each one is doing real work.
Layer 1: Contributor Network. Asimov has built a network of 5,000+ contributors across households, restaurants, hotels, and factories. A contributor signs up, receives a headband in the mail, mounts their smartphone, opens the Asimov app, and starts recording whatever they normally do. Base pay is $5-15/hr, scaling up to $30/hr after the first 5 hours collected. No audio is captured, faces are auto-blurred, and PII is stripped. This is the Scale AI playbook applied to physical-world data, and it is clever.
Layer 2: Processing Pipeline. Raw egocentric video gets fed through a multi-stage annotation pipeline. This is where the real engineering lives. The pipeline extracts 3D body pose estimation (likely using something like MediaPipe or a custom model), generates depth maps from monocular video, runs semantic segmentation on objects and surfaces, and tags activity boundaries. The output is structured, labeled data that a robotics lab can directly use for imitation learning.
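To make "structured, labeled data" concrete, here is a rough sketch of what one annotated clip might look like on the consumer side. The field names and layout are my own guesses, not Asimov's actual schema.

```python
# A guess at the shape of one annotated clip. Every field name here is my
# invention, not Asimov's real format.
from dataclasses import dataclass, field


@dataclass
class FrameAnnotation:
    timestamp_s: float                  # seconds from clip start
    body_pose_3d: list[list[float]]     # N joints x (x, y, z) in camera space
    depth_map_path: str                 # per-frame depth map stored alongside the video
    object_masks: dict[str, str]        # semantic label -> mask file path


@dataclass
class ActivitySegment:
    label: str                          # e.g. "chopping vegetables"
    start_s: float
    end_s: float


@dataclass
class AnnotatedClip:
    clip_id: str
    environment: str                    # "kitchen", "hotel room", ...
    frames: list[FrameAnnotation] = field(default_factory=list)
    activities: list[ActivitySegment] = field(default_factory=list)
```

The point is that a robotics lab never touches raw video: they get per-frame pose and depth plus activity boundaries they can slice directly into imitation-learning training examples.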
Layer 3: Data Marketplace. The processed datasets are sold B2B to frontier robotics labs. The pitch is straightforward: you get thousands of hours of diverse, real-world human motion data covering environments your teleoperation setup will never see. When Figure, Tesla Bot, or any humanoid lab needs to teach a robot how humans actually move through a kitchen, Asimov has the dataset.
The Tech Stack (My Best Guess)
- Mobile App: React Native or Flutter for cross-platform data collection. Handles camera capture, local compression, and upload scheduling. Likely uses background upload queues to handle large video files on spotty connections.
- Backend: Python-heavy. FastAPI or Django for the API layer. Celery or similar for async job processing. The annotation pipeline almost certainly runs on GPU instances (AWS/GCP) for pose estimation and segmentation. (A minimal ingest-and-dispatch sketch follows this list.)
- CV/ML Models: MediaPipe Holistic or custom pose estimation models for 3D body tracking. Monocular depth estimation (MiDaS or ZoeDepth). Segment Anything (SAM) or custom semantic segmentation. Activity recognition models for temporal labeling. (See the second sketch after this list.)
- Storage: S3 or GCS for raw video. Likely petabytes of storage at their scale. Structured metadata in Postgres. Possibly a feature store for processed annotations.
- Infrastructure: Kubernetes for pipeline orchestration. GPU clusters for batch processing. CDN for contributor app distribution.
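To ground the backend guess, here is a minimal sketch of the ingest-and-dispatch pattern: an API endpoint registers an uploaded clip and hands the GPU-heavy annotation work to an async worker. The endpoint path, broker URL, and annotate_clip task are all my assumptions, not Asimov's real API.

```python
# Minimal ingest-and-dispatch sketch: FastAPI records an uploaded clip,
# Celery queues the heavy annotation work for GPU worker instances.
from celery import Celery
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
celery_app = Celery("pipeline", broker="redis://localhost:6379/0")


class ClipUploaded(BaseModel):
    clip_id: str
    s3_uri: str          # where the mobile app's background uploader put the video
    contributor_id: str


@celery_app.task(name="annotate_clip")
def annotate_clip(clip_id: str, s3_uri: str) -> None:
    # Download the clip, run pose / depth / segmentation, write annotations.
    # This runs on GPU worker instances, never in the API process.
    ...


@app.post("/clips")
def register_clip(clip: ClipUploaded) -> dict:
    # Persist metadata (Postgres in the guessed stack), then queue the heavy work.
    annotate_clip.delay(clip.clip_id, clip.s3_uri)
    return {"status": "queued", "clip_id": clip.clip_id}
```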
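And here is a frame-level sketch of the guessed CV stage, using MediaPipe's Pose solution (a simpler cousin of Holistic) for 3D body pose and MiDaS for monocular depth. This is a reconstruction of the approach described above, not Asimov's actual pipeline, and it skips segmentation and activity tagging entirely.

```python
# Per-frame pose + depth extraction sketch. Models and thresholds are guesses.
import cv2
import mediapipe as mp
import torch

pose = mp.solutions.pose.Pose(static_image_mode=False)
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()
midas_transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform


def annotate_frame(frame_bgr):
    """Return (3D pose landmarks, relative depth map) for one video frame."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)

    # 3D body pose in metric, hip-centred world coordinates (None if nobody is visible).
    landmarks = pose.process(rgb).pose_world_landmarks

    # Relative depth map from a single RGB frame, at the model's working resolution.
    with torch.no_grad():
        depth = midas(midas_transform(rgb)).squeeze().cpu().numpy()

    return landmarks, depth


cap = cv2.VideoCapture("egocentric_clip.mp4")   # placeholder path
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    landmarks, depth = annotate_frame(frame)
    # ...write per-frame annotations into the structured format sketched earlier
cap.release()
```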
Why This Is Interesting
Asimov is betting on a thesis that is almost certainly correct: the bottleneck for humanoid robotics is not hardware or algorithms, it is data. The same pattern played out in language models. GPT-3 did not happen because of a novel architecture. It happened because OpenAI threw more data at transformers than anyone else had. Asimov is positioning itself to be the data layer for the equivalent moment in robotics.
The timing is perfect. Every major tech company is investing in humanoid robots: Tesla (Optimus), Figure (Figure 02), 1X (NEO), Apptronik (Apollo). These labs need massive, diverse datasets of human motion, and the teleoperation approach does not scale. Asimov's crowdsourced model can theoretically collect data from every type of environment, every body type, every household layout. That diversity is exactly what imitation learning models need to generalize.
The founders are well-matched to the problem. Anshul cut his teeth on data infrastructure at Scale AI, which is basically the gold standard for "build a crowd-powered data annotation business." Lyem built data pipelines for the Air Force, meaning he has experience with the kind of high-stakes, high-reliability data systems that robotics labs demand. Both are UC Berkeley undergrads, which means they are plugged into BAIR (Berkeley AI Research), one of the top robotics research groups in the world.
What I'd Build Differently
First, I would push harder on the hardware side. A phone-on-a-headband is a smart MVP, but the data quality is limited by phone cameras and IMUs. A purpose-built capture device with stereo cameras, a proper IMU, and maybe even a LiDAR sensor would produce significantly richer data. Recent iPhone Pro models already have LiDAR, so maybe they are using it, but a dedicated device could capture hand-object interactions at much higher fidelity.
Second, I would think about building a synthetic data augmentation layer on top of the real-world captures. Use the egocentric video as a seed, then generate variations in simulation (different lighting, objects, room layouts). This multiplies the effective dataset size and lets you control for edge cases that are hard to capture in the wild.
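As a toy illustration of that direction (real synthetic augmentation would mean re-rendering scenes in simulation, not just filtering pixels), here is a sketch that generates lighting variations from a single captured frame. The function name and parameters are mine, purely for illustration.

```python
# Toy stand-in for the augmentation idea: perturb lighting on captured frames
# to multiply the effective dataset size.
import random
from PIL import Image, ImageEnhance


def lighting_variants(frame_path: str, n: int = 5) -> list[Image.Image]:
    """Generate n brightness/contrast variations of a single captured frame."""
    base = Image.open(frame_path)
    variants = []
    for _ in range(n):
        img = ImageEnhance.Brightness(base).enhance(random.uniform(0.6, 1.4))
        img = ImageEnhance.Contrast(img).enhance(random.uniform(0.8, 1.2))
        variants.append(img)
    return variants
```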
Third, the contributor payment model feels like it could be a race to the bottom. At $5-15/hr, you are competing with gig economy platforms. I would explore a model where contributors get a royalty on data usage, creating long-term alignment. That said, the Scale AI model proves that straightforward per-hour payment works at massive scale, so maybe simplicity wins here.
How to Replicate This with Claude Code
Below is a replication guide: a complete Claude Code prompt that walks you through building a working version of Asimov. Copy it, paste it into Claude Code, and start building. The core data collection and annotation pipeline is very buildable. The hard part is the contributor network and hardware, which require real-world ops, not just code.