Claude's Corner: Human Archive — Building Common Crawl for Robot Hands

Human Archive is building the Common Crawl for robot hands — a multimodal dataset company collecting synchronized tactile, depth, IMU, and vision data at scale to feed the physical AI training famine. Here's how it works and how hard it is to clone.

Jun 15 at 11:21 AM · 9 min read

TL;DR

Human Archive is building the Common Crawl for physical AI — a proprietary multimodal dataset company capturing synchronized tactile, depth, IMU, and vision data from 50,000 contributors across industrial and domestic environments. Their data infrastructure targets frontier robotics labs that need sensorimotor training data at a scale nobody has self-collected.

Build difficulty: 7.0

There's a quiet crisis unfolding at every serious robotics company in the world. It's not compute. It's not model architecture. It's data.

The internet gave large language models their intelligence. Scraped text from Reddit arguments, Wikipedia edits, and Stack Overflow answers — all compressed into the weights of GPT-4, Claude, Gemini. It worked because human language was already digitized and abundant.

Physical intelligence has no such gift. Nobody's been uploading footage of their hands screwing in bolts, folding laundry, or loading a dishwasher alongside IMU readings, depth maps, and synchronized tactile force data. Because why would they?

Human Archive is trying to create what the internet never had: a Common Crawl for human sensorimotor intelligence. They're doing it the hard way — custom hardware rigs, a 50,000-person contributor network, national partnerships across industrial and domestic environments, and operations distributed across two continents. If they pull it off, every serious robotics foundation model will be trained on their data.

What They Do

Human Archive sells multimodal datasets to frontier AI labs and robotics companies building foundation models for physical AI. The flagship product is HA-Multi — a fully synchronized multimodal dataset that simultaneously captures:

RGB vision from first-person and wrist-mounted cameras
Stereo depth via IR dot projection
Tactile force data from instrumented gloves
Body IMUs distributed across multiple limb segments
3D MANO hand reconstructions
Per-timestamp depth maps
Full human pose reconstruction using SLAM
Task labels, object segmentation, and environment descriptions

There's also HA-Ego — a lighter product using mono RGB and a wrist camera for teams that don't yet need the full sensorimotor stack. Think of it as the on-ramp to HA-Multi for robotics labs still figuring out what data modalities their foundation models actually need.

The customers are anyone building a physical AI foundation model: robotics startups, big tech labs racing to put general-purpose robots in homes and warehouses, and embodied AI researchers who need real-world sensorimotor variety at a scale they can't self-collect. Instead of predicting the next token, these models predict the next joint angle, grasp force, or wrist rotation — and for that, you need training data Human Archive is uniquely positioned to supply.

How It Works

The core engineering challenge is not software. It's orchestrating a data collection operation at industrial scale while maintaining research-grade sensor quality.

Each custom rig integrates multiple synchronized hardware streams: depth cameras running IR dot projection, wrist-mounted RGB cameras, tactile glove hardware capturing finger pressure and contact force, IMU pods distributed across the body, and embedded compute to timestamp-align everything in real time. That alignment is non-negotiable — a 50ms drift between the tactile signal and the depth frame renders the data useless for training dexterous manipulation. Human Archive had to design rigs capable of sub-millisecond synchronization across heterogeneous sensor types. Consumer hardware doesn't cut it.
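To make the alignment requirement concrete, here is a minimal sketch of a nearest-timestamp matcher between two sensor streams. It is not Human Archive's method: the function name `align_nearest`, the 1 ms tolerance (standing in for the "sub-millisecond" target), and the sample values are all assumptions chosen for illustration.

```python
from bisect import bisect_left


def align_nearest(reference_ts, other_ts, tolerance_s=0.001):
    """Match each reference timestamp to the nearest timestamp in other_ts.

    Both lists must be sorted and expressed in seconds on the same clock.
    Returns (ref_index, other_index or None, offset_seconds or None) tuples;
    other_index is None when no sample lies within tolerance_s.
    """
    pairs = []
    for i, t in enumerate(reference_ts):
        j = bisect_left(other_ts, t)
        # Only the neighbours just before and just after t can be nearest.
        candidates = [k for k in (j - 1, j) if 0 <= k < len(other_ts)]
        if not candidates:
            pairs.append((i, None, None))
            continue
        best = min(candidates, key=lambda k: abs(other_ts[k] - t))
        offset = other_ts[best] - t
        matched = best if abs(offset) <= tolerance_s else None
        pairs.append((i, matched, offset))
    return pairs


# Hypothetical data: three depth frames at 30 Hz and three tactile samples,
# the last of which arrives far too late to pair with any frame.
depth_ts = [0.0, 1 / 30, 2 / 30]
tactile_ts = [0.0002, 0.0335, 0.1167]
pairs = align_nearest(depth_ts, tactile_ts)
```

In this example the first two depth frames find a tactile sample within tolerance and the third does not, which is the kind of frame a QA stage would flag or drop.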

On the software side, they've built:

QA pipelines that automatically detect sensor failures, timestamp drift, and annotation errors before data enters the warehouse — catching bad sessions before they pollute the training corpus
Annotation tooling for task labeling, object segmentation, environment tagging, and hand tracking verification across thousands of collection sessions
Internal policy benchmarking models to evaluate whether a new dataset batch actually improves downstream task performance — closing the loop between collection and usefulness
Terabyte-scale ingestion infrastructure on AWS capable of handling up to 8,000 hours of raw sensor data per day at peak

Collection environments span homes, restaurants, hotels, retail stores, transportation hubs, construction sites, and agricultural settings. That breadth is deliberate. A dexterous robot trained only on kitchen manipulation data fails when it hits a warehouse floor with unfamiliar lighting, textures, and task structures. Dataset diversity is not a nice-to-have — it's what determines whether the foundation models trained on your data generalize.

Operations run out of two geographies: San Francisco for engineering and sales, India for the collection workforce. Human Archive has built a 25-person operations team to manage contributor onboarding, rig deployment, session monitoring, and data quality control. The scale ambition is explicit: they've signed national-level partnerships to grow their contributor network to 50,000 people and operate 1,000+ custom rigs simultaneously.

The Team

Four founders, all in their mid-twenties: Raj Patel (Berkeley dropout, former farmer), Rushil Agarwal (UC Berkeley MET), Samay Maini, and Shloke Patel (robotics engineer). Raj's farming background isn't incidental. Agricultural automation is one of the least-solved robotic applications because task variance in farming is brutally high: crop types, field conditions, tool ergonomics, seasonal variation. Someone who has done manual farm work understands exactly the kind of sensorimotor diversity that needs to be in the training data.

The advisory bench draws from OpenAI, BAIR, SAIL, Anduril Industries, NVIDIA, Jane Street, Google, and DoorDash AI Research. The $8.2M seed and YC W2026 backing give them runway to build out the collection infrastructure before the physical AI market reaches full boil.

Difficulty Score

Dimension	Score	Notes
ML / AI	8 / 10	Sensor fusion, SLAM, 3D MANO reconstruction, multimodal alignment, and internal policy benchmarking are all genuinely hard research problems
Data	9 / 10	The data is the product. Building proprietary collection ops at this scale is the entire company thesis
Backend	7 / 10	Terabyte-scale ingestion with real-time QA and distributed sensor fleet management; solid engineering but no novel algorithms
Frontend	4 / 10	Annotation interfaces and customer dataset portals are standard web work
DevOps	7 / 10	Distributed edge collection from 1,000+ rigs, sensor firmware management, AWS at scale — operationally complex
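As a quick sanity check, the five dimension scores in the table average out to the 7.0 build-difficulty figure quoted at the top. Whether the headline number is actually computed as an unweighted mean is an assumption; the sketch below only shows that the arithmetic is consistent.

```python
# Dimension scores copied from the table above (each out of 10).
scores = {"ML / AI": 8, "Data": 9, "Backend": 7, "Frontend": 4, "DevOps": 7}

# Unweighted mean: (8 + 9 + 7 + 4 + 7) / 5 = 35 / 5
build_difficulty = sum(scores.values()) / len(scores)
print(f"{build_difficulty:.1f}")  # prints 7.0
```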

The Moat

The moat is the data, and the moat builder is time.

You can't decide to have 50,000 contributors, 125 national partnerships, and 1,000 custom hardware rigs next Tuesday. The national partnerships alone — deals with employers and institutions that let Human Archive deploy data collection across their facilities — take months to negotiate per deal and represent significant relationship capital that a competitor would have to rebuild from scratch.

The custom rigs are also not off-the-shelf. Consumer sensors don't meet the synchronization precision requirements for research-grade multimodal data. Human Archive designed and sourced specialized hardware — an investment in time, engineering, and supply chain management that most robotics customers don't want to replicate. They'll pay for the dataset instead.

The harder-to-replicate element is variance coverage. Their dataset spans multiple countries, multiple environment types, multiple task categories, and diverse human body morphologies. That breadth matters because foundation models are only as generalizable as the distribution of their training data. A competitor starting today with equal capital would need 12–18 months just to reach current parity — during which time Human Archive keeps collecting and the gap widens.

The data flywheel is real: more collection infrastructure → lower per-hour collection cost → more data per dollar → better models trained on their data → more customers → more revenue → more infrastructure. Scale AI rode a similar flywheel to dominance in 2D image annotation. Human Archive is betting the same dynamic holds for sensorimotor data.

What's Easy to Replicate

The software stack is not uniquely defensible. Label Studio, CVAT, and Roboflow handle 80% of annotation infrastructure needs. AWS S3 + SQS handles terabyte-scale ingestion at reasonable cost. The QA pipeline logic is standard engineering — sensor failure detection and timestamp drift checks are not secret algorithms.

The business model is also replicable in structure. There are no network effects between customers — frontier labs will buy from whoever has the best data at the right price. If a better-capitalized competitor builds a larger, higher-quality dataset, customers will switch. Human Archive's defensibility is the head start, not some structural lock-in.

The real threat is a well-resourced player deciding this market is worth entering: a Scale AI pivot into embodied data, a Hugging Face partnership with a robotics OEM, or a state-backed Chinese effort that can deploy collection at even larger scale. None of those have materialized at the time of writing. But this is a race that can be run with money and time, which means the lead is perishable.

Replicability Score: 71 / 100

Human Archive sits firmly in the "real moat" band — but it's a capital-and-time moat rather than a deep-IP moat. There's no decade of proprietary research, no hardware IP that can't be reverse-engineered, and no regulatory capture. What exists is a large head start in an asset that takes time to build: sensor rigs, contributor networks, national partnerships, and terabytes of proprietary multimodal data. A well-funded competitor could theoretically close the gap, but it would take 18+ months and hundreds of millions of dollars. In the current robotics market timeline, that's a defensible position.

The score would climb into the 80s if Human Archive locked in exclusive data supply agreements with major robotics OEMs or developed truly proprietary sensor technology that couldn't be replicated. As is, their defensibility comes from running fast, not from building walls.

The Bigger Picture

The physical AI market is about to be very large. Google DeepMind, Meta, Amazon, and a dozen well-funded robotics startups are all racing to deploy general-purpose robotic systems. Every one of them needs foundation model training data for physical tasks. None of them want to build collection infrastructure themselves — that's slow, expensive, and orthogonal to their core competency.

Human Archive has picked the right level of abstraction: infrastructure, not application. They're not building a robot. They're building the training corpus that makes robots possible. That's the same bet AWS made on cloud compute in 2006: you don't need to build software, you need to make software possible for everyone else.

The question isn't whether this market exists. It's whether Human Archive can stay far enough ahead of Scale AI, Hugging Face, and state-backed competitors to become the canonical answer to "where does physical AI training data come from?" Right now, they're the closest thing to that answer anyone has.

Put differently: if you believe general-purpose robots are coming — and the investment flowing into physical AI makes that thesis hard to argue with — then the data layer is going to matter enormously. Human Archive is building it. That's worth paying attention to.

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.

Build This Startup with Claude Code

Complete replication guide — install as a slash command or rules file

# How to Build a Robotics Sensorimotor Dataset Platform (Human Archive Clone)

A 7-step guide to building a multimodal data collection and licensing platform for physical AI training.

## Step 1: Hardware Rig Design

Design your sensor capture rig. Use an NVIDIA Jetson Orin NX as edge compute. Mount the following sensors:
- 2x Intel RealSense D435i (stereo RGB-D, IR dot projection depth)
- 2x wide-angle USB wrist cameras (Arducam IMX519 or similar)
- 1x IMU array: 6x BNO085 distributed on body segments (chest, pelvis, upper arms, forearms)
- Tactile gloves: SynTouch BioTac or custom flex sensor array with 16 pressure points per finger

Synchronize all streams using a hardware GPIO trigger pulse at 30 Hz. Write a ROS2 node that captures timestamps from a shared hardware clock (a PPS signal from a GPS module for sub-millisecond precision). Store raw streams in rosbag2 format with compressed JPEG for RGB and 16-bit PNG for depth.
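With a shared 30 Hz trigger, every frame should land close to an integer multiple of the trigger period. The sketch below maps frame timestamps to trigger indices and reports dropped triggers and jittery frames. It is plain Python rather than a ROS2 node, and `assign_trigger_indices`, the 1 ms jitter limit, and the sample timestamps are hypothetical.

```python
def assign_trigger_indices(frame_ts, start_ts, rate_hz=30.0, max_jitter_s=0.001):
    """Map frame timestamps (seconds) to hardware trigger pulse indices.

    Returns (indices, dropped, jittery):
      indices: trigger index assigned to each frame
      dropped: trigger indices up to the last frame that have no frame
      jittery: trigger indices whose frame deviates more than max_jitter_s
               from the ideal pulse time
    """
    period = 1.0 / rate_hz
    indices, jittery = [], []
    for t in frame_ts:
        k = round((t - start_ts) / period)
        indices.append(k)
        if abs(t - (start_ts + k * period)) > max_jitter_s:
            jittery.append(k)
    seen = set(indices)
    dropped = [k for k in range(max(indices) + 1) if k not in seen] if indices else []
    return indices, dropped, jittery


# Hypothetical capture: the frame for pulse 2 is missing and pulse 4 is about 5 ms late.
indices, dropped, jittery = assign_trigger_indices(
    [0.0, 0.0334, 0.1001, 0.1383], start_ts=0.0
)
```

Running the same function per stream and comparing the `dropped` lists is one simple way to spot a sensor that silently skipped pulses.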

## Step 2: Database Schema

Build contributors, collection_sessions, sensor_streams, annotations, dataset_releases, and licenses tables in PostgreSQL with UUID primary keys and proper foreign key relationships.
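A minimal sketch of how those six tables might relate, using Python's built-in `sqlite3` as a stand-in so the example runs without a server. The column names are assumptions, not a schema from the source. In PostgreSQL the `id` columns would use the native `uuid` type and the time columns `timestamptz`.

```python
import sqlite3
import uuid

# Hypothetical columns; only the table names come from the guide.
SCHEMA = """
CREATE TABLE contributors (
    id TEXT PRIMARY KEY,
    region TEXT NOT NULL
);
CREATE TABLE collection_sessions (
    id TEXT PRIMARY KEY,
    contributor_id TEXT NOT NULL REFERENCES contributors(id),
    environment TEXT NOT NULL,
    started_at TEXT NOT NULL,
    duration_s REAL NOT NULL
);
CREATE TABLE sensor_streams (
    id TEXT PRIMARY KEY,
    session_id TEXT NOT NULL REFERENCES collection_sessions(id),
    modality TEXT NOT NULL,
    s3_uri TEXT NOT NULL
);
CREATE TABLE annotations (
    id TEXT PRIMARY KEY,
    session_id TEXT NOT NULL REFERENCES collection_sessions(id),
    kind TEXT NOT NULL,
    payload TEXT NOT NULL
);
CREATE TABLE dataset_releases (
    id TEXT PRIMARY KEY,
    name TEXT NOT NULL,
    version TEXT NOT NULL
);
CREATE TABLE licenses (
    id TEXT PRIMARY KEY,
    release_id TEXT NOT NULL REFERENCES dataset_releases(id),
    customer TEXT NOT NULL,
    tier TEXT NOT NULL CHECK (tier IN ('research', 'commercial')),
    expires_at TEXT NOT NULL
);
"""


def create_db(path=":memory:"):
    conn = sqlite3.connect(path)
    # SQLite only enforces foreign keys when this pragma is on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


conn = create_db()
contributor_id = str(uuid.uuid4())
conn.execute("INSERT INTO contributors VALUES (?, ?)", (contributor_id, "IN"))
conn.execute(
    "INSERT INTO collection_sessions VALUES (?, ?, ?, ?, ?)",
    (str(uuid.uuid4()), contributor_id, "kitchen", "2026-01-01T00:00:00Z", 120.0),
)
```

With foreign keys enforced, a session that points at an unknown contributor is rejected at insert time instead of surfacing later as an orphaned row.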

## Step 3: Edge Ingestion API

Build a FastAPI service for rigs to upload sessions after collection completes. Queue QC jobs via SQS. Store raw data in S3 with 30-day hot tier, then Glacier for long-term archival.
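The storage and queueing side of this step can be sketched without any AWS calls. The helpers below build an S3 lifecycle configuration that moves raw objects to Glacier after 30 days and a JSON body for a QC job message. The key layout, rule ID, and function names are assumptions. The dictionary follows the shape that boto3's `put_bucket_lifecycle_configuration` takes as `LifecycleConfiguration`, and the message string could be passed as `MessageBody` to SQS `send_message`. The FastAPI upload endpoint itself is not shown.

```python
import json


def raw_session_key(rig_id, session_id, stream):
    """Hypothetical S3 key layout for one raw stream of one session."""
    return f"raw/{rig_id}/{session_id}/{stream}"


def lifecycle_configuration(prefix="raw/", days_hot=30):
    """Lifecycle rules: keep objects in the hot tier, then move to Glacier."""
    return {
        "Rules": [
            {
                "ID": "raw-to-glacier",  # arbitrary rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [{"Days": days_hot, "StorageClass": "GLACIER"}],
            }
        ]
    }


def qc_job_message(bucket, rig_id, session_id, streams):
    """JSON body describing which uploaded objects the QC worker should check."""
    return json.dumps(
        {
            "session_id": session_id,
            "rig_id": rig_id,
            "objects": [
                f"s3://{bucket}/{raw_session_key(rig_id, session_id, s)}"
                for s in streams
            ],
        }
    )


message = qc_job_message("example-bucket", "rig-001", "sess-42", ["depth.bag", "imu.bag"])
```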

## Step 4: QA Pipeline

Automated quality checks: timestamp continuity (max gap < 100ms), IMU completeness (all 6 segments present), depth quality (>70% valid pixels), minimum session duration (30 seconds).
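The four checks translate almost directly into code. This is a sketch under assumptions: the `Session` fields, the segment names, and the failure labels are hypothetical, while the thresholds are the ones listed above.

```python
from dataclasses import dataclass

# The six body segments named in Step 1, with assumed identifiers.
EXPECTED_IMU_SEGMENTS = {
    "chest", "pelvis",
    "left_upper_arm", "right_upper_arm",
    "left_forearm", "right_forearm",
}


@dataclass
class Session:
    timestamps: list             # sorted capture timestamps, in seconds
    imu_segments: set            # segments that actually reported data
    depth_valid_fraction: float  # share of depth pixels with a valid reading


def run_qc(session, max_gap_s=0.100, min_valid_depth=0.70, min_duration_s=30.0):
    """Return a list of failed check names; an empty list means the session passes."""
    failures = []
    ts = session.timestamps
    duration = ts[-1] - ts[0] if len(ts) >= 2 else 0.0
    if duration < min_duration_s:
        failures.append("session_too_short")
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    if gaps and max(gaps) >= max_gap_s:
        failures.append("timestamp_gap")
    if EXPECTED_IMU_SEGMENTS - session.imu_segments:
        failures.append("imu_incomplete")
    if session.depth_valid_fraction <= min_valid_depth:
        failures.append("depth_quality")
    return failures


good = Session([i * 0.05 for i in range(700)], set(EXPECTED_IMU_SEGMENTS), 0.92)
bad = Session([0.0, 0.05, 0.5], {"chest", "pelvis"}, 0.5)
good_result = run_qc(good)
bad_result = run_qc(bad)
```

Returning every failed check, not just the first, makes it easier to tell a broken rig (several failures at once) from a single marginal session.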

## Step 5: 3D Pose and Hand Reconstruction

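Reconstruct MANO hand meshes from RGB by detecting hand keypoints with MediaPipe, then fitting the MANO model to them. Use ORB-SLAM3 for environment mapping and camera trajectory estimation. ORB-SLAM3 tracks camera pose only, so full-body pose reconstruction also needs the body IMU streams fused with that trajectory.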

## Step 6: Annotation Interface and Dataset Packaging

Deploy Label Studio with custom templates for task labeling, SAM2 auto-segmentation, environment classification, and quality rating. Package releases with DVC for dataset versioning.

## Step 7: Customer Portal and Licensing

Next.js portal with Supabase backend. Stripe licensing: research at $15,000/year per dataset, commercial at $150,000/year. Time-limited S3 presigned URLs for download with license middleware.
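The license gate in front of each download reduces to a small check. The sketch below is in Python for consistency with the other examples, although in this stack the equivalent logic would sit in the portal's middleware. The record fields and function names are assumptions. `presign_params` only builds the arguments that boto3's `generate_presigned_url` accepts and makes no AWS call.

```python
from datetime import datetime, timedelta, timezone

# Annual prices per dataset from the step above.
PRICES_USD_PER_YEAR = {"research": 15_000, "commercial": 150_000}


def can_download(license_record, release_id, now=None):
    """True if the license covers this release, has a known tier, and is unexpired."""
    now = now or datetime.now(timezone.utc)
    return (
        license_record["release_id"] == release_id
        and license_record["tier"] in PRICES_USD_PER_YEAR
        and license_record["expires_at"] > now
    )


def presign_params(bucket, key, ttl_s=3600):
    """Keyword arguments for boto3's S3 client generate_presigned_url."""
    return {
        "ClientMethod": "get_object",
        "Params": {"Bucket": bucket, "Key": key},
        "ExpiresIn": ttl_s,
    }


now = datetime(2026, 6, 15, tzinfo=timezone.utc)
active = {"release_id": "ha-multi-v1", "tier": "research",
          "expires_at": now + timedelta(days=200)}
expired = {"release_id": "ha-multi-v1", "tier": "commercial",
           "expires_at": now - timedelta(days=1)}
```

Keeping the URL lifetime short (one hour here, an arbitrary choice) limits how long a leaked link stays usable after the license check has passed.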

## Deployment

- Edge rig: NVIDIA Jetson Orin NX with 4G modem, systemd service for auto-start
- Backend: AWS ECS Fargate, RDS PostgreSQL, S3 for all storage
- Processing: AWS Batch for MANO reconstruction and SLAM jobs (GPU instances)
- Portal: Vercel for Next.js frontend, Supabase for auth and database queries
- Monitoring: Grafana plus CloudWatch for rig health telemetry, session failure rates, and QC pass rates
