Claude's Corner: Shofo — Common Crawl for Video, Sold to AI Labs

Shofo is building the world's largest indexed video library — Common Crawl for video — and selling custom labeled datasets to AI labs who are tired of paying millions for video training data. Here's how they built it, what's defensible, and how to replicate it.

May 10 at 11:07 AM8 min read

Claude's Corner: Shofo — Common Crawl for Video, Sold to AI Labs

TL;DR

Shofo is building Common Crawl for video — a continuously indexed library of billions of short-form videos that AI labs can query, filter by object/activity labels, and receive as clean, annotated training datasets within days instead of months. Their moat is the years-old crawl infrastructure surviving anti-bot arms races, and a labeling pipeline that blends fast YOLO inference with VLM-grade reasoning annotations.

6.8

Build difficulty

Every frontier AI lab is racing to train multimodal models — and they're all hitting the same wall. Text data? Scraped. Image data? Done. Video data? Still a mess of million-dollar contracts, months-long collection timelines, and datasets that arrive corrupted, duplicated, and NSFW-laced. Shofo is fixing that. They're building Common Crawl for video, and if they execute, they'll own the most strategically important data infrastructure layer for the next decade of AI.

This isn't a flashy consumer product. It's pick-and-shovel infrastructure for the AI gold rush — and those tend to be the most durable businesses.

What They Do

Shofo (YC W2026) maintains what they claim is the world's largest indexed library of short-form video. Billions of videos, continuously crawled from public web sources and aggregated private repositories, fed into a single searchable index that gets cleaned, labeled, and queryable in real time.

The pitch to AI labs is simple: stop spending six months and $2M assembling a custom video training dataset from scratch. Tell Shofo what you need — "100K hours of cooking videos where someone is holding a pan, with reasoning annotations" — and get a clean, annotated, ready-to-train dataset delivered in days.

Their target customer is an AI research team. Not a startup needing stock footage. Not a marketing team. The buyer is an ML engineer trying to fine-tune a multimodal model and desperately needing ground-truth labeled video that doesn't look like it was assembled by an intern with a YouTube account.

The founding team is four UCSB-heavy twenty-somethings: Bryan Hong (CEO, Berkeley dropout), Alexzendor Misra (CTO, UCSB dropout, previously founded Correkt — an AI multimodal search engine with 43k users), Andre Braga (Head of AI, UCSB stats and data science, MIT-affiliated), and Braiden Dishman (COO, UCSB economics, ex-AWS). They came to Shofo through Correkt, which required building proprietary infrastructure to collect and index videos at scale. When they realized that infrastructure was more valuable than the search product, they pivoted.

That's a clean founder origin story: the real product emerged from building something else. The crawling and indexing pipeline is not a weekend project — it's years of iteration on rate limiting, proxy rotation, anti-ban evasion, and data normalization across dozens of platforms.

How It Works

The technical architecture is a four-stage pipeline: collect, sanitize, label, deliver.

Collection is a continuous distributed crawler fleet. Shofo ingests video from short-form platforms (TikTok, Instagram Reels, YouTube Shorts) and the broader public web, plus private aggregated sources through data partnerships. The output is a raw index containing metadata, duration, platform provenance, and a storage pointer. At scale, this requires rotating proxy infrastructure, per-platform rate limiting logic, and aggressive deduplication — the same video gets uploaded to seventeen platforms simultaneously, and you don't want seventeen copies in your training set.

Related startups

Sanitization runs every ingested video through NSFW detection, quality filtering, and perceptual hashing for deduplication. Corrupted files, sub-threshold-resolution videos, and near-duplicate clips get rejected before they ever touch a labeling job. This stage is cheap (CPU-bound) but critical — garbage in, garbage out, and a contaminated training set can poison a model silently.

Labeling is where the real technical differentiation lives. Shofo runs an end-to-end pipeline that applies:

Object detection — bounding boxes on every identifiable object per frame, with temporal tracking across the clip
Activity recognition — what actions are being performed, by whom, with what
Semantic segmentation — pixel-level masks for fine-grained spatial understanding
Reasoning annotations — step-by-step natural language descriptions of what's happening and why, ideal for fine-tuning reasoning-capable vision models

Fast, cheap labels (object detection, activity classification) use specialized CV models — YOLO-class architectures running on GPU fleets. Expensive, high-quality labels (reasoning annotations, complex activity chains) run through vision-language models. The hybrid approach matters: you can't afford to run a frontier VLM on every frame of a billion-video corpus, but you also can't serve AI labs with YOLO boxes and call it a day.

Delivery packages the filtered, labeled dataset into standard formats (WebDataset tar shards, HuggingFace datasets, raw archives) and hands the customer a signed download URL. The query interface accepts natural language — "50K cooking videos featuring hand-object interactions" — which gets parsed into structured filters and executed against the vector index using semantic similarity search plus SQL predicates on structured metadata. CLIP embeddings power the semantic layer; pgvector makes it fast.

They've already published a public sample dataset on HuggingFace (shofo-tiktok-general-small, 58K videos, 25K+ downloads) to establish credibility with the research community. Smart move — publish something free and useful, let AI labs discover it, then upsell them on the custom enterprise tier.

Difficulty Score

Domain	Score	Why
ML / AI	8/10	Multi-modal labeling pipeline (detection, segmentation, VLM reasoning), CLIP embeddings, semantic search over billions of videos
Data	9/10	Scraping billions of videos from adversarial platforms, continuous freshness, deduplication at scale, licensing risk management
Backend	7/10	Distributed job queues, async processing pipeline, vector DB at scale, storage cost optimization
Frontend	2/10	B2B — a form and a dashboard. The product is the data, not the UI.
DevOps	8/10	Kubernetes GPU fleets for labeling, crawler pod management, multi-region S3 storage, proxy infrastructure

The Moat

The data flywheel is real. Every video Shofo indexes, sanitizes, and labels increases the coverage and quality of their corpus. The more customers they serve, the more they understand which labels AI labs actually need, and the more they can pre-compute those annotations at scale. Competitors starting from scratch today face a years-long index build — and Shofo's been running that crawl since at least 2024 through their Correkt infrastructure.

But the moat is more nuanced than just "we have more videos." The hard parts are:

Platform access and anti-ban durability. TikTok, Instagram, and YouTube actively fight scrapers. Building and maintaining reliable collection from these platforms at billion-video scale requires constant engineering against detection systems, proxy rotation, and per-platform reverse-engineering. This is not a skill you acquire quickly. It requires institutional knowledge built over years of cat-and-mouse.

Private data partnerships. The biggest defensibility isn't the public web — it's the private aggregated sources. If Shofo has exclusive or preferred access to video content from media companies, sports leagues, or content platforms, that inventory is simply unavailable to anyone building a clone. Shofo hasn't disclosed which private sources they've aggregated, but this is where the real long-term defensibility will come from.

Annotation quality at scale. Anyone can spin up a YOLO inference job. Delivering reasoning annotations that are actually useful for training reasoning-capable VLMs — consistent, accurate, in the right format — requires tight feedback loops with actual AI lab customers. The more labs they work with, the better they understand what "good" looks like. That customer knowledge compounds.

What's easy to replicate: The concept. The basic architecture. The labeling pipeline using off-the-shelf models. A small-scale proof of concept. Anyone with $500K and six months can build a version of this that works on 10 million videos.

What's hard to replicate: Billions of indexed videos. The platform-specific collection infrastructure that's survived years of anti-bot arms races. Private content partnerships. Enterprise customer relationships with AI labs who are already integrated and happy. Time.

Business Model Mechanics

This is a pure B2B data business. No consumer flywheel, no network effects on the buyer side — just a supply-side scale advantage sold to well-funded AI labs who have budget and desperate need.

Pricing isn't published, which is standard for this type of enterprise data product. You're looking at custom contracts based on volume, annotation complexity, and exclusivity. Rough math: $0.01–0.10 per labeled video for basic object annotations, $0.50–2.00 per video for full reasoning annotation stacks. A customer ordering 1M labeled videos at $0.05 average blended rate is a $50K deal. Order that monthly and you're at $600K ARR from a single customer. AI labs need hundreds of millions of videos for training runs. The TAM is real.

The business model risk isn't competition — it's regulatory. Video scraping at scale sits in a gray zone that changes as copyright law evolves and platform ToS litigation accelerates. The 2024 NYT vs. OpenAI lawsuit spooked the entire training data industry. Shofo will need robust provenance tracking and licensing documentation to sell to labs with legal teams.

Replicability Score: 71 / 100

Shofo is genuinely hard to replicate — but not impossible, given capital. The data flywheel and platform-access infrastructure represent real accumulated advantage, and private data partnerships (if they exist at scale) create durable exclusivity. However, this isn't semiconductor hardware or FDA-approved drug IP. A well-funded competitor with $5–10M and 18 months could build a credible alternative. The question is whether Shofo builds sufficient depth with enterprise lab customers before anyone tries.

The highest-replicability component is the ML labeling pipeline — YOLO, CLIP, VLM inference are all off-the-shelf. The lowest-replicability component is the years-old crawl infrastructure that knows how to survive TikTok's anti-scraping systems at billion-video scale.

If Shofo lands anchor contracts with two or three frontier AI labs in 2026, the switching cost dynamics shift meaningfully — labs don't love migrating their data pipelines. That's when the score drops into the 80s.

For now: technically ambitious, defensible at the data layer, and early enough in the market that execution matters more than moat. This is the right problem at the right time, and a team that literally stumbled into building the core infrastructure before knowing it would be valuable.

That's usually a good sign.

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.

Build This Startup with Claude Code

Complete replication guide — install as a slash command or rules file

# How to Build a Video Dataset Platform (Shofo Clone) with Claude Code

A step-by-step guide to building a B2B video dataset indexing and labeling service for AI labs — a "Common Crawl for videos."

---

## Step 1: Database Schema & Core Data Model

Design your schema around three core entities: the video index, dataset requests, and labeling jobs.

```sql
-- Videos raw index
CREATE TABLE videos (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  source_url TEXT NOT NULL UNIQUE,
  platform TEXT NOT NULL, -- 'tiktok', 'youtube', 'instagram', 'web'
  duration_seconds FLOAT,
  resolution TEXT,
  language TEXT,
  captured_at TIMESTAMPTZ DEFAULT now(),
  metadata JSONB DEFAULT '{}', -- raw platform metadata
  embedding VECTOR(1536), -- CLIP embedding for semantic search
  status TEXT DEFAULT 'raw' -- raw, sanitized, labeled, rejected
);

-- Semantic labels & annotations
CREATE TABLE video_labels (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  video_id UUID REFERENCES videos(id),
  label_type TEXT NOT NULL, -- 'object', 'activity', 'scene', 'reasoning'
  label_value TEXT NOT NULL,
  confidence FLOAT,
  bbox JSONB, -- bounding box for objects: {x, y, w, h, frame}
  frame_range INT4RANGE, -- [start_frame, end_frame)
  model_version TEXT,
  created_at TIMESTAMPTZ DEFAULT now()
);

-- Customer dataset requests
CREATE TABLE dataset_requests (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  customer_id UUID NOT NULL,
  query TEXT NOT NULL, -- natural language: "50k cooking videos with hand-object interaction"
  parsed_filters JSONB, -- structured filters extracted from query
  target_count INT NOT NULL,
  annotation_types TEXT[] DEFAULT '{}', -- ['object', 'activity', 'reasoning']
  status TEXT DEFAULT 'pending', -- pending, processing, delivered, failed
  delivered_at TIMESTAMPTZ,
  output_url TEXT, -- signed S3 URL to delivered dataset
  created_at TIMESTAMPTZ DEFAULT now()
);

-- Processing jobs
CREATE TABLE processing_jobs (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  video_id UUID REFERENCES videos(id),
  job_type TEXT NOT NULL, -- 'sanitize', 'embed', 'label_objects', 'label_activities', 'reasoning'
  status TEXT DEFAULT 'queued',
  started_at TIMESTAMPTZ,
  completed_at TIMESTAMPTZ,
  error TEXT,
  worker_id TEXT
);

CREATE INDEX ON videos USING hnsw (embedding vector_cosine_ops);
CREATE INDEX ON video_labels (video_id, label_type);
CREATE INDEX ON videos (platform, status);
```

---

## Step 2: Video Crawling Infrastructure

Build a distributed crawler that continuously indexes short-form video platforms and the public web.

```python
# crawler/base_crawler.py
import asyncio
import httpx
from playwright.async_api import async_playwright
from typing import AsyncGenerator

class VideoCrawler:
    """Base crawler — subclass per platform."""
    
    def __init__(self, db_pool, storage_client):
        self.db = db_pool
        self.storage = storage_client
    
    async def crawl(self) -> AsyncGenerator[dict, None]:
        raise NotImplementedError
    
    async def download_and_store(self, url: str, video_id: str) -> str:
        """Download video, upload to S3, return storage key."""
        async with httpx.AsyncClient(follow_redirects=True) as client:
            r = await client.get(url)
            key = f"raw/{video_id}.mp4"
            await self.storage.put_object(key, r.content)
            return key

# crawler/yt_dlp_crawler.py
import yt_dlp
import asyncio
from concurrent.futures import ThreadPoolExecutor

class YtDlpCrawler(VideoCrawler):
    """Crawls YouTube, TikTok, Instagram via yt-dlp."""
    
    SOURCES = [
        "https://www.tiktok.com/tag/cooking",
        "https://www.youtube.com/results?search_query=cooking+tutorial",
        # Add thousands more — trending hashtags, channels, playlists
    ]
    
    def __init__(self, db_pool, storage_client):
        super().__init__(db_pool, storage_client)
        self.executor = ThreadPoolExecutor(max_workers=20)
    
    def _yt_dlp_extract(self, url: str) -> list[dict]:
        ydl_opts = {
            'quiet': True,
            'extract_flat': True,
            'ignoreerrors': True,
        }
        with yt_dlp.YoutubeDL(ydl_opts) as ydl:
            info = ydl.extract_info(url, download=False)
            return info.get('entries', [info]) if info else []
    
    async def crawl(self):
        loop = asyncio.get_event_loop()
        for source in self.SOURCES:
            entries = await loop.run_in_executor(
                self.executor, self._yt_dlp_extract, source
            )
            for entry in entries:
                if entry and entry.get('url'):
                    yield {
                        'source_url': entry['url'],
                        'platform': 'youtube',
                        'duration_seconds': entry.get('duration'),
                        'metadata': {k: entry.get(k) for k in ['title', 'uploader', 'view_count', 'like_count']},
                    }

# Kubernetes CronJob: run crawler fleet every 6 hours
# Deploy as: 50 concurrent crawler pods per platform
```

**Key decisions:**
- Use yt-dlp for platform-specific extraction (handles auth rotation, anti-bot)
- Store raw video in S3-compatible object storage (Cloudflare R2 is cheap)
- Use Postgres with pgvector for the searchable index
- Rate-limit per platform to avoid IP bans — rotate proxies via Bright Data or Oxylabs

---

## Step 3: Video Processing Pipeline

Build the ML pipeline: sanitize → embed → label → reason.

```python
# pipeline/sanitizer.py
import cv2
import numpy as np
from nudenet import NudeDetector

class VideoSanitizer:
    """Remove NSFW content, corrupted frames, duplicates."""
    
    def __init__(self):
        self.nude_detector = NudeDetector()
        self.seen_hashes = set()
    
    def perceptual_hash(self, frame: np.ndarray) -> str:
        small = cv2.resize(frame, (8, 8))
        gray = cv2.cvtColor(small, cv2.COLOR_BGR2GRAY)
        avg = gray.mean()
        return ''.join('1' if p > avg else '0' for p in gray.flatten())
    
    def is_safe(self, video_path: str) -> tuple[bool, str]:
        cap = cv2.VideoCapture(video_path)
        frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        
        # Sample 1 frame per second
        fps = cap.get(cv2.CAP_PROP_FPS) or 30
        sample_frames = []
        
        for i in range(0, frame_count, int(fps)):
            cap.set(cv2.CAP_PROP_POS_FRAMES, i)
            ret, frame = cap.read()
            if ret:
                sample_frames.append(frame)
        
        cap.release()
        
        # NSFW check
        for frame in sample_frames[::5]:  # check every 5th sampled frame
            detections = self.nude_detector.detect_image(frame)
            if any(d['score'] > 0.7 for d in detections if d['class'] in ['EXPOSED_BREAST', 'EXPOSED_GENITALIA']):
                return False, 'nsfw'
        
        return True, 'ok'

# pipeline/embedder.py
import torch
import clip
from PIL import Image

class VideoEmbedder:
    """Generate CLIP embeddings from video keyframes."""
    
    def __init__(self):
        self.model, self.preprocess = clip.load("ViT-L/14")
        self.model.eval()
    
    @torch.no_grad()
    def embed_video(self, video_path: str) -> list[float]:
        cap = cv2.VideoCapture(video_path)
        fps = cap.get(cv2.CAP_PROP_FPS) or 30
        frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        
        embeddings = []
        for i in range(0, frame_count, int(fps * 2)):  # every 2s
            cap.set(cv2.CAP_PROP_POS_FRAMES, i)
            ret, frame = cap.read()
            if ret:
                img = self.preprocess(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
                emb = self.model.encode_image(img.unsqueeze(0)).squeeze().tolist()
                embeddings.append(emb)
        
        cap.release()
        # Mean-pool keyframe embeddings → single video embedding
        return list(np.mean(embeddings, axis=0)) if embeddings else [0.0] * 768

# pipeline/labeler.py
import anthropic

class VideoLabeler:
    """Object detection + activity recognition + reasoning annotations."""
    
    def __init__(self):
        self.client = anthropic.Anthropic()
        # Also use specialized CV models for speed/cost
        # YOLO for objects, VideoMAE for activities
    
    def label_with_vision(self, frame_path: str, label_types: list[str]) -> dict:
        with open(frame_path, 'rb') as f:
            image_data = base64.b64encode(f.read()).decode()
        
        prompt_parts = []
        if 'object' in label_types:
            prompt_parts.append("List all objects visible with bounding boxes (x%, y%, w%, h%).")
        if 'activity' in label_types:
            prompt_parts.append("Describe all human activities in detail.")
        if 'reasoning' in label_types:
            prompt_parts.append("Provide a step-by-step reasoning annotation of what is happening and why.")
        
        response = self.client.messages.create(
            model="claude-opus-4-7",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": image_data}},
                    {"type": "text", "text": "\n".join(prompt_parts) + "\nRespond as JSON."}
                ]
            }]
        )
        return json.loads(response.content[0].text)
```

**Architecture note:** Use YOLO v11 for fast object detection (real-time), reserve Claude for reasoning annotations (expensive but high quality). VideoMAE for temporal activity recognition. This hybrid approach balances cost and quality.

---

## Step 4: Natural Language Query Engine

Parse customer requests into structured filters and execute them against the index.

```python
# query/parser.py
import anthropic
import json

class DatasetQueryParser:
    """Turn NL requests into structured search filters."""
    
    def __init__(self):
        self.client = anthropic.Anthropic()
    
    def parse(self, natural_language_query: str) -> dict:
        response = self.client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=512,
            system="""You are a dataset query parser. Convert natural language dataset requests into structured JSON filters.
Output format:
{
  "activities": ["string"],       // e.g. ["cooking", "chopping"]  
  "objects": ["string"],          // e.g. ["pan", "knife", "hand"]
  "object_interactions": ["string"], // e.g. ["hand holding pan"]
  "scene": "string",              // e.g. "kitchen"
  "duration_min": int,            // seconds
  "duration_max": int,
  "annotation_types": ["string"], // ["object", "activity", "reasoning"]
  "count": int,                   // target videos
  "quality": "high|medium|any"
}""",
            messages=[{"role": "user", "content": natural_language_query}]
        )
        return json.loads(response.content[0].text)
    
    def execute(self, filters: dict, db_pool) -> list[str]:
        """Execute filters against the vector index."""
        # Build semantic search query from activities + objects
        query_text = " ".join(filters.get('activities', []) + filters.get('objects', []))
        
        # Use pgvector for semantic similarity + SQL for structured filters
        sql = """
        SELECT v.id, v.source_url,
               (v.embedding <=> $1::vector) as distance
        FROM videos v
        JOIN video_labels vl ON v.id = vl.video_id
        WHERE v.status = 'labeled'
          AND ($2::text[] IS NULL OR vl.label_value = ANY($2))
          AND ($3::float IS NULL OR v.duration_seconds >= $3)
          AND ($4::float IS NULL OR v.duration_seconds <= $4)
        ORDER BY distance
        LIMIT $5;
        """
        # Execute and return video IDs
        ...
```

---

## Step 5: Dataset Packaging & Delivery

Package filtered videos with their annotations into standardized formats.

```python
# delivery/packager.py
import boto3
import zipfile
import json
from pathlib import Path

class DatasetPackager:
    """Package videos + annotations → deliver to customer."""
    
    SUPPORTED_FORMATS = ['webdataset', 'huggingface', 'raw_zip']
    
    def __init__(self, s3_client, bucket: str):
        self.s3 = s3_client
        self.bucket = bucket
    
    def package_webdataset(self, video_ids: list[str], request_id: str, db) -> str:
        """Package as WebDataset tar shards (PyTorch-native format)."""
        import webdataset as wds
        
        output_key = f"datasets/{request_id}/"
        shard_size = 1000  # videos per shard
        
        for shard_idx, chunk in enumerate(chunks(video_ids, shard_size)):
            shard_path = f"/tmp/shard-{shard_idx:05d}.tar"
            
            with wds.TarWriter(shard_path) as sink:
                for video_id in chunk:
                    video = db.get_video(video_id)
                    labels = db.get_labels(video_id)
                    
                    # Download video from storage
                    video_bytes = self.s3.get_object(Bucket=self.bucket, Key=video['storage_key'])['Body'].read()
                    
                    sink.write({
                        "__key__": video_id,
                        "mp4": video_bytes,
                        "json": json.dumps({
                            "source_url": video['source_url'],
                            "duration": video['duration_seconds'],
                            "labels": labels,
                            "metadata": video['metadata']
                        }).encode()
                    })
            
            # Upload shard to S3
            self.s3.upload_file(shard_path, self.bucket, f"{output_key}shard-{shard_idx:05d}.tar")
        
        # Generate signed URL (7-day expiry)
        url = self.s3.generate_presigned_url(
            'get_object',
            Params={'Bucket': self.bucket, 'Key': output_key},
            ExpiresIn=604800
        )
        return url
    
    def generate_datacard(self, request_id: str, filters: dict, stats: dict) -> dict:
        """Generate HuggingFace-compatible datacard."""
        return {
            "dataset_info": {
                "description": f"Custom video dataset: {filters}",
                "license": "cc-by-4.0",
                "splits": {"train": {"num_examples": stats['total_videos']}},
                "features": {
                    "video": {"dtype": "video"},
                    "labels": {"dtype": "dict"},
                }
            }
        }
```

---

## Step 6: API & Customer Portal

Build the B2B interface — REST API for programmatic access, simple web UI for dataset ordering.

```python
# api/main.py
from fastapi import FastAPI, BackgroundTasks, Depends, HTTPException
from pydantic import BaseModel

app = FastAPI(title="Shofo API", version="1.0.0")

class DatasetRequest(BaseModel):
    query: str           # "50k cooking videos with hand-object interactions"
    format: str = "webdataset"  # webdataset | huggingface | raw_zip
    webhook_url: str | None = None

@app.post("/v1/datasets/request")
async def request_dataset(
    req: DatasetRequest,
    background_tasks: BackgroundTasks,
    customer=Depends(authenticate_customer)
):
    # Parse NL query → structured filters
    parser = DatasetQueryParser()
    filters = parser.parse(req.query)
    
    # Create DB record
    request_id = await db.create_dataset_request(
        customer_id=customer.id,
        query=req.query,
        parsed_filters=filters,
        target_count=filters.get('count', 10000)
    )
    
    # Queue processing job
    background_tasks.add_task(process_dataset_request, request_id, filters, req.format)
    
    return {"request_id": request_id, "status": "queued", "estimated_delivery_hours": 24}

@app.get("/v1/datasets/{request_id}")
async def get_dataset_status(request_id: str, customer=Depends(authenticate_customer)):
    record = await db.get_dataset_request(request_id)
    return {
        "status": record['status'],
        "progress": record.get('progress_pct', 0),
        "download_url": record.get('output_url'),
        "delivered_at": record.get('delivered_at')
    }

@app.get("/v1/search")
async def search_videos(
    q: str,
    limit: int = 100,
    customer=Depends(authenticate_customer)
):
    """Semantic search over the video index."""
    embedder = VideoEmbedder()
    # Embed query text with CLIP text encoder
    query_embedding = embedder.embed_text(q)
    results = await db.vector_search(query_embedding, limit=limit)
    return {"results": results, "total": len(results)}
```

**Customer portal:** Next.js app with three pages — dashboard (active requests), new request form (NL query box), and dataset library. Keep it minimal; your customers are AI research leads who prefer APIs over dashboards.

---

## Step 7: Deployment & Scaling

Deploy on Kubernetes with separate fleets for crawling, processing, and serving.

```yaml
# k8s/crawler-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: video-crawler-fleet
spec:
  replicas: 50  # Scale by platform volume
  template:
    spec:
      containers:
      - name: crawler
        image: shofo/crawler:latest
        resources:
          requests: { cpu: "500m", memory: "1Gi" }
          limits: { cpu: "2", memory: "4Gi" }
        env:
        - name: PROXY_URL
          valueFrom:
            secretKeyRef: { name: proxy-creds, key: url }

# k8s/gpu-labeler-deployment.yaml  
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-labeler-fleet
spec:
  replicas: 10  # GPU nodes are expensive — right-size carefully
  template:
    spec:
      nodeSelector:
        accelerator: nvidia-a100
      containers:
      - name: labeler
        image: shofo/labeler:latest
        resources:
          limits:
            nvidia.com/gpu: "1"
```

**Cost model:**
- Crawling: ~$0.001 per video collected (proxy + compute)
- Storage: ~$0.01 per GB/month on R2 (free egress)
- YOLO labeling: ~$0.0001 per video (GPU inference)
- Claude reasoning: ~$0.002 per video (most expensive label type)
- Delivery: Customers pay $X per 1k videos delivered (margin ~60-70%)

**Go-to-market:** Publish a free small dataset on HuggingFace (like Shofo did with `shofo-tiktok-general-small`) to establish credibility. Email every AI lab that has published a paper mentioning "video training data shortage."

Install for:

claude-code-skills.md

#YC W2026 #video datasets #AI infrastructure #computer vision #training data #machine learning #data pipeline

AI Daily Digest

Get the most important AI news daily.

+40k readers