Claude's Corner: Cumulus Labs, When the Inference Market Gets Outclassed by CUDA Kernels

Most GPU clouds rent H100s, wrap vLLM, and call it a product. Cumulus Labs built Ion, a C++ inference engine with custom CUDA kernels for the NVIDIA GH200, and they're posting 7,167 tok/s on a single chip and 12.5-second cold starts. Here's how the hardware-native tricks work, and whether anyone can replicate them.

9 min read
Claude's Corner: Cumulus Labs, When the Inference Market Gets Outclassed by CUDA Kernels

TL;DR

Cumulus Labs (YC W26) built Ion, a hardware-native C++ inference engine targeting the NVIDIA GH200 Grace Hopper architecture with custom CUDA kernels. Their coherent CUDA graphs, eager KV writeback, and phantom-tile scheduling deliver 7,167 tok/s on a single chip and 12.5-second cold starts, benchmarks that outperform every major competitor. The moat is a two-engineer team willing to write production CUDA for a living, and a software-licensing endgame once the IP is proven in the cloud.

5.6
D

Build difficulty

The inference market has a commoditization problem. Most "GPU cloud" companies rent H100s from CoreWeave, bolt an OpenAI-compatible endpoint on top of vLLM or TGI, and call it a product. Differentiation is a landing page. Moat is a blog post. The race to the bottom is already underway, and nobody particularly interesting is winning it.

Cumulus Labs is not playing that game. The YC W26 company is building a C++ inference engine, called Ion, written specifically for the NVIDIA GH200 Grace Hopper architecture, with custom CUDA kernels that get throughput numbers nobody else is publishing. They're posting benchmarks that make Together AI look slow: 588 tokens per second versus Together's 298 on the same Llama workload, same price. 7,167 tok/s total on a single GH200 chip. Cold starts in 12.5 seconds while Modal is still apologizing for its 70-second spin-up times.

Related startups

This isn't a marketing story. The kernel techniques are documented on their engineering blog and they're real, coherent CUDA graphs, eager KV writeback, phantom-tile attention scheduling. These are not vLLM configuration flags. These are the kinds of optimizations that come from a team that actually understands what the Grace Hopper architecture physically does, and is willing to write C++ to exploit it.

If you're spending $2,400 a month on a dedicated H100 that's idle 80% of the time, or you're eating 60-second cold starts on RunPod, Cumulus is worth your attention.

What They Build

Cumulus is a serverless GPU cloud for AI inference, with three interconnected products:

Custom model hosting, deploy any model (Llama 3, Mistral, Qwen, your fine-tune, whatever) and pay by the GPU-second. The instance scales to zero when idle, so you're not paying for a sleeping H100 at 3 AM. Cold starts complete in 12.5 seconds, which means scale-to-zero is actually usable for latency-sensitive workloads, not just batch jobs.

IonRouter, a drop-in replacement for OpenRouter. Same API surface, same model names, same JSON format. Route your existing API calls through Cumulus's infrastructure and you get better throughput and lower latency at matched or better price-per-token. Zero migration cost, immediate benefit. This is how you acquire customers who aren't ready to think about infrastructure.

On-prem cluster management, for enterprises running their own GH200 clusters who don't want to manage the scheduling, load balancing, and health monitoring themselves. Cumulus ships the software stack to run their inference engine on your hardware. This is the high-ACV enterprise wedge.

Target customers are AI application builders: the growth engineering teams burning cash on dedicated GPUs, startup CTOs who need to serve a fine-tuned model without a full MLOps hire, and ML platform teams at mid-market companies who want managed inference without vendor lock-in on model choice.

Founders Suryaa Rajinikanth (Georgia Tech, previously TensorDock lead engineer and Palantir) and Veer Shah (University of Wisconsin-Madison, CS) built everything themselves, the engine, auth, billing, load balancing, and the CUDA kernels underneath. This is not a team that outsourced the hard parts.

How the Ion Inference Engine Works

The core of the Cumulus bet is the GH200 Grace Hopper Superchip, an unusual piece of hardware that most inference providers are running standard software on, because standard software is what they know. Cumulus is exploiting what makes GH200 physically different from H100, at the hardware primitive level.

The Grace Hopper architecture combines NVIDIA's Grace ARM CPU with the Hopper GPU via NVLink-C2C: a hardware-coherent interconnect operating at cache-line granularity. On a standard H100 server, CPU and GPU have completely separate memory spaces, data moves via PCIe or NVLink as explicit, software-initiated transfers. On GH200, CPU LPDDR5 (480GB) and GPU HBM3e (96GB) live in the same coherence domain. The GPU can read CPU memory without a memcpy. The CPU can write parameters that the GPU picks up mid-execution. They share a cache hierarchy.

Running standard vLLM on a GH200 ignores all of this. You get H100-level performance on a GH200-level electricity bill. Cumulus's Ion engine is built to treat these architectural properties as first-class features rather than footnotes.

Coherent CUDA Graphs

CUDA graphs eliminate GPU driver overhead by recording kernel launches into a static execution plan and replaying it. The catch: the graph is static. Changing parameters, batch size, sequence length, KV cache pointers, requires rebuilding the graph, which takes 30-50ms and often destroys the latency benefit you were trying to achieve.

GH200's NVLink-C2C breaks this limitation. Since CPU and GPU share a coherent address space, Ion writes updated parameters into CPU-side memory during graph execution. The GPU reads those parameters directly, inside the running graph, without triggering a rebuild. No kernel relaunch. No graph reconstruction overhead. The parameters are live-updated, in-flight, with hardware-guaranteed cache coherence ensuring the GPU always sees the most recent CPU write.

This technique is architecturally impossible on H100. It is not a clever hack, it's using a hardware primitive that GH200 uniquely provides and that most inference engines were never designed to consider.

Eager KV Writeback

Autoregressive generation grows the KV cache with each token. Standard inference engines hold the entire KV cache in HBM until memory pressure forces eviction. This constrains how many concurrent requests you can serve, HBM fills up, new requests queue, latency climbs.

Ion streams KV state proactively across the NVLink-C2C link to CPU LPDDR5 as it's generated, before HBM fills. CPU LPDDR5 has lower bandwidth than HBM3e, but coherent access means the GPU retrieves KV data from CPU memory without full kernel round-trips, it's a cache-coherent read, not a separate DMA transfer. The net effect is that HBM stays available for active compute while LPDDR5 absorbs KV history for in-flight sequences. Concurrent request capacity scales significantly because you're no longer bottlenecked on 96GB of HBM.

Phantom-Tile Attention Scheduling

Transformer inference at small batch sizes, serving 1 to 8 concurrent requests, has a fundamental inefficiency: you're only launching enough tiles to cover active requests, leaving most of the SM (streaming multiprocessor) grid idle. GPU occupancy collapses. Throughput drops to a fraction of theoretical peak. This is why serverless inference has historically had terrible economics: the GPU is expensive to run and embarrassingly underutilized between requests.

Ion's phantom-tile scheduling deliberately oversubscribes the SM grid. The engine launches more compute tiles than active requests would require, forcing the scheduler to run speculative work on idle SMs. This keeps the full GPU occupied across a wider range of concurrency levels, dramatically improving throughput at the low-concurrency regime that characterizes most API workloads. The benchmark result, 7,167 tok/s on a single chip, reflects this sustained occupancy at scale.

Cold Start Architecture

The 12.5-second cold start comes from model snapshot caching combined with GH200's raw memory bandwidth. After initial model loading, Ion snapshots the GPU memory state (weights already loaded, context initialized) to fast NVMe storage. On cold start, it restores this snapshot directly to GPU HBM rather than re-running model initialization code. GH200's 900 GB/s HBM3e write bandwidth means restoring a 32B model's weights takes a few seconds. The remaining time is container orchestration, which Cumulus has optimized to pipeline storage reads with container startup, hiding latency rather than absorbing it sequentially.

Difficulty Score

DimensionScoreWhy
ML/AI7/10Custom attention kernels and GPU-level model surgery require deep architecture knowledge. Not research ML, production systems ML, which is a different skill.
Data2/10No training data required. Benchmark datasets are public. Inference does not need proprietary data.
Backend8/10Multi-tenant GPU scheduling, hardware-granularity billing, request routing, and model lifecycle management are all significantly harder than typical SaaS backend work.
Frontend2/10API-first product. The UI is OpenAI-compatible JSON. Dashboard is useful but not the product.
DevOps9/10GPU fleet management at scale, cold start optimization at the OS level, CUDA driver compatibility matrix, multi-node health monitoring. This is infrastructure engineering at its most demanding.

The Moat

Trivially copyable: The API surface. Everyone is already OpenAI-compatible. The pricing model (pay-per-second) is table stakes. The "serverless GPU" concept has existed since 2022. A weekend project can replicate the business model description on a landing page.

Genuinely hard to copy: IonAttention. Writing production-grade custom CUDA kernels for a specific GPU architecture requires a rare combination: deep understanding of GPU memory hierarchy, proficiency in CUDA C++, access to the target hardware for profiling, and the patience to debug race conditions that manifest intermittently at 95% occupancy. The coherent CUDA graph technique requires knowing GH200's coherence model well enough to trust it at production throughput, NVIDIA documents it, but most engineers have never had reason to think about it.

The phantom-tile scheduling approach is particularly interesting because it's architecture-aware in a way that translates across hardware generations. The insight (oversubscribe the SM grid to prevent occupancy collapse at low concurrency) is applicable to any SIMT architecture. Cumulus will be able to port this to Blackwell GB200 when that hardware arrives.

The NVIDIA Inception relationship matters more than it looks. Inception gives Cumulus early access to hardware documentation, beta driver releases, and potentially pre-production chip access. In a space where each new GPU generation obsoletes prior optimizations, being three months ahead of public chip availability is worth real money in competitive benchmark positioning.

The team bet is ultimately what you're investing in. The moat isn't a patent or a dataset, it's two engineers who will write C++ for a living to squeeze performance out of specific hardware, who have demonstrated they can do it, and who are building organizational knowledge of how to do it faster with each new architecture. That kind of engineering culture is uncommon and slow to replicate.

Replicability Score: 65/100

The kernel techniques are hard but documented, a senior CUDA engineer with three months and a GH200 test cluster could reproduce IonAttention. The snapshot-based cold start system is clever but not magical. The business model is entirely standard.

What makes full replication slow is the combination required in one small team: kernel engineering depth, production infrastructure experience at GPU-tier complexity, hardware access (GH200s are not cheap), and enterprise sales motion. Each is independently tractable. Running all four simultaneously, at startup speed, is where the 18-month lag lives.

Competitors also aren't standing still. Together AI has significant engineering talent. Modal has the fastest cold start iteration cycles in the market. Baseten has the largest enterprise customer base. Cumulus needs to stay ahead on pure performance benchmarks long enough to convert performance leads into sticky contracts, before the larger players ship their own hardware-native optimizations.

If they do, the on-prem cluster management product is the actual long-term business, enterprises with GH200 clusters who run Ion on their own hardware provide recurring revenue without infrastructure cost, at margins that look more like software than cloud. That's the asymmetric upside: start as a cloud provider to prove the technology, transition to a software license once the IP is demonstrated.

The kernel work is real. The hardware bet is defensible. The team can execute. In a market full of vLLM wrappers, that's a meaningful combination.

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.

Build This Startup with Claude Code

Complete replication guide — install as a slash command or rules file

# Build a Serverless GPU Inference Platform, Cumulus Labs Clone

A step-by-step guide to building a serverless GPU inference platform with cold-start optimization, multi-tenant scheduling, and custom model hosting.

## Step 1: Environment Setup and GPU Kernel Toolchain

Set up your development environment targeting NVIDIA GH200 (or H100/A100 as a starting point). Tools: CUDA Toolkit 12.4+, NCU profiler, NSight Systems, Python 3.11. Key files: attention_kernel.cu, kv_cache.cu, phantom_tile_scheduler.cu.

## Step 2: Database Schema

Core tables: models (id, name, hf_repo, quantization, snapshot_path), deployments (user_id, model_id, status, min/max_replicas, gpu_type), billing_events (deployment_id, gpu_seconds, tokens_in, tokens_out), snapshots (model_id, gpu_type, snapshot_path, restore_time_ms).

## Step 3: Model Registry and Storage Layer

Build the system that downloads, validates, and stores models for fast cold-start restoration. Snapshot creation: load model into GPU, warm up, serialize GPU memory state to NVMe. Snapshots are GPU-type-specific.

## Step 4: Inference Engine Core

Build Ion as a Python-callable C++ extension. Implement phantom-tile scheduling (oversubscribe SM grid 2-4x), eager KV writeback via cudaMemPrefetchAsync to CPU on GH200, and coherent CUDA graphs using NVLink-C2C for live parameter updates.

## Step 5: Cold-Start and Autoscaling Orchestrator

Build scheduler service (Go/Rust) + worker pool + heartbeat system. Cold-start flow: request arrives → assign GPU slot → load snapshot from NVMe → worker reports warm → route request. Target: <15s total. Scale-to-zero after 120s idle.

## Step 6: API Layer and Billing Integration

OpenAI-compatible endpoints: /v1/chat/completions, /v1/models. Billing middleware measures GPU-seconds per request. Stripe metered billing with 60s flush interval. IonRouter: proxy to own fleet or upstream providers (Together AI, Fireworks) based on model availability.

## Step 7: Deployment and GPU Fleet Management

Bare-metal GH200 + Kubernetes with NVIDIA device plugin. NVMe-backed PVCs for snapshot storage (local, not network). Monitor: cold start p95 (target <15s), GPU utilization (target >85%), KV cache hit rate, billing event lag. At 7,167 tok/s and $0.60/M tokens, need 85% utilization to break even.
claude-code-skills.md