The inference market has a commoditization problem. Most "GPU cloud" companies rent H100s from CoreWeave, bolt an OpenAI-compatible endpoint on top of vLLM or TGI, and call it a product. Differentiation is a landing page. Moat is a blog post. The race to the bottom is already underway, and nobody particularly interesting is winning it.
Cumulus Labs is not playing that game. The YC W26 company is building Ion, a C++ inference engine written specifically for the NVIDIA GH200 Grace Hopper architecture, with custom CUDA kernels that deliver throughput numbers nobody else is publishing. The benchmarks they're posting make Together AI look slow: 588 tokens per second versus Together's 298 on the same Llama workload, roughly 2x the throughput at the same price. 7,167 tok/s total on a single GH200 chip. Cold starts in 12.5 seconds while Modal is still apologizing for its 70-second spin-up times.
