Serving large language models (LLMs) at scale demands both high reliability and tight latency control. As applications increasingly rely on AI agents, inference demand is growing rapidly, with sharp, unpredictable spikes during peak hours. Databricks has built a platform to handle this load, serving over 120 trillion tokens per month for customers ranging from Superhuman to Fox Sports. The core hurdle, as detailed on the company's engineering blog, is making LLM serving consistently dependable.
The complexity stems from the hardware itself. State-of-the-art LLM inference relies on cutting-edge GPUs with high-bandwidth interconnects, components that are inherently less reliable than traditional CPUs. Failures in these systems can have a wide blast radius, and standard distributed-systems resilience tactics such as multi-AZ (multi-availability-zone) deployments are prohibitively expensive because of the cost of idle GPUs. Overprovisioning is similarly impractical given constraints on compute supply.
The Latency Tightrope
Maintaining low latency is critical, especially for advanced agents that cannot tolerate delays in time-to-first-token or output token generation. This becomes a balancing act: faster throughput often means higher costs, while striving for the absolute lowest latency can strain server resources. Even healthy servers slow down under heavy load, and certain request mixes can push them into unhealthy states unexpectedly.
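To make the two latency measures concrete, here is a minimal sketch of how time-to-first-token and the average time per output token could be computed from token arrival timestamps. The function name `latency_metrics` and its inputs are hypothetical and chosen for illustration; the source does not describe how Databricks measures these values.

```python
def latency_metrics(request_sent: float, token_times: list[float]) -> tuple[float, float]:
    """Return (time_to_first_token, mean_time_per_output_token) in seconds.

    request_sent: timestamp at which the request was issued.
    token_times:  arrival timestamp of each generated token, in order.
    """
    if not token_times:
        raise ValueError("no tokens were generated")
    # Time-to-first-token: how long the caller waits before any output appears.
    ttft = token_times[0] - request_sent
    if len(token_times) == 1:
        return ttft, 0.0
    # Mean gap between consecutive tokens after the first one.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


ttft, tpot = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25])
print(f"TTFT={ttft:.2f}s, time per output token={tpot:.2f}s")
```

Tracking the two values separately is useful because they can degrade independently: a loaded server may start responding late yet stream quickly, or the reverse.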
Introducing Model Units
To tame this complexity, Databricks developed an abstraction called "model units." It provides a VM-like way to allocate, route, and scale GPU resources per customer. By expressing a replica's processing capacity in model units per minute, the system can account for variable request costs: longer inputs or outputs consume more units. This makes capacity allocation predictable, a crucial property for production agentic workloads that demand low latency and guaranteed capacity.
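The idea of charging each request against a per-minute budget can be sketched as follows. This is an illustration under stated assumptions, not Databricks' implementation: the per-token weights, the `request_cost` formula, and the `Replica` class are all hypothetical, since the source says only that longer inputs or outputs consume more units.

```python
from dataclasses import dataclass

# Hypothetical weights (assumption): the real cost formula is not given in the source.
UNITS_PER_INPUT_TOKEN = 1
UNITS_PER_OUTPUT_TOKEN = 4


def request_cost(input_tokens: int, output_tokens: int) -> int:
    """Illustrative cost of one request, in model units."""
    return input_tokens * UNITS_PER_INPUT_TOKEN + output_tokens * UNITS_PER_OUTPUT_TOKEN


@dataclass
class Replica:
    """Hypothetical replica with a fixed model-unit budget per minute."""
    capacity_per_minute: int
    used_this_minute: int = 0

    def try_admit(self, cost: int) -> bool:
        """Admit a request only if it fits in the budget left for this minute."""
        if self.used_this_minute + cost > self.capacity_per_minute:
            return False
        self.used_this_minute += cost
        return True

    def start_new_minute(self) -> None:
        """Reset the budget at the start of each one-minute window."""
        self.used_this_minute = 0


replica = Replica(capacity_per_minute=3000)
long_request = request_cost(input_tokens=1000, output_tokens=200)   # 1800 units
short_request = request_cost(input_tokens=100, output_tokens=50)    # 300 units

print(replica.try_admit(long_request))   # True: 1800 of 3000 units used
print(replica.try_admit(long_request))   # False: 3600 would exceed the budget
print(replica.try_admit(short_request))  # True: 2100 of 3000 units used
```

Because cost scales with request size rather than request count, the same budget admits many short requests or a few long ones, which is what lets capacity be allocated predictably.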