Databricks Tackles LLM Inference Costs

Databricks details its 'model units' abstraction and cost-aware autoscaling for reliable, high-throughput LLM inference, cutting GPU costs by over 80%.

May 28 at 12:46 AM, 8 min read

Figure: Databricks' LLM inference architecture with data and control planes, a high-level view of the company's serving infrastructure for large language models.

Visual TL;DR. LLM inference challenges lead to high GPU costs and motivate the Databricks platform. The platform introduces the model units abstraction and cost-aware autoscaling. Both reduce GPU costs and support runtime reliability, and runtime reliability and reduced GPU costs together lead to high-throughput inference.

LLM Inference Challenges: unpredictable spikes, latency control, hardware unreliability
High GPU Costs: overprovisioning and multi-AZ deployments are prohibitively expensive
Databricks Platform: serving over 120 trillion tokens monthly for clients
Model Units Abstraction: a new abstraction for managing LLM resources
Cost-Aware Autoscaling: optimizes GPU usage during fluctuating demand
Runtime Reliability: ensuring consistent and dependable LLM serving
Reduced GPU Costs: cutting costs by over 80 percent
High-Throughput Inference: enabling efficient serving of large language models


Serving large language models (LLMs) at scale demands both high reliability and tight latency control. As applications increasingly rely on AI agents, inference demand is growing fast and arrives in sharp, unpredictable spikes during peak hours. Databricks has been building a platform to handle this, serving over 120 trillion tokens monthly for clients ranging from Superhuman to Fox Sports. The core hurdle, as detailed in the company's engineering blog, lies in making LLM serving consistently dependable.

The complexity stems from the hardware itself. State-of-the-art LLM inference relies on cutting-edge GPUs with high-bandwidth interconnects, components that are inherently less reliable than traditional CPUs. Failures in these systems can have a wide blast radius, and standard distributed systems resilience tactics like multi-AZ deployments are prohibitively expensive due to idle GPU costs. Overprovisioning is similarly impractical given compute supply constraints.

The Latency Tightrope

Maintaining low latency is critical, especially for advanced agents that cannot tolerate delays in time-to-first-token or output token generation. This becomes a balancing act: faster throughput often means higher costs, while striving for the absolute lowest latency can strain server resources. Even healthy servers slow down under heavy load, and certain request mixes can push them into unhealthy states unexpectedly.

Introducing Model Units

To tame this complexity, Databricks developed an abstraction called "model units." This approach provides a VM-like way to allocate, route, and scale GPU resources per customer. By expressing a replica's processing capacity in model units per minute, the system can account for variable request costs: requests with longer inputs or outputs consume more units. This allows for predictable capacity allocation, a crucial feature for production agentic workloads demanding low latency and guaranteed capacity.
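The accounting idea can be illustrated with a minimal sketch. The article does not publish how Databricks converts tokens into model units, so the per-token weights, the `Request` type, and the function names below are hypothetical and serve only to show how a capacity stated in units per minute translates into a replica count.

```python
import math
from dataclasses import dataclass

# Hypothetical weights: the real token-to-model-unit conversion is not
# given in the article. Output tokens are assumed to cost more than input.
INPUT_UNITS_PER_TOKEN = 1.0
OUTPUT_UNITS_PER_TOKEN = 4.0


@dataclass
class Request:
    input_tokens: int
    output_tokens: int


def request_cost(req: Request) -> float:
    """Cost of one request in illustrative model units."""
    return (req.input_tokens * INPUT_UNITS_PER_TOKEN
            + req.output_tokens * OUTPUT_UNITS_PER_TOKEN)


def replicas_needed(minute_of_traffic: list[Request],
                    replica_capacity: float) -> int:
    """Replicas required to serve one minute of traffic, rounded up.

    replica_capacity is expressed in model units per minute.
    """
    total_units = sum(request_cost(r) for r in minute_of_traffic)
    return max(1, math.ceil(total_units / replica_capacity))


short = Request(input_tokens=100, output_tokens=50)
long_context = Request(input_tokens=8000, output_tokens=500)
traffic = [short] * 10 + [long_context] * 3
print(replicas_needed(traffic, replica_capacity=12_000))
```

Under these assumed weights, three long-context requests account for ten times the load of ten short ones, which a plain request count would not reveal.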

Cost-Aware Operations

Traditional load-balancing heuristics fall short for LLMs. Databricks uses a system called Dicer, which dynamically routes workloads based on server load measured in model units rather than request counts. This load-aware routing prevents hotspots caused by long-context requests, improves cache hit rates through stateful sessions, and limits the blast radius of failures. The autoscaler also uses model unit utilization to trigger scaling events, so resources track actual demand rather than just the number of pending requests. This combination of cost-based load balancing and autoscaling reportedly saved over 80% in GPU costs compared to static provisioning.
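A rough sketch of the two mechanisms described above, under stated assumptions, might look like the following. It is not Dicer's API or Databricks' autoscaler: the `Replica` class, the least-utilized routing rule, and the 0.7 utilization target are all hypothetical choices made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    capacity: float         # model units per minute the replica can sustain
    in_flight: float = 0.0  # model units currently assigned to it


def route(replicas: list[Replica], cost: float) -> Replica:
    """Pick the replica that would be least utilized after taking the request.

    Load is measured in model units, so one long-context request counts
    for more than many short ones.
    """
    target = min(replicas, key=lambda r: (r.in_flight + cost) / r.capacity)
    target.in_flight += cost
    return target


def desired_replicas(replicas: list[Replica],
                     target_utilization: float = 0.7) -> int:
    """Replica count that would bring average utilization to the target."""
    used = sum(r.in_flight for r in replicas)
    avg_capacity = sum(r.capacity for r in replicas) / len(replicas)
    return max(1, math.ceil(used / (avg_capacity * target_utilization)))


fleet = [Replica("a", 1000.0), Replica("b", 1000.0)]
route(fleet, 800.0)   # one long-context request lands on "a"
route(fleet, 100.0)   # short requests are steered to "b"
route(fleet, 100.0)
print([(r.name, r.in_flight) for r in fleet], desired_replicas(fleet))
```

A request-count balancer would have alternated between the two replicas here, stacking a short request on top of the long one. Scaling on utilization in model units likewise reacts to how heavy the traffic is, not only to how many requests are waiting.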

Runtime Reliability is Non-Negotiable

Beyond smart routing and scaling, Databricks implements runtime reliability mechanisms. Silent hangs, where requests trigger unhandled errors and servers stop responding, are detected via periodic black-box health checks. To keep these checks from failing under load, Databricks assigns them the highest scheduling priority, ensuring timely detection and recovery. This system, which restarts unhealthy servers via Kubernetes liveness probes, brings the recovery cycle to under five minutes and has eliminated false probe failures.
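The prioritization idea can be sketched with a simple priority queue. This is an illustration only, not the scheduler Databricks runs: the `PriorityScheduler` class and the two priority levels are hypothetical. The point is that a health check submitted behind a backlog of user requests is still served first, so a busy but healthy server does not fail its probe.

```python
import heapq
import itertools
from typing import Optional

HEALTH_CHECK = 0  # lower number is scheduled first
USER_REQUEST = 1


class PriorityScheduler:
    """Toy request queue in which health checks jump ahead of user traffic."""

    def __init__(self) -> None:
        self._heap = []
        # Monotonic counter keeps FIFO order among tasks of equal priority.
        self._counter = itertools.count()

    def submit(self, priority: int, name: str) -> None:
        heapq.heappush(self._heap, (priority, next(self._counter), name))

    def next_task(self) -> Optional[str]:
        """Return the next task name, or None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


sched = PriorityScheduler()
sched.submit(USER_REQUEST, "req-1")
sched.submit(USER_REQUEST, "req-2")
sched.submit(HEALTH_CHECK, "probe")
sched.submit(USER_REQUEST, "req-3")
print([sched.next_task() for _ in range(4)])
```

With this ordering, a probe that times out is far more likely to indicate a hung server than one that is merely under heavy load.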

Multimodal requests, which can be significantly more resource-intensive, are also being optimized. An investigation into error-rate spikes from image requests revealed bottlenecks upstream of the core inference process, prompting a closer look at the preprocessing systems.

© 2026 StartupHub.ai. All rights reserved.
