Databricks Tackles LLM Inference Costs

Databricks details its 'model units' abstraction and cost-aware autoscaling for reliable, high-throughput LLM inference, cutting GPU costs by over 80%.

Diagram illustrating Databricks' LLM inference architecture with data and control planes.
A high-level view of Databricks' serving infrastructure for large language models.

Serving large language models (LLMs) at scale demands both high reliability and tight latency control. As applications increasingly rely on AI agents, inference demand is growing rapidly and arrives in sharp, unpredictable spikes during peak hours. Databricks has been building a platform to handle this, serving over 120 trillion tokens monthly for clients ranging from Superhuman to Fox Sports. The core hurdle, as detailed in the company's engineering blog, is making LLM serving consistently dependable.

Visual TL;DR. LLM Inference Challenges leads to High GPU Costs. LLM Inference Challenges leads to Databricks Platform. Databricks Platform leads to Model Units Abstraction. Databricks Platform leads to Cost-Aware Autoscaling. Model Units Abstraction enables Reduced GPU Costs. Cost-Aware Autoscaling enables Reduced GPU Costs. Model Units Abstraction leads to Runtime Reliability. Cost-Aware Autoscaling leads to Runtime Reliability. Runtime Reliability leads to High-Throughput Inference. Reduced GPU Costs leads to High-Throughput Inference.

  1. LLM Inference Challenges: unpredictable spikes, latency control, hardware unreliability
  2. High GPU Costs: overprovisioning and multi-AZ deployments are prohibitively expensive
  3. Databricks Platform: serving over 120 trillion tokens monthly for clients
  4. Model Units Abstraction: a new abstraction for managing LLM resources
  5. Cost-Aware Autoscaling: optimizes GPU usage during fluctuating demand
  6. Runtime Reliability: ensuring consistent and dependable LLM serving
  7. Reduced GPU Costs: cutting costs by over 80 percent
  8. High-Throughput Inference: enabling efficient serving of large language models

The complexity stems from the hardware itself. State-of-the-art LLM inference relies on cutting-edge GPUs with high-bandwidth interconnects, components that are inherently less reliable than traditional CPUs. Failures in these systems can have a wide blast radius, and standard distributed-systems resilience tactics such as multi-AZ (multi-availability-zone) deployments are prohibitively expensive because they leave costly GPUs idle. Overprovisioning is similarly impractical given compute supply constraints.

The Latency Tightrope

Maintaining low latency is critical, especially for advanced agents that cannot tolerate delays in time-to-first-token or output token generation. This becomes a balancing act: faster throughput often means higher costs, while striving for the absolute lowest latency can strain server resources. Even healthy servers slow down under heavy load, and certain request mixes can push them into unhealthy states unexpectedly.

Introducing Model Units

To tame this complexity, Databricks developed an abstraction called "model units." This approach provides a VM-like way to allocate, route, and scale GPU resources per customer. By expressing a replica's processing capacity in model units per minute, the system can account for variable request costs: longer inputs or outputs consume more units. This allows for predictable capacity allocation, a crucial feature for production agentic workloads that demand low latency and guaranteed capacity.
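The accounting idea can be illustrated with a minimal sketch. The article does not publish how a model unit is computed, so the per-token weights, the `Request` type, and both function names below are hypothetical and chosen only to show how variable request costs translate into capacity planning.

```python
import math
from dataclasses import dataclass

# Hypothetical weights: the article does not give a formula for model units.
INPUT_UNITS_PER_TOKEN = 1.0
OUTPUT_UNITS_PER_TOKEN = 4.0  # assumption: a generated token costs more than a prompt token


@dataclass
class Request:
    input_tokens: int
    output_tokens: int


def request_cost(req: Request) -> float:
    """Cost of one request in (hypothetical) model units."""
    return (req.input_tokens * INPUT_UNITS_PER_TOKEN
            + req.output_tokens * OUTPUT_UNITS_PER_TOKEN)


def replicas_needed(requests_per_minute: list[Request], replica_capacity: float) -> int:
    """Replicas required to serve one minute of traffic, rounded up.

    replica_capacity is the model units per minute a single replica can sustain.
    """
    total_units = sum(request_cost(r) for r in requests_per_minute)
    return max(1, math.ceil(total_units / replica_capacity))


if __name__ == "__main__":
    short = Request(input_tokens=100, output_tokens=100)
    long_context = Request(input_tokens=1000, output_tokens=100)
    print(request_cost(short), request_cost(long_context))
    print(replicas_needed([long_context] * 10, replica_capacity=5000.0))
```

Under these assumed weights, ten long-context requests per minute need more replicas than ten short ones even though the request count is identical, which is the distinction that counting requests alone would miss.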

Cost-Aware Operations

Traditional load-balancing heuristics fall short for LLMs. Databricks uses a system called Dicer, which dynamically routes workloads based on server load measured in model units rather than request counts. This load-aware routing prevents hotspots caused by long-context requests, improves cache hit rates through stateful sessions, and limits the blast radius of failures. The autoscaler also uses model unit utilization to trigger scaling events, so resources track actual demand rather than just the number of pending requests. This combination of cost-based load balancing and autoscaling reportedly saved over 80% in GPU costs compared to static provisioning.
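A rough sketch of both ideas follows. It is not Dicer's implementation or API, which the article does not describe. The `Replica` and `Router` classes, the sticky-session rule, and the 0.7 target utilization are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    capacity: float     # model units per minute the replica can sustain (assumed)
    load: float = 0.0   # model units currently assigned


class Router:
    """Routes by projected model-unit load, with sticky sessions (hypothetical design)."""

    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.sessions = {}  # session id -> replica, so repeat traffic can reuse caches

    def route(self, session_id: str, cost: float) -> Replica:
        sticky = self.sessions.get(session_id)
        # Stay on the session's replica while it has headroom, otherwise rebalance.
        if sticky is not None and sticky.load + cost <= sticky.capacity:
            target = sticky
        else:
            target = min(self.replicas, key=lambda r: (r.load + cost) / r.capacity)
        target.load += cost
        self.sessions[session_id] = target
        return target


def desired_replicas(replicas, target_utilization: float = 0.7) -> int:
    """Replica count that would bring average model-unit utilization to the target."""
    total_load = sum(r.load for r in replicas)
    avg_capacity = sum(r.capacity for r in replicas) / len(replicas)
    return max(1, math.ceil(total_load / (avg_capacity * target_utilization)))


if __name__ == "__main__":
    fleet = [Replica("a", 1000.0), Replica("b", 1000.0)]
    router = Router(fleet)
    for session, cost in [("s1", 800.0), ("s2", 100.0), ("s1", 100.0), ("s1", 200.0)]:
        print(session, cost, "->", router.route(session, cost).name)
    print("desired replicas:", desired_replicas(fleet))
```

In this sketch a single expensive request counts for as much load as many cheap ones, so a replica handling one long-context request is not treated as idle merely because its request count is low.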

Runtime Reliability is Non-Negotiable

Beyond smart routing and scaling, Databricks implements runtime reliability mechanisms. Silent hangs, where requests trigger unhandled errors and servers stop responding, are detected via periodic black-box health checks. To prevent these checks from failing under load, they are assigned the highest scheduling priority, ensuring timely detection and recovery. This system, which restarts unhealthy servers via Kubernetes liveness probes, brings the recovery cycle to under five minutes and has eliminated false probe failures.
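The effect of prioritizing health checks can be sketched with a small priority queue. This is an illustrative model only: the article does not describe the server's scheduler, so the priority values, the `PriorityScheduler` class, and the `is_hung` rule (a hang is declared after a number of missed probe periods) are assumptions.

```python
import heapq
import itertools

HEALTH_CHECK_PRIORITY = 0   # assumed convention: lower number is scheduled first
USER_REQUEST_PRIORITY = 10


class PriorityScheduler:
    """Hypothetical request queue in which health checks jump ahead of user traffic."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order within a priority

    def submit(self, name: str, priority: int) -> None:
        heapq.heappush(self._heap, (priority, next(self._counter), name))

    def next_task(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


def is_hung(last_success: float, now: float, period: float, failure_threshold: int) -> bool:
    """True once more than `failure_threshold` probe periods pass without a success."""
    return now - last_success > period * failure_threshold


if __name__ == "__main__":
    sched = PriorityScheduler()
    sched.submit("user-1", USER_REQUEST_PRIORITY)
    sched.submit("user-2", USER_REQUEST_PRIORITY)
    sched.submit("health-check", HEALTH_CHECK_PRIORITY)
    sched.submit("user-3", USER_REQUEST_PRIORITY)
    while len(sched):
        print(sched.next_task())
```

Because the probe is served ahead of queued user requests, a busy but healthy server can still answer it promptly, so a missed probe is more likely to indicate a real hang than ordinary load.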

Multimodal requests, which can be significantly more resource-intensive, are also being optimized. An investigation into error-rate spikes from image requests revealed bottlenecks upstream of the core inference processes, prompting further work on the preprocessing systems.

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.