Databricks Speeds Up Open-Source LLMs

Databricks is rolling out automatic prompt caching for open-source large language models (LLMs) on its platform. This feature, previously available only for proprietary models, accelerates LLM inference by reusing the computed state for identical prompt prefixes across requests. According to Databricks, this cuts wasted compute cycles, reduces latency, and increases overall throughput.

TL;DR: Repeated LLM prompts motivate automatic prompt caching, which the Databricks platform implements. Caching leads to reduced latency and increased throughput, and Databricks has expanded the feature's availability.

Repeated LLM Prompts: common in chatbots and batch tasks, leading to wasted compute cycles
Databricks Platform: introduces automatic prompt caching for open-source LLMs
Automatic Prompt Caching: reuses identical prompt prefixes across requests, skipping prefill stage
Reduced Latency: faster response times for LLM inference, improving user experience
Increased Throughput: more tokens processed per unit of compute, boosting efficiency
No User Configuration: feature works automatically without requiring manual setup
Feature Availability: now extends to open-source LLMs, not just proprietary ones


The core idea behind prompt caching is simple: why reprocess the same initial instructions or system prompts repeatedly? When a prompt prefix matches a cached entry, the model can skip the initial computation phase, known as the 'prefill' stage, for the matched portion and process only the tokens that follow it. This directly translates to lower latency and higher throughput, allowing more tokens to be processed per unit of compute.
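The prefix-matching idea can be illustrated with a toy sketch. It is not Databricks' implementation: the `PrefixCache` class, the `tokens_to_prefill` helper, and the block size of 4 tokens are hypothetical names and values chosen for illustration, and a plain string stands in for the cached KV state that a real serving system would hold in accelerator memory.

```python
from typing import Dict, List, Tuple

BLOCK_SIZE = 4  # hypothetical block size in tokens, chosen only for this example


class PrefixCache:
    """Toy in-memory prefix cache that matches prompts on whole token blocks."""

    def __init__(self) -> None:
        # Maps a prefix (as a tuple of token ids) to a placeholder for its KV state.
        self._store: Dict[Tuple[int, ...], str] = {}

    def longest_cached_prefix(self, tokens: List[int]) -> int:
        """Return how many leading tokens are already covered by cached blocks."""
        matched = 0
        for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
            if tuple(tokens[:end]) in self._store:
                matched = end
            else:
                break  # a prefix match must be contiguous from the start
        return matched

    def insert(self, tokens: List[int]) -> None:
        """Record every block-aligned prefix of this prompt as cached."""
        for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
            self._store.setdefault(tuple(tokens[:end]), f"kv[0:{end}]")


def tokens_to_prefill(cache: PrefixCache, tokens: List[int]) -> int:
    """Return the number of tokens that still need prefill computation."""
    reused = cache.longest_cached_prefix(tokens)
    cache.insert(tokens)
    return len(tokens) - reused


# Two requests that share an 8-token system prompt (made-up token ids).
demo_cache = PrefixCache()
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]
first = tokens_to_prefill(demo_cache, system_prompt + [100, 101])
second = tokens_to_prefill(demo_cache, system_prompt + [200, 201, 202])
print(first, second)
```

In this sketch the first request pays for all of its tokens, while the second only pays for the tokens after the shared system prompt. The dictionary lives only in process memory, so nothing is persisted.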

Why It Matters

Repeated prompts are common in many LLM applications, from chatbots using consistent system messages to batch processing tasks with identical initial instructions. Without caching, these repeated computations are a significant performance bottleneck.

Prompt caching amortizes the cost of long, domain-specific system prompts across many queries, so teams can use richer prompts to improve model quality in specific contexts without a proportional increase in inference cost. This is particularly relevant as research shows open-source models can now rival proprietary models in enterprise tasks through techniques like automated prompt optimization.
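The amortization argument can be made concrete with a back-of-the-envelope calculation. This is a sketch under idealized assumptions: the function name `prefill_tokens` and all token counts are hypothetical, and it assumes every request after the first fully hits the cache for the system prompt, which real workloads will not always achieve.

```python
def prefill_tokens(system_tokens: int, query_tokens: int,
                   num_queries: int, cached: bool) -> int:
    """Count input tokens that must go through prefill across all queries.

    Idealized model: with caching, the shared system prompt is computed once
    and reused by every later query; without caching, it is recomputed each time.
    """
    if num_queries == 0:
        return 0
    if not cached:
        return num_queries * (system_tokens + query_tokens)
    return system_tokens + num_queries * query_tokens


# Hypothetical workload: a 2,000-token system prompt, 100-token queries, 1,000 queries.
without_cache = prefill_tokens(2000, 100, 1000, cached=False)
with_cache = prefill_tokens(2000, 100, 1000, cached=True)
print(without_cache, with_cache, round(without_cache / with_cache, 1))
```

Under these made-up numbers, the longer the shared system prompt is relative to each query, the larger the share of prefill work that caching removes.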

Feature Availability and Security

Databricks has extended its built-in prompt caching to a range of open-weights models available through its Foundation Model APIs (FMAPIs). This includes models like GPT-OSS 20B and 120B, Gemma 3 12B, and various Llama 3.1 and 3.3 configurations. The feature is available for batch inference, pay-per-token, and provisioned-throughput workloads, and it implicitly powers higher-level services like Agent Bricks and Genie.

Security remains a priority, with prompt caches isolated to volatile memory and never persisted. The caching is entirely automatic; users do not need to configure any settings for Databricks Prompt Caching to function. Other companies are pursuing similar goals: Tensormesh, for example, exited stealth with $4.5M to slash AI inference caching costs.

Real-World Performance Gains

In early production tests on GPT-OSS models, Databricks observed substantial improvements. One large-scale batch inference pipeline saw a 2.5x increase in per-replica input-token throughput and a 3x reduction in P50 latency, even with a relatively modest cache hit ratio of 30%. This demonstrates the tangible impact of efficient caching.

By automatically reusing KV caches for identical prompt prefixes, Databricks enables faster, more cost-effective, and secure operation of open-source LLMs. This enhancement can significantly improve inference pipelines for various applications, from real-time chat to large-scale document processing.

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.