DeepSeek-V4: Million-Token Context is a Serving Problem

DeepSeek-V4's million-token context window presents an inference systems challenge, demanding sophisticated cache management and serving strategies to unlock its potential.

DeepSeek-V4's headline feature—a million-token context window—isn't just a leap in model capability; it's a fundamental shift that turns large language model deployment into an inference systems problem. According to Together AI, the model's architectural innovations, particularly its hybrid attention design, compress context before it is stored in the KV cache. This reduces memory pressure, but it requires inference engines to manage the resulting cache layouts, batch requests intelligently, and select the right endpoint profile for each workload.

The core issue is the KV cache, which grows in proportion to sequence length and layer count. At long contexts it becomes the bottleneck: it caps concurrency by consuming memory, and it slows decoding because every new token must read back the accumulated state. DeepSeek-V4 tackles this by reducing both the number of cache entries and the amount of data moved during attention.

The KV Cache Conundrum

Autoregressive inference necessitates storing prior context in a KV cache. As new tokens are generated, they attend to this stored state, leading to a cache size that scales with sequence length. For a 70B-class model, this cache can demand megabytes per token, making a million-token context impractical for a single request without significant optimization.
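
A back-of-the-envelope calculation shows the scale of the problem. The numbers below are illustrative assumptions for a 70B-class dense-attention model, not DeepSeek-V4's published configuration:

```python
# Back-of-the-envelope KV cache sizing. All numbers are illustrative
# assumptions for a 70B-class dense-attention model, not DeepSeek-V4's
# published configuration.
num_layers      = 80     # transformer layers
num_kv_heads    = 64     # KV heads (no grouped-query sharing assumed)
head_dim        = 128    # dimension per head
bytes_per_value = 2      # fp16 / bf16

# Each token stores one key and one value vector per head, per layer.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
print(f"KV cache per token: {bytes_per_token / 2**20:.1f} MiB")          # ~2.5 MiB

context_len = 1_000_000
total_bytes = bytes_per_token * context_len
print(f"Cache for one 1M-token request: {total_bytes / 2**40:.1f} TiB")  # ~2.4 TiB
```

Even with grouped-query attention bringing this down to a few hundred kilobytes per token, a single long request can consume a large share of a node's HBM, which is why compressing entries before they ever reach the cache matters.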

DeepSeek-V4's architecture, featuring Compressed Sparse Attention (CSA), Heavily Compressed Attention (HCA), and Sliding Window Attention (SWA), aims to mitigate this. CSA summarizes local neighborhoods, HCA provides a coarse global view through dense attention over compressed entries, and SWA maintains exact recent context. This creates a complex landscape of mixed cache types—compressed sparse, compressed dense, and local state—each with distinct sizes, lifetimes, and access patterns.
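
One way to picture the bookkeeping this creates is a per-request cache holding three stores with different shapes and lifetimes. The sketch below is illustrative only; the class, field names, and window size are assumptions, not DeepSeek-V4's internal layout:

```python
from dataclasses import dataclass, field

@dataclass
class RequestKVCache:
    """Illustrative per-request cache for a hybrid-attention model.

    Three stores, each with its own size, lifetime, and access pattern:
      csa_blocks  - compressed sparse entries summarizing local neighborhoods
      hca_summary - a small dense store of heavily compressed global entries
      swa_window  - exact KV state for the most recent tokens only
    """
    csa_blocks: dict[int, bytes] = field(default_factory=dict)  # block_id -> compressed KV
    hca_summary: list[bytes] = field(default_factory=list)      # coarse global view
    swa_window: list[bytes] = field(default_factory=list)       # exact recent context
    swa_window_size: int = 4096                                  # assumed window length

    def append_token(self, exact_kv: bytes) -> None:
        # New tokens always land in the exact sliding window...
        self.swa_window.append(exact_kv)
        # ...and the oldest exact entry falls out once the window is full,
        # surviving only in compressed form, so the engine's eviction and
        # reuse policy directly determines how much HBM a request holds.
        if len(self.swa_window) > self.swa_window_size:
            self.swa_window.pop(0)
```

An engine that treats all three stores identically wastes capacity; one that schedules and evicts them separately can pack far more requests onto the same GPU memory.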

The practical gains from V4's architectural shifts are realized through the inference engine's ability to manage these diverse cache states. Together AI's early work demonstrated that by optimizing cache policies to prioritize reuse and eviction of SWA states, they could expand the total KV cache capacity on a single NVIDIA HGX B200 node from 1.2 million tokens to 3.7 million tokens.
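
The exact policies are not public, but the flavor of the idea is easy to sketch: exact sliding-window state is the largest and cheapest to shed, while compressed entries are small enough to keep resident for reuse. A hypothetical pressure-relief routine (all method names below are assumptions, not Together AI's stack) might look like this:

```python
def relieve_cache_pressure(requests, hbm_budget_bytes):
    """Hypothetical sketch of an SWA-first eviction policy.

    `requests` holds objects exposing swa_bytes(), compressed_bytes(),
    and drop_oldest_swa(); none of these names come from a real serving
    stack. The idea: shed exact sliding-window state first, keeping the
    much smaller compressed CSA/HCA entries resident for prefix reuse.
    """
    def total_bytes():
        return sum(r.swa_bytes() + r.compressed_bytes() for r in requests)

    # Target the requests holding the most exact (uncompressed) state.
    for req in sorted(requests, key=lambda r: r.swa_bytes(), reverse=True):
        while total_bytes() > hbm_budget_bytes and req.swa_bytes() > 0:
            req.drop_oldest_swa()
        if total_bytes() <= hbm_budget_bytes:
            return True
    return total_bytes() <= hbm_budget_bytes
```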

Serving Regimes and Workload Splits

DeepSeek-V4's performance benefits are regime-dependent. Long-context, decode-heavy workloads, which spend significant time reading cache, see the most immediate gains from the compressed cache layout. This is where the model's million-token context design shines, benefiting applications like coding agents that process large codebases or research agents handling extensive documentation.

Conversely, short-context, prefill-heavy workloads are more sensitive to kernel maturity and prefill latency. Operations like CSA's top-k selection and HCA's compressed reads introduce complexities that require optimized kernels. For these shorter interactions, serving may necessitate different endpoint configurations, such as smaller tensor-parallel groups and minimal batching delays.
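
The top-k step itself is conceptually simple; the difficulty is doing it, plus the gather of the selected blocks, fast enough on the latency-critical path. A minimal PyTorch sketch of the selection follows; the shapes and the dot-product scoring rule are assumptions, and a production engine would use fused kernels instead:

```python
import torch

def select_topk_blocks(query: torch.Tensor,
                       block_summaries: torch.Tensor,
                       k: int = 16) -> torch.Tensor:
    """Sketch of the block-selection step in a compressed sparse attention
    branch: score each compressed block summary against the current query
    and keep only the top-k blocks for full attention.

    query:           [batch, heads, head_dim]
    block_summaries: [batch, heads, num_blocks, head_dim]
    Returns block indices of shape [batch, heads, k]. Illustrative only.
    """
    # Relevance score: dot product between the query and each block summary.
    scores = torch.einsum("bhd,bhnd->bhn", query, block_summaries)
    # The top-k is cheap; in practice the subsequent gather of the selected
    # blocks' full KV state dominates memory traffic.
    return torch.topk(scores, k=min(k, scores.shape[-1]), dim=-1).indices
```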

The same DeepSeek-V4 weights require distinct serving profiles depending on the workload. Long-context agents benefit from aggressive batching and prefix reuse, while short-context chat assistants require optimized kernels and minimal scheduling delay for rapid responses. The challenge for platforms like Together AI is to provide serving infrastructure flexible enough to cater to these varied demands.
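
Concretely, that can mean exposing the same weights behind differently tuned endpoint profiles. The fields and values below are illustrative guesses, not Together AI's actual configuration:

```python
from dataclasses import dataclass

@dataclass
class ServingProfile:
    """Hypothetical endpoint profile; field names and values are assumptions."""
    tensor_parallel: int     # GPUs sharing each layer's weights
    max_batch_tokens: int    # how aggressively the scheduler packs requests
    batching_delay_ms: int   # how long to wait for a batch to fill
    prefix_cache: bool       # reuse cached prefixes across requests

# Long-context agents: throughput-oriented, heavy batching, prefix reuse.
LONG_CONTEXT_AGENT = ServingProfile(
    tensor_parallel=8, max_batch_tokens=1_000_000,
    batching_delay_ms=50, prefix_cache=True,
)

# Short-context chat: latency-oriented, small TP group, minimal batching delay.
SHORT_CONTEXT_CHAT = ServingProfile(
    tensor_parallel=2, max_batch_tokens=16_384,
    batching_delay_ms=2, prefix_cache=False,
)
```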

Ultimately, realizing the potential of million-token context models hinges on inference systems capable of managing complex cache hierarchies and dynamic workload requirements. The architectural advancements are a starting point; the serving infrastructure is where the true value is unlocked.
