Cloudflare is detailing the engineering feats behind its Workers AI platform, which now hosts large open-source models like Moonshot’s Kimi K2.5. The company has already tripled Kimi K2.5's speed and is actively developing further model integrations.
Running massive AI models demands a careful balance of software and expensive hardware. Cloudflare tackles this challenge by squeezing more performance out of that hardware through sophisticated software engineering.
Hardware Configurations Tailored for LLMs
The optimal hardware setup for LLMs depends heavily on input and output token volumes. Use cases range from generating lengthy content from short prompts to summarizing vast amounts of text into concise outputs.
For agentic applications, which are a primary focus for Workers AI, handling large input token volumes is critical. This involves processing extensive system prompts, tool definitions, and growing conversational context.
Cloudflare prioritizes fast input token processing and rapid tool calling for these agentic workloads.
Prefill/Decode (PD) Disaggregation for Efficiency
To boost performance, Cloudflare employs disaggregated prefill. This architectural choice separates the prefill stage (input processing, KV cache population) from the decode stage (output generation).
Prefill is typically compute-bound, while decode is memory-bandwidth-bound, so the two stages stress different GPU resources. Running both on the same machine leaves part of the GPU's capacity underutilized.
By disaggregating these stages, separate inference servers can be optimized and scaled independently. This allows for fine-tuning based on traffic patterns, whether input-heavy or output-heavy.
Implementing this requires a complex load balancer capable of routing requests, rewriting responses, and managing KV cache transfers between stages. Cloudflare developed token-aware load balancing to distribute workload evenly across prefill and decode endpoints.
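Cloudflare has not published the load balancer's internals, but the rough idea of token-aware routing can be sketched as below: requests are scored by their token counts and sent to the least-loaded server in each pool. The class names, pools, and token-accounting heuristic are illustrative assumptions, not Cloudflare's implementation.

```python
# Minimal sketch of token-aware routing between prefill and decode pools.
# All names and heuristics here are illustrative, not Cloudflare's implementation.
from dataclasses import dataclass


@dataclass
class Endpoint:
    url: str
    pending_tokens: int = 0  # tokens queued but not yet processed


@dataclass
class TokenAwareRouter:
    prefill_pool: list[Endpoint]
    decode_pool: list[Endpoint]

    def pick_prefill(self, prompt_tokens: int) -> Endpoint:
        # Route the compute-bound prefill stage to the least-loaded prefill
        # server, measured in outstanding tokens rather than raw request count.
        target = min(self.prefill_pool, key=lambda e: e.pending_tokens)
        target.pending_tokens += prompt_tokens
        return target

    def pick_decode(self, expected_output_tokens: int) -> Endpoint:
        # Decode is memory-bandwidth-bound; balance on expected generation length.
        target = min(self.decode_pool, key=lambda e: e.pending_tokens)
        target.pending_tokens += expected_output_tokens
        return target


# Usage: a request is prefilled on one server, its KV cache handed off,
# and generation continues on a separately scaled decode server.
router = TokenAwareRouter(
    prefill_pool=[Endpoint("prefill-1"), Endpoint("prefill-2")],
    decode_pool=[Endpoint("decode-1")],
)
prefill_target = router.pick_prefill(prompt_tokens=12_000)
decode_target = router.pick_decode(expected_output_tokens=500)
```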
Analysis of post-launch usage patterns led to configuration tuning, resulting in a significant reduction in tail latency variance. P90 time-to-first-token dropped, and inter-token latency saw a threefold improvement.
Prompt Caching for Long Contexts
Efficient prompt caching is essential for agentic use cases with long contexts: it avoids recomputing the model's attention state (the KV cache) for prompt prefixes that have already been processed.
An `x-session-affinity` header routes requests from the same session to a region that already holds that precomputed state, improving throughput for applications like OpenCode.
Cloudflare incentivizes the use of this header through discounted cached tokens, promoting faster inference and lower costs.
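As an illustration of how a client might opt in, the sketch below sends the `x-session-affinity` header on an OpenAI-style chat completion request using Python's requests library. The endpoint URL, model identifier, and token are placeholders; only the header name comes from the description above.

```python
# Illustrative only: endpoint URL, model name, and token are placeholders.
# The x-session-affinity header pins requests from one session to the same
# region so previously computed prompt state can be reused.
import requests

API_URL = "https://api.example.com/v1/chat/completions"  # placeholder endpoint

response = requests.post(
    API_URL,
    headers={
        "Authorization": "Bearer <API_TOKEN>",
        "x-session-affinity": "conversation-1234",  # stable per-conversation ID
    },
    json={
        "model": "kimi-k2.5",  # placeholder model identifier
        "messages": [
            {"role": "system", "content": "You are a coding agent."},
            {"role": "user", "content": "Summarize the open pull requests."},
        ],
    },
    timeout=60,
)
print(response.json())
```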
Adoption of this header by internal users boosted input token cache hit ratios from 60% to 80% during peak times, significantly increasing request throughput.
KV-Cache Optimization Across Multiple GPUs
As models grow, a single instance can span multiple GPUs, necessitating efficient KV cache sharing.
Cloudflare leverages Moonshot AI’s Mooncake Transfer Engine and Store for high-performance data transfer across GPUs without CPU involvement.
This enables KV cache to be shared across nodes, allowing for more even load balancing and eliminating the need for session-aware routing within a cluster.
Mooncake Store extends cache capacity beyond GPU VRAM by utilizing NVMe storage, improving cache hit ratios and handling capacity.
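Mooncake's real interfaces differ, but the underlying idea of tiering the KV cache across GPU memory and NVMe can be sketched as a two-level lookup keyed by a hash of the prompt prefix. Everything below, including the class and method names, is a simplified illustration rather than Mooncake's API.

```python
# Simplified two-tier KV-cache lookup: a fast GPU-memory tier backed by a larger
# NVMe tier. Class and method names are illustrative, not Mooncake's API.
import hashlib


class TieredKVCache:
    def __init__(self, vram_capacity: int):
        self.vram: dict[str, bytes] = {}   # hot tier (GPU memory in practice)
        self.nvme: dict[str, bytes] = {}   # cold tier (NVMe-backed store)
        self.vram_capacity = vram_capacity

    @staticmethod
    def key_for(prefix_tokens: list[int]) -> str:
        # Cache entries are keyed by a hash of the prompt-prefix token IDs.
        raw = ",".join(map(str, prefix_tokens)).encode()
        return hashlib.sha256(raw).hexdigest()

    def get(self, prefix_tokens: list[int]) -> bytes | None:
        key = self.key_for(prefix_tokens)
        if key in self.vram:
            return self.vram[key]              # hit in the fast tier
        if key in self.nvme:
            blob = self.nvme[key]
            self._promote(key, blob)           # pull back into the fast tier
            return blob
        return None                            # miss: prefill must recompute

    def put(self, prefix_tokens: list[int], kv_blob: bytes) -> None:
        key = self.key_for(prefix_tokens)
        self._promote(key, kv_blob)
        self.nvme[key] = kv_blob               # always persisted to the NVMe tier

    def _promote(self, key: str, blob: bytes) -> None:
        if len(self.vram) >= self.vram_capacity:
            evicted = next(iter(self.vram))    # naive FIFO eviction for brevity
            self.nvme[evicted] = self.vram.pop(evicted)
        self.vram[key] = blob
```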
Speculative Decoding for Faster Generation
Speculative decoding uses a smaller draft model to generate candidate tokens, which a larger target model then validates. This speeds up token generation by reducing the target model's computation.
For agentic tasks involving predictable structured outputs like tool calls, speculative decoding is particularly effective.
Cloudflare utilizes NVIDIA’s EAGLE-3 draft model for Kimi K2.5, achieving high-quality inference with increased tokens-per-second throughput.
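In outline, the draft model proposes a short run of candidate tokens and the target model verifies them, keeping the longest accepted prefix so several tokens can be emitted per expensive forward pass. The sketch below shows a simplified greedy variant with stand-in model callables; it does not reflect EAGLE-3's architecture or probability-based acceptance.

```python
# Simplified greedy speculative decoding loop. The draft/target callables are
# stand-ins for real models; a production engine verifies all candidates in one
# batched forward pass, which is where the speedup comes from.
from typing import Callable


def speculative_decode(
    draft_next: Callable[[list[int]], int],    # cheap draft model: next token
    target_next: Callable[[list[int]], int],   # expensive target model: next token
    prompt: list[int],
    max_new_tokens: int = 64,
    draft_len: int = 4,                        # tokens proposed per step
) -> list[int]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. Draft model proposes draft_len candidate tokens.
        candidates = []
        ctx = list(tokens)
        for _ in range(draft_len):
            t = draft_next(ctx)
            candidates.append(t)
            ctx.append(t)

        # 2. Target model verifies the candidates and accepts the longest
        #    matching prefix.
        accepted = 0
        ctx = list(tokens)
        for cand in candidates:
            if target_next(ctx) == cand:
                ctx.append(cand)
                accepted += 1
            else:
                break

        # 3. Keep the accepted prefix, then take one token from the target
        #    model so progress is guaranteed even if nothing was accepted.
        tokens.extend(candidates[:accepted])
        tokens.append(target_next(tokens))
    return tokens
```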
Infire: Cloudflare's Proprietary Inference Engine
Cloudflare's Infire engine, written in Rust, is designed for efficient machine learning inference across its distributed network.
Infire now supports multi-GPU configurations, essential for models whose weights exceed a single GPU's VRAM; Kimi K2.5, for example, requires multiple H100s.
It offers pipeline, tensor, and expert parallelism modes to optimize throughput and latency.
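Infire's configuration surface is not public, but conceptually a multi-GPU deployment combines these parallelism degrees so that they cover the available GPUs. The config shape below is purely hypothetical and intended only to show what each knob controls.

```python
# Hypothetical parallelism configuration; the field names do not correspond to
# Infire's actual settings and exist only to illustrate the three modes.
from dataclasses import dataclass


@dataclass
class ParallelismConfig:
    pipeline_parallel: int = 1  # split consecutive layers into stages on different GPUs
    tensor_parallel: int = 1    # split each layer's weight matrices across GPUs
    expert_parallel: int = 1    # place different MoE experts on different GPUs

    def gpus_per_replica(self) -> int:
        # In many engines, pipeline and tensor parallelism compose multiplicatively,
        # while expert parallelism reuses the same GPU group for the MoE layers.
        return self.pipeline_parallel * self.tensor_parallel


# Example: shard a large mixture-of-experts model across 8 GPUs on one node.
cfg = ParallelismConfig(tensor_parallel=8, expert_parallel=8)
assert cfg.gpus_per_replica() == 8
```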
Further optimizations have reduced Infire's GPU memory overhead compared to vLLM, enabling larger context windows.
Infire achieves cold-start times under 20 seconds, even for the largest models, limited only by drive speed.
Maximizing Hardware Throughput
The Infire engine extracts up to 20% more tokens-per-second throughput from the same hardware.
It also makes it possible to run the latest models on lower-end hardware that previously could not host them.
Cloudflare continuously optimizes its technology stack to provide high-performance inference while ensuring efficient GPU utilization.
