Cloudflare Unweights LLMs by 22%

Cloudflare's 'Unweight' system slashes LLM model sizes by up to 22% using lossless compression, enhancing inference speed and efficiency.

Image: Diagram of Cloudflare's Unweight system compressing LLM weights to reduce memory usage. (Source: Cloudflare)

Cloudflare has engineered a novel approach to a critical bottleneck in large language model (LLM) inference: the memory traffic generated by model weights. Dubbed 'Unweight', the system achieves a 15-22% reduction in model size through lossless compression, without compromising output quality. The gain matters most on high-performance hardware like NVIDIA H100 GPUs, where memory bandwidth, not compute power, often limits inference performance.

Generating a single token from an LLM requires reading every model weight. On H100s, tensor cores can outpace memory delivery by nearly 600x. Cloudflare's Unweight system addresses this by decompressing weights directly into fast on-chip shared memory and feeding them straight to the tensor cores, so the decompressed values never make a round-trip through main memory. This significantly reduces the bytes that must traverse the memory bus, making inference faster and cheaper.
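The scale of that imbalance follows from back-of-envelope arithmetic. The exact figures behind the 600x claim aren't stated in the article; the numbers below are the published H100 SXM spec-sheet values, used here only as an illustration:

```python
# Back-of-envelope check of the H100's compute-to-bandwidth gap.
# Assumed spec-sheet figures (not from the article): ~1,979 TFLOPS of
# BF16 tensor-core throughput (with sparsity), ~3.35 TB/s of HBM3 bandwidth.
PEAK_BF16_FLOPS = 1_979e12     # floating-point operations per second
HBM_BYTES_PER_SEC = 3.35e12    # bytes delivered from main memory per second

flops_per_byte = PEAK_BF16_FLOPS / HBM_BYTES_PER_SEC
print(f"~{flops_per_byte:.0f} FLOPs per byte delivered")  # ~591, i.e. ~600x
```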

The Compression Challenge

While techniques like quantization can shrink models, they are lossy and can degrade response quality. Cloudflare sought a lossless method that preserves exact model behavior. Existing lossless approaches typically target at-rest storage or depend on specialized hardware; neither suits Cloudflare's requirement: decompression at inference time, on Hopper GPUs, integrated with its Rust-based inference engine.

The core challenge lies not just in compressing LLM weights, but in decompressing them fast enough to avoid slowing down inference. On Hopper GPUs, compute units cannot simultaneously run decompression and matrix multiplication kernels due to shared memory constraints. Any decompression latency not perfectly overlapped with computation directly adds to token latency.


Exploiting Exponent Redundancy

LLM weights are typically stored as 16-bit Brain Floating Point (BF16) numbers: a sign bit, an 8-bit exponent, and a 7-bit mantissa. Cloudflare's Unweight focuses on the exponent, which exhibits predictable patterns across trained LLMs: of the 256 possible exponent values, a small subset, typically the top 16, accounts for over 99% of all weights in a layer. This redundancy is exploited with Huffman coding, which assigns shorter codes to the most common exponents.
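A minimal CPU-side sketch of the idea, assuming nothing about Cloudflare's encoder beyond what is described above. The synthetic Gaussian weights here are more concentrated than real LLM layers, so the toy compresses harder than the article's ~30% figure:

```python
import heapq
from collections import Counter
import numpy as np

def bf16_exponents(weights_f32: np.ndarray) -> np.ndarray:
    """BF16 is the top 16 bits of float32 (1 sign, 8 exponent, 7 mantissa
    bits), so exponents can be read straight out of a float32 array."""
    bits = weights_f32.view(np.uint32)
    return ((bits >> 23) & 0xFF).astype(np.uint8)

def huffman_codes(counts: Counter) -> dict:
    """Textbook Huffman construction: repeatedly merge the two rarest
    symbols; frequent exponents end up with the shortest codes."""
    heap = [[cnt, i, {sym: ""}] for i, (sym, cnt) in enumerate(counts.items())]
    heapq.heapify(heap)
    uid = len(heap)
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        codes = {s: "0" + c for s, c in lo[2].items()}
        codes.update({s: "1" + c for s, c in hi[2].items()})
        heapq.heappush(heap, [lo[0] + hi[0], uid, codes])
        uid += 1
    return heap[0][2]

# Toy demo: small, zero-centered weights cluster on a few exponents.
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=1_000_000).astype(np.float32)
counts = Counter(bf16_exponents(weights).tolist())

codes = huffman_codes(counts)
avg_bits = sum(n * len(codes[s]) for s, n in counts.items()) / weights.size
print(f"{len(counts)} distinct exponents; {avg_bits:.2f} bits/exponent "
      f"vs. 8 raw -> {1 - avg_bits / 8:.0%} exponent-stream compression")
```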

This selective compression, applied primarily to the Multi-Layer Perceptron (MLP) weights, which account for the majority of parameters and memory traffic, achieves approximately 30% compression on the exponent stream. Because the exponent occupies 8 of each weight's 16 bits, compressing that stream by roughly 30% removes about 15% of total weight bytes; in practice the overall reduction in MLP weight size lands between 15% and 22%, saving around 3 GB of VRAM per model.

Flexible Inference Pipelines

Unweight offers four distinct execution strategies, dynamically chosen based on workload characteristics like batch size and weight matrix shape. These pipelines balance decompression effort against computational complexity.

At one end, 'full decode' reconstructs the original BF16 weights for a standard matrix multiplication. At the other, 'direct palette' skips preprocessing entirely: weights are pre-transcoded to a compact 4-bit palette format, and the matrix multiplication kernel reconstructs BF16 values on the fly. In between sit 'exponent-only decode' and 'palette transcode', each trading a different amount of decompression traffic against kernel complexity.
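The article doesn't spell out the 4-bit layout, but the natural reading is that each weight's 8-bit exponent becomes a 4-bit index into the layer's 16-entry exponent palette. A hedged sketch of that transcode-and-reconstruct round trip; the packing layout and helper names are assumptions, not Cloudflare's format:

```python
import numpy as np

def palette_transcode(w_f32, palette):
    """Illustrative 'direct palette' layout: sign(1) + palette index(4) +
    mantissa(7) packed into 12 bits. `palette` holds the layer's most
    common exponents; this sketch assumes it covers every weight."""
    bf16 = w_f32.view(np.uint32) >> 16            # BF16 = top 16 bits of f32
    sign, exp, man = (bf16 >> 15) & 1, (bf16 >> 7) & 0xFF, bf16 & 0x7F
    lut = {e: i for i, e in enumerate(palette)}
    idx = np.array([lut[e] for e in exp.tolist()], dtype=np.uint32)
    return (sign << 11) | (idx << 7) | man

def palette_reconstruct(packed, palette):
    """What the matmul kernel does on the fly: look the exponent back up
    and reassemble the BF16 bit pattern in fast memory."""
    exp = np.asarray(palette, dtype=np.uint32)[(packed >> 7) & 0xF]
    bf16 = (((packed >> 11) & 1) << 15) | (exp << 7) | (packed & 0x7F)
    return (bf16.astype(np.uint32) << 16).view(np.float32)

w = np.float32([0.031, -0.007, 0.25, -0.0004])    # toy "weights"
palette = sorted(set(((w.view(np.uint32) >> 23) & 0xFF).tolist()))
round_trip = palette_reconstruct(palette_transcode(w, palette), palette)
assert np.array_equal(round_trip, ((w.view(np.uint32) >> 16) << 16).view(np.float32))
```

The round trip is exact: only the exponent's representation changes, so the reconstructed bit pattern matches the BF16 weights bit for bit, which is what makes the scheme lossless.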

The optimal pipeline depends on the situation. Small batch sizes may favor simpler pipelines with lower overhead, while large batch sizes benefit from lighter preprocessing that frees up memory bandwidth. Different weight matrices within the same layer can also favor different pipelines.

Autotuning for Peak Performance

To navigate the complex configuration space—including pipeline choice, custom matrix multiplication kernel variants, and resource allocation between decompression and computation—Unweight employs an autotuner. This system measures actual end-to-end inference throughput on target hardware, dynamically selecting the most efficient strategy for each weight matrix and batch size combination.
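In outline, the autotuner is a measure-everything loop. This sketch is schematic, with hypothetical callables standing in for Cloudflare's actual interfaces:

```python
import time

def autotune(weight_matrix, batch_sizes, pipelines, run_inference, trials=5):
    """Pick the fastest pipeline per (weight matrix, batch size) pair by
    timing real inference rather than modeling it. `pipelines` maps a name
    to a configured strategy and `run_inference` executes a forward pass
    with it; both are hypothetical stand-ins for the engine's hooks."""
    best = {}
    for bs in batch_sizes:
        timings = {}
        for name, pipeline in pipelines.items():
            start = time.perf_counter()
            for _ in range(trials):
                run_inference(weight_matrix, bs, pipeline)
            timings[name] = (time.perf_counter() - start) / trials
        best[bs] = min(timings, key=timings.get)   # lowest mean latency wins
    return best  # e.g. {1: 'direct_palette', 64: 'palette_transcode'}
```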

Three of the four pipelines utilize a custom reconstructive matrix multiplication kernel. This kernel fuses decompression with computation, loading compressed data from HBM, reconstructing BF16 values in fast shared memory, and feeding them directly to tensor cores without a main memory round-trip. Producer thread groups load compressed data, while consumer groups reconstruct BF16 values and execute tensor core instructions.
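The kernel itself is CUDA, but its data flow can be emulated on the CPU: each tile of compressed weights is decoded into a small scratch buffer (standing in for shared memory) and consumed by the matrix multiply immediately, so the full BF16 matrix never lands in main memory. A sketch under those assumptions:

```python
import numpy as np

def fused_reconstruct_matmul(acts, compressed_tiles, decode_tile, tile_rows=128):
    """CPU emulation of the fused kernel's producer/consumer split.
    Producer role: fetch one tile's compressed bytes; consumer role:
    reconstruct BF16 values into a scratch buffer and multiply at once.
    `decode_tile` is a hypothetical decoder returning a (tile_rows, k) tile."""
    out_cols = tile_rows * len(compressed_tiles)
    out = np.zeros((acts.shape[0], out_cols), dtype=np.float32)
    for t, blob in enumerate(compressed_tiles):
        tile = decode_tile(blob)                   # "shared memory" scratch tile
        out[:, t * tile_rows:(t + 1) * tile_rows] = acts @ tile.T
    return out                                     # tile is dropped, never stored
```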

The system also leverages pipelining across transformer layers. By classifying layers as 'hard' (requiring Huffman decoding) or 'easy' (using pre-transcoded data), Cloudflare can perform decompression on separate CUDA streams during the computation of easy layers. This ensures that preprocessed weights are ready when needed for hard layers, effectively hiding decompression latency.
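A hedged sketch of that schedule, using a worker thread in place of a second CUDA stream; the layer attributes and helpers are illustrative, not the engine's API:

```python
from concurrent.futures import ThreadPoolExecutor

def forward(layers, x, decode_weights, run_layer):
    """Overlap Huffman decoding of 'hard' layers with computation of
    'easy' ones. `layer.hard`, `layer.weights`, `decode_weights`, and
    `run_layer` are hypothetical stand-ins for the real engine."""
    with ThreadPoolExecutor(max_workers=1) as decoder:   # the "decode stream"
        pending = {}                                     # layer index -> future
        for i, layer in enumerate(layers):
            nxt = i + 1                                  # prefetch the next hard layer
            if nxt < len(layers) and layers[nxt].hard:
                pending[nxt] = decoder.submit(decode_weights, layers[nxt])
            if i in pending:                             # decoded in the background
                w = pending.pop(i).result()
            elif layer.hard:                             # first layer: decode inline
                w = decode_weights(layer)
            else:                                        # easy: pre-transcoded weights
                w = layer.weights
            x = run_layer(layer, w, x)                   # the "compute stream"
    return x
```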

This work is detailed further in a technical paper, and the GPU kernels are open-sourced, promoting greater transparency and innovation in LLM compression techniques.
