Cloudflare has engineered a novel approach to a critical bottleneck in large language model (LLM) inference: moving model weights through GPU memory. Dubbed 'Unweight', the system shrinks models by 15-22% using lossless compression, without compromising output quality. The reduction matters for running inference efficiently on high-performance hardware like NVIDIA H100 GPUs, where memory bandwidth, not compute power, often limits performance.
Generating a single token from an LLM requires reading every model weight. On H100s, tensor cores can consume data nearly 600x faster than memory can deliver it. Cloudflare's Unweight system addresses this imbalance by keeping weights compressed in main memory and decompressing them directly into fast on-chip shared memory, where the tensor cores consume them; the full-size weights never round-trip through slow main memory. Because only compressed bytes traverse the memory bus, inference becomes faster and cheaper.
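The mechanism is easiest to see in kernel form. Below is a minimal CUDA sketch of the general shape, assuming a hypothetical fixed-size tile format (a shared exponent plus one byte per weight). The CompressedTile layout and decompress_weight routine are illustrative stand-ins, not Cloudflare's actual codec, and the plain multiply-accumulate stands in for tensor-core MMAs.

```cuda
#include <cuda_fp16.h>
#include <cstdint>

// Hypothetical fixed-size compressed tile: each 128-weight tile carries a
// shared 8-bit exponent plus one byte per weight. This is an illustrative
// stand-in for whatever entropy-coded format Unweight actually uses.
constexpr int TILE = 128;

struct CompressedTile {
    uint8_t shared_exponent;   // scale shared by the whole tile
    uint8_t mantissas[TILE];   // per-weight payload
};

// Illustrative per-thread decompression of one weight; a real codec would be
// a block-wide cooperative routine, but the data flow is the same.
__device__ __half decompress_weight(const CompressedTile& t, int i) {
    float scale = exp2f((int)t.shared_exponent - 127);
    return __float2half((float)t.mantissas[i] * scale);
}

// One block handles one tile: decompress into shared memory, then feed the
// weights straight to the math. `out` must be zeroed by the caller.
__global__ void decompress_and_multiply(const CompressedTile* tiles,
                                        const __half* activations,
                                        float* out, int num_tiles) {
    __shared__ __half w[TILE];   // decompressed weights live on-chip only
    int tile = blockIdx.x;
    if (tile >= num_tiles) return;

    // Decompress straight into shared memory; full-precision weights never
    // touch global memory, so only compressed bytes cross the memory bus.
    for (int i = threadIdx.x; i < TILE; i += blockDim.x)
        w[i] = decompress_weight(tiles[tile], i);
    __syncthreads();

    // Consume the weights immediately from shared memory.
    float acc = 0.0f;
    for (int i = threadIdx.x; i < TILE; i += blockDim.x)
        acc += __half2float(w[i]) * __half2float(activations[tile * TILE + i]);
    atomicAdd(&out[tile], acc);  // each thread contributes its partial sum
}
```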
The Compression Challenge
While techniques like quantization can reduce model size, they are lossy and can degrade response quality. Cloudflare instead sought a lossless method that preserves exact model behavior: decompressed weights must be bit-identical to the originals. Existing lossless approaches typically target storage or rely on specialized hardware, and neither fits Cloudflare's requirement of inference-time decompression on Hopper GPUs, integrated with its Rust-based inference engine.
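The article does not detail Unweight's codec, but one standard route to lossless savings on floating-point weights is to separate low-entropy bits from high-entropy ones before entropy coding. The host-side sketch below illustrates the idea for BF16 weights: the sign/exponent plane of a trained network is heavily skewed and compresses well under any off-the-shelf entropy coder, while the transform itself is exactly invertible. The byte-plane split and function names are illustrative assumptions, not Cloudflare's published format.

```cuda
#include <cstdint>
#include <vector>

// Split BF16 weights into byte planes before entropy coding. The high byte
// (sign + most of the exponent) clusters around a few values in trained
// networks, so it compresses well; the mantissa plane is near-incompressible.
// The split loses no bits, so the round trip is lossless.
void split_bf16_planes(const std::vector<uint16_t>& weights,
                       std::vector<uint8_t>& hi,   // sign/exponent plane
                       std::vector<uint8_t>& lo) { // mantissa plane
    hi.reserve(weights.size());
    lo.reserve(weights.size());
    for (uint16_t w : weights) {
        hi.push_back((uint8_t)(w >> 8));
        lo.push_back((uint8_t)(w & 0xFF));
    }
}

// Inverse transform: reassemble the exact original bit patterns.
void merge_bf16_planes(const std::vector<uint8_t>& hi,
                       const std::vector<uint8_t>& lo,
                       std::vector<uint16_t>& weights) {
    weights.clear();
    for (size_t i = 0; i < hi.size(); ++i)
        weights.push_back((uint16_t)(((uint16_t)hi[i] << 8) | lo[i]));
}
```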
The core challenge lies not just in compressing LLM weights, but in decompressing them fast enough to avoid slowing down inference. On Hopper GPUs, shared memory constraints prevent the compute units from running decompression and matrix-multiplication kernels side by side, so any overlap has to come from pipelining within the kernel itself. Any decompression latency not hidden behind computation adds directly to token latency.
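One way to hide the copy latency under those constraints, sketched below, is double buffering inside a single fused kernel: asynchronously stage the next compressed tile into shared memory (here via cooperative_groups::memcpy_async) while the current tile is decompressed and consumed. The tile size and the stand-in byte-summing loop are illustrative assumptions; a real kernel would run the codec and tensor-core MMAs in that slot.

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
#include <cstdint>

namespace cg = cooperative_groups;

constexpr int TILE_BYTES = 4096;  // per-tile compressed size (illustrative)

// Double-buffered pipeline in one fused kernel: stage compressed tile i+1
// into shared memory asynchronously while tile i is decompressed and
// consumed, so the copy latency hides behind compute. `out` must be zeroed
// by the caller.
__global__ void fused_pipeline(const uint8_t* compressed, int num_tiles,
                               float* out) {
    cg::thread_block block = cg::this_thread_block();
    __shared__ uint8_t stage[2][TILE_BYTES];
    float acc = 0.0f;

    // Prime the pipeline with tile 0.
    cg::memcpy_async(block, stage[0], compressed, TILE_BYTES);

    for (int i = 0; i < num_tiles; ++i) {
        cg::wait(block);          // tile i is now resident in shared memory
        if (i + 1 < num_tiles)    // kick off the background copy of tile i+1
            cg::memcpy_async(block, stage[(i + 1) % 2],
                             compressed + (size_t)(i + 1) * TILE_BYTES,
                             TILE_BYTES);
        // Stand-in for decompression plus tensor-core math on the resident
        // tile; this work overlaps with the copy issued just above.
        for (int j = block.thread_rank(); j < TILE_BYTES; j += block.size())
            acc += (float)stage[i % 2][j];
        block.sync();  // this buffer gets recycled two iterations later
    }
    atomicAdd(out, acc);  // each thread contributes its partial sum
}
```

With this structure, the only decompression cost that reaches token latency is whatever fails to fit under the compute of the neighboring tile, which is the overlap requirement the paragraph above describes.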
