Cloudflare has built a system, dubbed 'Unweight', that targets a critical bottleneck in large language model (LLM) inference: moving model weights through GPU memory. Unweight applies lossless compression to shrink model size by 15-22% without compromising output quality. This matters most on high-performance hardware like NVIDIA H100 GPUs, where memory bandwidth, not compute power, often limits inference performance.
Generating a single token from an LLM requires reading every model weight. On H100s, tensor cores can consume data nearly 600x faster than memory can deliver it. Unweight addresses this by decompressing weights directly into fast on-chip shared memory and feeding them straight to the tensor cores, so the decompressed weights never make a round trip through slow main memory. Because only the compressed bytes traverse the memory bus, inference becomes faster and cheaper.
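To make the bandwidth argument concrete, here is a rough back-of-the-envelope sketch in Python. It assumes inference is purely memory-bound, so the time per token can be no lower than the weight bytes divided by memory bandwidth. The model size and bandwidth figures are illustrative assumptions, not numbers from Cloudflare, and the function names are hypothetical.

```python
def min_token_time_ms(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Lower bound on per-token latency if every weight byte is read once per token."""
    return weight_bytes / bandwidth_bytes_per_s * 1e3


def bandwidth_bound_speedup(size_reduction: float) -> float:
    """Ideal speedup when the weight bytes shrink by `size_reduction` (0.15 means 15%)."""
    return 1.0 / (1.0 - size_reduction)


# Assumed, illustrative values (not taken from the article):
WEIGHT_BYTES = 16e9   # hypothetical 8B-parameter model at 2 bytes per weight
BANDWIDTH = 3.35e12   # assumed GPU memory bandwidth, in bytes per second

baseline_ms = min_token_time_ms(WEIGHT_BYTES, BANDWIDTH)
print(f"baseline: {baseline_ms:.2f} ms per token")

# The article's reported range of size reductions.
for reduction in (0.15, 0.22):
    compressed_ms = min_token_time_ms(WEIGHT_BYTES * (1 - reduction), BANDWIDTH)
    print(
        f"{reduction:.0%} smaller: {compressed_ms:.2f} ms per token, "
        f"ideal speedup {bandwidth_bound_speedup(reduction):.2f}x"
    )
```

This is an upper bound on the benefit, not a prediction: it ignores decompression cost, KV-cache and activation traffic, and batching, all of which would narrow the gain in practice.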
