Making large language models (LLMs) faster and lighter often runs into a hardware paradox: doing less computation can actually slow them down. This is because GPUs are optimized for dense, predictable blocks of data, while the natural sparsity in LLMs produces irregular memory access patterns.
A new collaboration between Sakana AI and NVIDIA aims to resolve this mismatch. Rather than forcing LLMs to adapt to GPU limitations, their approach reshapes the sparsity itself to fit the hardware. The work, detailed in a recent paper, introduces novel open-source GPU kernels and data formats for optimizing sparse transformer language models.
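To make the core idea concrete, here is a minimal, generic sketch of hardware-friendly structured sparsity: instead of storing scattered individual nonzeros (which forces irregular memory access), the matrix is partitioned into fixed-size tiles and only the nonzero tiles are kept as small dense blocks that hardware can process with contiguous, predictable reads. This is purely illustrative; the tile size, storage layout, and all function names here are assumptions, not the specific format or kernels from the Sakana AI/NVIDIA paper.

```python
import numpy as np

def to_block_sparse(m, block=4):
    # Partition the matrix into block x block tiles and keep only the
    # tiles that contain at least one nonzero. (Illustrative scheme,
    # not the paper's actual data format.)
    rows, cols = m.shape
    assert rows % block == 0 and cols % block == 0
    blocks = {}
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            tile = m[i:i + block, j:j + block]
            if np.any(tile):
                blocks[(i, j)] = tile.copy()
    return blocks

def block_sparse_matvec(blocks, x, out_dim, block=4):
    # Multiply by a vector while touching only the stored dense tiles:
    # each tile is a contiguous chunk, so the access pattern stays
    # regular even though most of the matrix is skipped.
    y = np.zeros(out_dim)
    for (i, j), tile in blocks.items():
        y[i:i + block] += tile @ x[j:j + block]
    return y

# Example: an 8x8 matrix whose nonzeros cluster into two 4x4 tiles.
m = np.zeros((8, 8))
m[0:4, 0:4] = np.arange(16.0).reshape(4, 4)
m[4:8, 4:8] = np.eye(4)

blocks = to_block_sparse(m)
x = np.ones(8)
y = block_sparse_matvec(blocks, x, m.shape[0])
assert np.allclose(y, m @ x)   # matches the dense computation
print(len(blocks))             # only 2 of the 4 tiles are stored
```

Skipping whole tiles is what lets sparsity translate into real speedups on GPUs: the skipped work is removed in large, aligned chunks rather than one scattered element at a time.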
