Making large language models (LLMs) faster and lighter often runs into a hardware paradox: doing less computation can actually slow them down. This is because GPUs are optimized for dense, predictable blocks of data, while the natural sparsity in LLMs produces irregular memory access patterns.
A new collaboration between Sakana AI and NVIDIA aims to resolve this mismatch. Rather than forcing LLMs to adapt to GPU limitations, their approach reshapes the sparsity itself to fit the hardware. The work, detailed in a recent paper, introduces novel open-source GPU kernels and data formats for optimizing sparse transformer language models.
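To make the core idea concrete, here is a minimal, generic sketch of hardware-friendly structured sparsity: instead of storing scattered individual nonzeros (which forces irregular memory access), the matrix is partitioned into fixed-size tiles and only the nonzero tiles are kept as small dense blocks that hardware can process with contiguous, predictable reads. This is purely illustrative; the tile size, storage layout, and all function names here are assumptions, not the specific format or kernels from the Sakana AI/NVIDIA paper.

```python
import numpy as np

def to_block_sparse(m, block=4):
    # Partition the matrix into block x block tiles and keep only the
    # tiles that contain at least one nonzero. (Illustrative scheme,
    # not the paper's actual data format.)
    rows, cols = m.shape
    assert rows % block == 0 and cols % block == 0
    blocks = {}
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            tile = m[i:i + block, j:j + block]
            if np.any(tile):
                blocks[(i, j)] = tile.copy()
    return blocks

def block_sparse_matvec(blocks, x, out_dim, block=4):
    # Multiply by a vector while touching only the stored dense tiles:
    # each tile is a contiguous chunk, so the access pattern stays
    # regular even though most of the matrix is skipped.
    y = np.zeros(out_dim)
    for (i, j), tile in blocks.items():
        y[i:i + block] += tile @ x[j:j + block]
    return y

# Example: an 8x8 matrix whose nonzeros cluster into two 4x4 tiles.
m = np.zeros((8, 8))
m[0:4, 0:4] = np.arange(16.0).reshape(4, 4)
m[4:8, 4:8] = np.eye(4)

blocks = to_block_sparse(m)
x = np.ones(8)
y = block_sparse_matvec(blocks, x, m.shape[0])
assert np.allclose(y, m @ x)   # matches the dense computation
print(len(blocks))             # only 2 of the 4 tiles are stored
```

Skipping whole tiles is what lets sparsity translate into real speedups on GPUs: the skipped work is removed in large, aligned chunks rather than one scattered element at a time.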
