Demand for ultra-long context in frontier LLMs, driven by agentic workflows, repository-scale code reasoning, and persistent memory, runs into the quadratic cost of standard softmax attention. At contexts stretching into the millions of tokens, that cost makes deployment at scale impractical.
Block-wise Sparsity Meets Grouped Query Attention
To surmount this challenge, the researchers introduce MiniMax Sparse Attention (MSA), a novel blockwise sparse attention mechanism built upon Grouped Query Attention (GQA). MSA employs a lightweight Index Branch to score and select a Top-k subset of key-value blocks for each GQA group, enabling group-specific sparse retrieval. The Main Branch then executes exact block-sparse attention exclusively over these selected blocks. This design prioritizes simplicity and scalability, facilitating efficient deployment across diverse GPU architectures.
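The two-branch flow described above can be illustrated with a minimal pure-Python sketch for a single GQA group at decode time (one query token per head). It is not the authors' implementation: the function names, the mean-pooled block summaries, and the mean-pooled group query used for scoring are all assumptions made for illustration, since the article does not say how the Index Branch computes its scores.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_vec(vecs):
    # Element-wise mean of a list of equal-length vectors.
    n = len(vecs)
    return [sum(col) / n for col in zip(*vecs)]

def select_blocks(group_queries, keys, block_size, top_k):
    """Index branch (assumed form): score every KV block, keep the Top-k."""
    n_blocks = len(keys) // block_size
    # Assumption: one index query per GQA group, the mean of its heads' queries.
    q_index = mean_vec(group_queries)
    scores = []
    for b in range(n_blocks):
        # Assumption: each block is summarized by the mean of its keys.
        k_summary = mean_vec(keys[b * block_size:(b + 1) * block_size])
        scores.append(dot(q_index, k_summary))
    ranked = sorted(range(n_blocks), key=lambda b: scores[b], reverse=True)
    return sorted(ranked[:top_k])  # selected block ids, in position order

def attend(query, keys, values):
    """Exact softmax attention of one query over the given keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, k) * scale for k in keys]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

def msa_decode_group(group_queries, keys, values, block_size, top_k):
    """Main branch: exact attention restricted to the selected blocks.

    All query heads in the group share the same selected KV blocks.
    Assumes len(keys) is a multiple of block_size.
    """
    blocks = select_blocks(group_queries, keys, block_size, top_k)
    positions = [b * block_size + i for b in blocks for i in range(block_size)]
    k_sel = [keys[p] for p in positions]
    v_sel = [values[p] for p in positions]
    outputs = [attend(q, k_sel, v_sel) for q in group_queries]
    return outputs, blocks
```

In this sketch each head attends to `top_k * block_size` keys instead of the full context, which is where the compute saving comes from. Setting `top_k` to the total number of blocks recovers ordinary dense GQA attention for the group.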
Optimized for GPU Execution and Practical Speedups
Translating theoretical sparsity into real performance gains required a co-designed GPU execution path. MSA uses exp-free Top-k selection and KV-outer sparse attention to improve tensor-core utilization under block-granular access. On a 109B-parameter multimodal model, MSA achieves performance parity with GQA while cutting per-token attention compute by 28.4x at a 1-million-token context. When paired with its optimized kernel, MSA also delivers wall-clock speedups of 14.2x for prefill and 7.6x for decoding on H800 hardware.
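The article does not define "exp-free" Top-k selection. One plausible reading, assumed here, is that block selection can skip the exponentials of a softmax: softmax is strictly increasing in each score, so ranking raw scores selects the same blocks as ranking softmax probabilities. The short sketch below checks that property on made-up block scores; the helper names are hypothetical.

```python
import math

def top_k_indices(scores, k):
    # Indices of the k largest scores, returned in position order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative block scores, not taken from the paper.
block_scores = [0.3, 2.1, -1.0, 1.7, 0.9, 2.0]

print(top_k_indices(block_scores, 3))           # [1, 3, 5]
print(top_k_indices(softmax(block_scores), 3))  # [1, 3, 5]
```

Because both rankings agree, a selection step only needs comparisons on the raw scores, and the exponentials can be reserved for the exact attention over the chosen blocks.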