Demand for ultra-long context in frontier LLMs, driven by agentic workflows, repository-scale code reasoning, and persistent memory, runs up against the cost of standard softmax attention, which grows quadratically with sequence length. At contexts stretching into the millions of tokens, that cost makes such models impractical to deploy at scale.
Block-wise Sparsity Meets Grouped Query Attention
To address this, the researchers introduce MiniMax Sparse Attention (MSA), a blockwise sparse attention mechanism built on Grouped Query Attention (GQA). In GQA, the query heads in a group share one set of keys and values, so MSA makes its selection once per group: a lightweight Index Branch scores the key-value blocks and picks a Top-k subset for each GQA group. The Main Branch then runs exact block-sparse attention over only the selected blocks. The design favors simplicity and scalability, with the aim of efficient deployment across diverse GPU architectures.
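The two-branch flow can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `msa_sketch`, the low-dimensional index projections `q_idx` and `k_idx`, and the choice to summarize each block by mean-pooling its index keys are all assumptions made for illustration, since the text does not specify how the Index Branch computes its scores. Causal masking is omitted, and the sketch masks a dense score matrix for clarity, whereas an efficient kernel would gather only the selected blocks.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) get weight 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def msa_sketch(q, k, v, q_idx, k_idx, block_size, top_k):
    """Illustrative blockwise sparse attention over GQA groups.

    q:     (G, Hg, T, d)   queries, Hg query heads per group
    k, v:  (G, S, d)       one shared key/value head per group
    q_idx: (G, T, d_idx)   hypothetical Index Branch queries
    k_idx: (G, S, d_idx)   hypothetical Index Branch keys
    """
    G, Hg, T, d = q.shape
    S = k.shape[1]
    assert S % block_size == 0, "sketch assumes S is a multiple of block_size"
    n_blocks = S // block_size

    # Index Branch (assumed form): mean-pool index keys within each block,
    # then score every block for every query position in the group.
    k_blk = k_idx.reshape(G, n_blocks, block_size, -1).mean(axis=2)
    scores = np.einsum("gtd,gbd->gtb", q_idx, k_blk)      # (G, T, n_blocks)

    # Top-k block ids per (group, query position).
    sel = np.argsort(-scores, axis=-1)[..., :top_k]       # (G, T, top_k)

    # Expand the block selection into a token-level mask.
    blk_mask = np.zeros((G, T, n_blocks), dtype=bool)
    np.put_along_axis(blk_mask, sel, True, axis=-1)
    tok_mask = np.repeat(blk_mask, block_size, axis=-1)   # (G, T, S)

    # Main Branch: exact softmax attention restricted to selected blocks.
    # All Hg query heads in a group share that group's selection.
    logits = np.einsum("ghtd,gsd->ghts", q, k) / np.sqrt(d)
    logits = np.where(tok_mask[:, None], logits, -np.inf)
    out = np.einsum("ghts,gsd->ghtd", softmax(logits), v)
    return out, sel

# Example usage with random inputs.
rng = np.random.default_rng(0)
G, Hg, T, S, d, d_idx, bs = 2, 3, 4, 16, 8, 4, 4
q = rng.standard_normal((G, Hg, T, d))
k = rng.standard_normal((G, S, d))
v = rng.standard_normal((G, S, d))
q_idx = rng.standard_normal((G, T, d_idx))
k_idx = rng.standard_normal((G, S, d_idx))
out, sel = msa_sketch(q, k, v, q_idx, k_idx, block_size=bs, top_k=2)
```

With `top_k` equal to the number of blocks, the sketch reduces to ordinary dense GQA attention, which is a convenient sanity check. With a smaller `top_k`, each query attends to `top_k * block_size` keys rather than all `S`.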