Unlocking Ultra-Long Context for LLMs

The demand for ultra-long context in frontier LLMs, driven by agentic workflows, repository-scale code reasoning, and persistent memory, runs into the quadratic cost of standard softmax attention: compute grows with the square of the sequence length. At contexts of millions of tokens, that cost makes dense-attention models impractical to deploy at scale.
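To make the quadratic growth concrete, here is a small back-of-the-envelope sketch. It counts only the query-key pairs that dense causal attention scores for a single head in a single layer; the function name is illustrative, and constant factors such as head dimension, number of heads, and number of layers are deliberately ignored.

```python
def causal_score_pairs(n_tokens: int) -> int:
    """Query-key pairs scored by dense causal attention.

    Token i attends to tokens 0..i, so the total is 1 + 2 + ... + n.
    This is a per-head, per-layer count; it ignores all constant factors.
    """
    return n_tokens * (n_tokens + 1) // 2


if __name__ == "__main__":
    for n in (1_000, 100_000, 1_000_000):
        print(f"{n:>9} tokens -> {causal_score_pairs(n):,} score pairs")
```

Doubling the context roughly quadruples the count, which is why a context that is comfortable at thousands of tokens becomes very expensive at millions.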

Visual TL;DR. Two pressures, the context limit of standard attention and the demand for long context, motivate MiniMax Sparse Attention. It uses an Index Branch and a Main Branch to break the context barrier and deliver practical speedups.

LLM context limit: quadratic cost of standard softmax attention hinders ultra-long context
Demand for long context: agentic workflows, code reasoning, persistent memory require millions of tokens
MiniMax Sparse Attention: novel blockwise sparse attention mechanism built upon Grouped Query Attention
Index Branch: scores and selects Top-k key-value blocks for each GQA group
Main Branch: executes exact block-sparse attention over selected blocks
Breaks context barrier: enables millions of tokens with significant compute reduction
Practical speedups: optimized for GPU execution and efficient deployment across architectures

Block-wise Sparsity Meets Grouped Query Attention

To surmount this challenge, the researchers introduce MiniMax Sparse Attention (MSA), a novel blockwise sparse attention mechanism built upon Grouped Query Attention (GQA). MSA employs a lightweight Index Branch to score and select a Top-k subset of key-value blocks for each GQA group, enabling group-specific sparse retrieval. The Main Branch then executes exact block-sparse attention exclusively over these selected blocks. This design prioritizes simplicity and scalability, facilitating efficient deployment across diverse GPU architectures.
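The two-branch flow can be sketched in plain Python for a single decoding step of one GQA group. This is a toy illustration under stated assumptions, not the authors' implementation: the article does not say how the Index Branch computes its scores, so the mean-pooled block keys (`mean_pool_blocks`), the single `index_query` per group, and all function names here are hypothetical. Causal masking is omitted because every cached key is already in the past at decode time.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool_blocks(keys, block_size):
    """ASSUMPTION: summarize each KV block by the mean of its keys."""
    blocks = [keys[i:i + block_size] for i in range(0, len(keys), block_size)]
    return [[sum(col) / len(block) for col in zip(*block)] for block in blocks]


def select_blocks(index_query, index_keys, top_k):
    """Index Branch (sketch): one score per block, keep the Top-k block ids."""
    scores = [dot(index_query, k) for k in index_keys]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])


def block_sparse_attention(query, keys, values, block_ids, block_size):
    """Main Branch (sketch): exact softmax attention over selected blocks only."""
    token_ids = [
        t
        for b in block_ids
        for t in range(b * block_size, min((b + 1) * block_size, len(keys)))
    ]
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, keys[t]) * scale for t in token_ids])
    dim = len(values[0])
    return [sum(w * values[t][d] for w, t in zip(weights, token_ids)) for d in range(dim)]


def msa_group_step(group_queries, index_query, keys, values, top_k, block_size):
    """All query heads in one GQA group share the same selected blocks."""
    index_keys = mean_pool_blocks(keys, block_size)
    block_ids = select_blocks(index_query, index_keys, top_k)
    outputs = [
        block_sparse_attention(q, keys, values, block_ids, block_size)
        for q in group_queries
    ]
    return block_ids, outputs
```

The point of the sketch is the division of labor: selection costs one score per block rather than one per token, and the attention over the chosen blocks is exact, with no approximation of the softmax inside them.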

Optimized for GPU Execution and Practical Speedups

Translating theoretical sparsity into tangible performance gains required a co-designed GPU execution path. MSA uses exp-free Top-k selection and KV-outer sparse attention to improve tensor-core utilization under block-granular access. On a 109B-parameter multimodal model, MSA achieves performance parity with GQA while cutting per-token attention compute by 28.4x at a 1-million-token context. When paired with its optimized kernel, MSA also delivers wall-clock speedups of 14.2x for prefill and 7.6x for decoding on H800 hardware.
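The article does not spell out what "exp-free" means at the kernel level. One plausible reading, offered here as an assumption rather than a description of the actual kernel, is that Top-k selection can skip the exponential entirely: softmax is strictly increasing in each logit, so ranking raw scores picks the same blocks as ranking softmax probabilities. The helper names below are illustrative.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_indices(scores, k):
    """Indices of the k largest scores, returned in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    block_logits = [2.0, -1.0, 0.5, 3.0, 0.0]  # made-up per-block scores
    with_exp = top_k_indices(softmax(block_logits), 2)
    without_exp = top_k_indices(block_logits, 2)
    print(with_exp, without_exp)  # the two selections match
```

Because only the ordering matters for selection, the exponentials and the normalization can be left to the Main Branch, which needs them anyway for the exact attention over the selected blocks.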

AI Daily Digest