Together AI is positioning itself as the go-to platform for demanding large language models, announcing its role as the preferred cloud partner for MiniMax's latest M3 model. The company has detailed significant engineering breakthroughs enabling efficient MiniMax M3 inference, unlocking the model's ambitious 1 million token context window and native multimodal capabilities.
This collaboration highlights Together AI's commitment to tackling complex systems challenges for cutting-edge AI. MiniMax M3, designed for advanced coding, agentic workflows, and multimodal reasoning, presents unique serving demands, particularly with its extended context length and rich input processing requirements.
Engineering for Extreme Context and Multimodality
Central to serving MiniMax M3 efficiently is its novel MiniMax Sparse Attention (MSA) mechanism. MSA reduces the computational burden of long contexts by limiting the tokens each query attends to, so attention cost no longer scales quadratically with context length as it does in dense attention. To serve this architecture, Together AI's team developed a KV-Block-Major sparse attention kernel, which reorganizes data flow to improve arithmetic intensity.
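The general idea of restricting each query to a subset of tokens can be illustrated with a toy block-sparse attention routine in plain Python. This is only a sketch under stated assumptions: the announcement does not describe how MSA selects tokens, so the selection rule here (score each block of keys by its mean key, keep the top-scoring blocks) is a hypothetical stand-in, and the names `block_sparse_attention`, `block_size`, and `top_k_blocks` are invented for illustration. It is not Together AI's kernel.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def block_sparse_attention(query, keys, values, block_size, top_k_blocks):
    """Attend one query to only the top-k key blocks.

    ASSUMPTION: blocks are ranked by the query's score against the mean key
    of each block. This is an illustrative rule, not MSA's actual one.
    """
    n, dim = len(keys), len(query)
    blocks = [range(s, min(s + block_size, n)) for s in range(0, n, block_size)]

    # One cheap summary vector per block: the mean of its keys.
    summaries = [
        [sum(keys[i][d] for i in blk) / len(blk) for d in range(dim)]
        for blk in blocks
    ]

    # Rank blocks by summary score and keep the best top_k_blocks.
    ranked = sorted(
        range(len(blocks)), key=lambda b: dot(query, summaries[b]), reverse=True
    )
    chosen = sorted(ranked[:top_k_blocks])
    token_ids = [i for b in chosen for i in blocks[b]]

    # Ordinary scaled softmax attention, restricted to the selected tokens.
    scale = 1.0 / math.sqrt(dim)
    scores = [dot(query, keys[i]) * scale for i in token_ids]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    out_dim = len(values[0])
    output = [
        sum(w * values[i][d] for w, i in zip(weights, token_ids)) / total
        for d in range(out_dim)
    ]
    return output, token_ids
```

In this sketch each query touches roughly n / block_size summaries plus top_k_blocks × block_size tokens instead of all n tokens, which is the kind of saving sparse attention aims for. Selecting whole blocks rather than scattered tokens also keeps memory reads contiguous, which is one plausible reason a block-oriented kernel layout helps.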
Further enhancing long-context handling, Together AI integrated MSA with paged attention. This allows for dynamic KV cache management, crucial for variable request lengths, and reportedly yielded a 5% boost in decode throughput.
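Paged attention in general stores each request's KV cache in fixed-size pages and tracks them through a per-request block table, so memory is allocated as a sequence grows rather than reserved up front. The toy allocator below sketches that bookkeeping only. It is an illustration under assumptions, not Together AI's implementation: the class `PagedKVCache`, its methods, and the page sizes are hypothetical, and real systems keep tensors in GPU memory rather than Python lists.

```python
class PagedKVCache:
    """Toy paged KV cache: fixed-size pages plus a block table per request."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))  # physical pages not in use
        self.pages = [[None] * page_size for _ in range(num_pages)]
        self.block_tables = {}  # request id -> list of physical page ids
        self.lengths = {}       # request id -> number of cached tokens

    def append(self, request_id, kv):
        """Store the KV entry for the next token of a request."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length % self.page_size == 0:
            # The current page is full (or this is the first token),
            # so take a new page from the free list.
            if not self.free_pages:
                raise MemoryError("KV cache is out of pages")
            table.append(self.free_pages.pop())
        page = table[length // self.page_size]
        self.pages[page][length % self.page_size] = kv
        self.lengths[request_id] = length + 1

    def get(self, request_id, position):
        """Translate a logical token position to its physical slot."""
        if not 0 <= position < self.lengths.get(request_id, 0):
            raise IndexError("position is outside the cached sequence")
        page = self.block_tables[request_id][position // self.page_size]
        return self.pages[page][position % self.page_size]

    def release(self, request_id):
        """Return all pages of a finished request to the free list."""
        self.free_pages.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)
```

Because pages are handed out on demand and returned when a request finishes, short and long requests can share one pool without each reserving space for the maximum context length. That matters most when the maximum is as large as 1 million tokens.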
