Together AI Masters MiniMax M3 Inference

Together AI details engineering feats enabling efficient MiniMax M3 inference, unlocking 1M-token context and multimodality.

Jun 2 at 8:02 PM · 6 min read

Together AI partners with MiniMax for efficient M3 model inference. · Together AI

Visual TL;DR: MiniMax M3 Demands lead to Extreme Context and Native Multimodality. Both are addressed by MiniMax Sparse Attention, which is supported by the Together AI Platform. MiniMax Sparse Attention and the Together AI Platform together enable Efficient Inference, which leads to Advanced AI Unlocked.

MiniMax M3 Demands: advanced coding, agentic workflows, multimodal reasoning needs
Extreme Context: unlocking 1 million token context window
Native Multimodality: rich input processing requirements for diverse data
MiniMax Sparse Attention: novel mechanism reducing computational burden of long contexts
Together AI Platform: preferred cloud partner for MiniMax M3
Efficient Inference: solving complex systems challenges for cutting-edge AI
Advanced AI Unlocked: powering demanding large language models


Together AI is positioning itself as the go-to platform for demanding large language models, announcing its role as the preferred cloud partner for MiniMax's latest model, M3. The company has detailed the engineering work behind efficient MiniMax M3 inference, which unlocks the model's 1-million-token context window and native multimodal capabilities.

This collaboration highlights Together AI's commitment to tackling complex systems challenges for cutting-edge AI. MiniMax M3, designed for advanced coding, agentic workflows, and multimodal reasoning, presents unique serving demands, particularly with its extended context length and rich input processing requirements.

Engineering for Extreme Context and Multimodality

The core serving challenge for MiniMax M3 centers on its novel MiniMax Sparse Attention (MSA) mechanism. MSA reduces the computational burden of long contexts by limiting the tokens each query attends to, a departure from the quadratic scaling of dense attention. To serve it efficiently, Together AI's team developed a KV-Block-Major sparse attention kernel that improves arithmetic intensity by reorganizing data flow.
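To make the idea of limiting the tokens each query attends to concrete, here is a minimal Python sketch of block-sparse attention for a single query. It is illustrative only: the article does not describe how MSA selects tokens, so the block-selection rule (scoring each block by the mean of its keys), the function name `sparse_attention`, and the parameters `block_size` and `top_k_blocks` are all assumptions, and this is not Together AI's kernel.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def sparse_attention(query, keys, values, block_size=4, top_k_blocks=2):
    """Attend one query to the tokens of its top-k KV blocks only.

    Assumption: blocks are ranked by the query's dot product with the mean
    key of each block. This rule is hypothetical, chosen for illustration.
    """
    # Group token indices into contiguous blocks.
    blocks = [list(range(start, min(start + block_size, len(keys))))
              for start in range(0, len(keys), block_size)]

    def block_summary(block):
        # One cheap summary vector per block: the mean of its keys.
        dim = len(keys[0])
        return [sum(keys[i][d] for i in block) / len(block) for d in range(dim)]

    # Keep only the best-scoring blocks; all other tokens are skipped.
    ranked = sorted(blocks, key=lambda b: dot(query, block_summary(b)), reverse=True)
    selected = sorted(i for block in ranked[:top_k_blocks] for i in block)

    # Scaled softmax over the selected tokens only.
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in selected]
    peak = max(logits)
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)

    # Weighted sum of the selected value vectors.
    output = [sum(w * values[i][d] for w, i in zip(weights, selected)) / total
              for d in range(len(values[0]))]
    return output, selected
```

With a fixed number of selected blocks, the per-query cost depends on `top_k_blocks * block_size` plus the number of block summaries, rather than on every token in the context.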

Further enhancing long-context handling, Together AI integrated MSA with paged attention. This allows for dynamic KV cache management, crucial for variable request lengths, and reportedly yielded a 5% boost in decode throughput.
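The bookkeeping behind paged attention can be sketched in a few lines of Python. This is a simplified illustration of the general technique, not Together AI's implementation: the class name `PagedKVCache`, its methods, and the page sizes are hypothetical, and a real system stores key and value tensors in GPU memory rather than tracking indices alone.

```python
class PagedKVCache:
    """Tracks which fixed-size pages hold each request's KV entries.

    Hypothetical sketch: only page bookkeeping is modeled, not tensor storage.
    """

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.page_tables = {}  # request id -> list of physical page ids
        self.lengths = {}      # request id -> number of tokens stored

    def append_token(self, request_id):
        """Reserve a slot for one new token and return (page id, offset)."""
        table = self.page_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        # Allocate a new page only when the current one is full (or absent).
        if length % self.page_size == 0:
            if not self.free_pages:
                raise MemoryError("KV cache is out of pages")
            table.append(self.free_pages.pop())
        self.lengths[request_id] = length + 1
        return table[-1], length % self.page_size

    def release(self, request_id):
        """Return all of a finished request's pages to the free pool."""
        self.free_pages.extend(self.page_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)
```

Because pages are allocated one at a time as a request grows, short and long requests can share the same pool without reserving memory for the maximum context length up front.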

The model's multimodal capabilities necessitated a dedicated preprocessing pipeline. A new Rust-based Serving Model Gateway (SMG) now handles image and video decoding, resizing, and patching on the CPU. This keeps preprocessing work off the GPU, so the inference engine can focus on generation.
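As a rough picture of the patching step, the Python sketch below splits a decoded image into non-overlapping square patches. It is an illustration of the general technique only: the actual gateway is written in Rust, and the function name `patchify`, the patch size, and the row-major pixel layout are assumptions not taken from the article.

```python
def patchify(image, patch_size):
    """Split a 2D image (a list of pixel rows) into flattened square patches.

    Hypothetical helper: patches are emitted left to right, top to bottom,
    and each patch is flattened in row-major order.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be multiples of patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Collect the pixels of one patch, row by row.
            patch = [image[row][col]
                     for row in range(top, top + patch_size)
                     for col in range(left, left + patch_size)]
            patches.append(patch)
    return patches
```

For example, a 4x4 image with a patch size of 2 yields four patches of four pixels each. Resizing beforehand is what guarantees the dimensions divide evenly.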

These optimizations collectively resulted in performance improvements of 81% to 125% across various concurrency levels for agentic-style workloads, according to Together AI's internal benchmarks.

Together AI will host the open-weights MiniMax M3 model as a developer endpoint upon its public release.

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.
