Mamba 2 JAX: Hardware Agnostic SSMs

Mamba 2 JAX breaks the hardware dependency of state-space models, achieving high performance on CPU, GPU, and TPU through XLA compilation, with no custom kernels.

Mar 11 at 8:01 PM · 2 min read
Figure: Mapping Mamba 2's state-space duality algorithm onto XLA's fusion and tiling passes for cross-platform deployment.

The proliferation of state-space models (SSMs) has been tethered to proprietary, hardware-specific acceleration kernels, primarily for NVIDIA GPUs. This dependency creates a significant barrier to broader adoption and experimentation across different hardware ecosystems. Cosmo Santoni's work challenges this paradigm, demonstrating that the architectural nuances of Mamba 2 map effectively onto standard compiler optimizations.

XLA Unlocks SSM Performance Without Custom Kernels

The core insight is that Mamba 2's state-space duality, characterized by a diagonal state structure, chunkable recurrence, and einsum-dominated computation with static control flow, aligns closely with what the XLA compiler is designed to optimize. By leveraging XLA's fusion and tiling passes, the researchers implemented the full inference path, including prefill and cached autoregressive decoding, using only standard, statically shaped primitives. This eliminates hand-written CUDA or Triton kernels and makes the architecture performant on any platform with a mature XLA backend, including CPUs, NVIDIA GPUs, and Google Cloud TPUs, all from a single JAX source. Mamba 2 JAX is thus a significant step toward hardware-agnostic model deployment.
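To make the "chunkable recurrence plus einsum" point concrete, here is a minimal sketch of a chunked diagonal linear recurrence in JAX. It is not the project's actual code; the function name and shapes are illustrative, and it assumes strictly positive decays `a` and a sequence length divisible by the chunk size. Within each chunk the recurrence is expressed as a single masked einsum; only the state handoff between chunks uses a sequential scan. Everything is statically shaped, so XLA can fuse and tile it on any backend.

```python
import jax
import jax.numpy as jnp

def chunked_diagonal_scan(a, bx, chunk=4):
    """Compute h_t = a_t * h_{t-1} + bx_t for a diagonal SSM, chunk by chunk.

    a, bx: (T, D) arrays; a must be strictly positive (decay factors).
    T must be divisible by `chunk` (a static, compile-time quantity).
    """
    T, D = a.shape
    a_c = a.reshape(T // chunk, chunk, D)
    bx_c = bx.reshape(T // chunk, chunk, D)
    # cumulative decay within each chunk: prod_{r<=t} a_r
    cum = jnp.cumprod(a_c, axis=1)
    # decay from position s to position t (s <= t): prod_{s<r<=t} a_r
    ratios = cum[:, :, None, :] / cum[:, None, :, :]          # (nc, t, s, D)
    mask = jnp.tril(jnp.ones((chunk, chunk)))[None, :, :, None]
    # intra-chunk contributions collapse into one einsum over source positions
    intra = jnp.einsum("ctsd,csd->ctd", ratios * mask, bx_c)
    # inter-chunk state passing: a short scan over chunk boundaries only
    def step(h, inp):
        cum_c, intra_c = inp
        h_c = cum_c * h[None, :] + intra_c  # carried state decays into the chunk
        return h_c[-1], h_c
    _, hs = jax.lax.scan(step, jnp.zeros(D), (cum, intra))
    return hs.reshape(T, D)

chunked_diagonal_scan = jax.jit(chunked_diagonal_scan, static_argnames="chunk")
```

Because the chunk size is static and the per-chunk work is a dense einsum, the compiled program has no data-dependent control flow, which is exactly the property the article credits for XLA's strong performance here.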

On-Device State Management and Cross-Platform Efficiency

A key achievement of this approach is the realization of the architecture's theoretical O(1) state management as a compiled, on-device cache. This design avoids host synchronization during generation, a critical bottleneck in autoregressive inference. Benchmarks on TPU v6e across five model scales (130M to 2.7B parameters) show that XLA-generated code reaches approximately 140 TFLOPS on single-stream prefill (15% MFU) and up to 64% bandwidth utilization during decoding. Crucially, greedy decoding matches the PyTorch/CUDA reference token-for-token over 64 steps, with hidden-state agreement within float32 rounding tolerance. The pattern transfers to any SSM recurrence that meets the same structural conditions, underscoring the broad applicability of this compilation strategy.
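The O(1) cached decode path can be illustrated with a small sketch (hypothetical names and shapes, not the project's actual API). The recurrent state is a fixed-size device array that is threaded between jit-compiled steps, so generation never grows a cache and never round-trips through the host.

```python
import jax
import jax.numpy as jnp

@jax.jit
def decode_step(state, x_t, a, b, c):
    """One cached autoregressive step of a diagonal SSM.

    state: (D,) recurrent state, fixed size regardless of sequence length
    x_t:   scalar input for this step; a, b, c: (D,) diagonal parameters
    """
    state = a * state + b * x_t   # O(D) update; no growing KV cache
    y_t = jnp.dot(c, state)       # readout for the current token
    return state, y_t

# A generation loop simply carries `state` between calls. Since the state is
# a device array and decode_step is compiled, only the emitted token ever
# needs to cross back to the host; the whole loop could also be expressed as
# a jax.lax.scan to keep even the loop body on device.
```

This constant-size state is what distinguishes SSM decoding from attention decoding, where the cache grows linearly with the generated sequence.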