Cursor's Warp Decode Boosts MoE Inference

Cursor's new 'warp decode' technique speeds up MoE decoding by up to 1.84x on NVIDIA Blackwell GPUs and improves numerical accuracy by organizing parallel work around output neurons instead of experts.

Apr 6 at 8:01 PM · 3 min read

Cursor has unveiled a new inference optimization for Mixture-of-Experts (MoE) models, dubbed "warp decode." The technique speeds up token generation and brings outputs numerically closer to a full-precision reference.

By fundamentally altering how computation is parallelized, warp decode achieves up to 1.84x faster inference on NVIDIA Blackwell GPUs. This advancement is crucial for the rapid iteration and deployment of models like Composer.

Flipping the Parallelism Axis

Traditional MoE inference organizes computation around specialized 'expert' networks. This works well for large batches but is inefficient for single-token autoregressive decoding, where one token is generated at a time.

In the conventional pipeline, five of the eight stages only move or organize data rather than compute. That overhead is most costly during decode, when there is too little work per expert to amortize it.

Warp Decode: A New Approach

Warp decode tackles this by reorienting parallelism around output neurons. Each GPU 'warp' (a group of 32 parallel threads) is assigned to compute a single output value.
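
The warp-per-output idea can be sketched on the CPU. The snippet below is a minimal simulation in plain Python, not Cursor's kernel: the names warp_dot and layer_outputs are hypothetical, and the strided split of the dot product across the 32 lanes is an assumption about how the work might be divided inside a warp.

```python
WARP_SIZE = 32  # threads per warp, as stated in the text


def warp_dot(weight_row, activations):
    """Simulate one warp computing a single output value."""
    # Each lane accumulates a strided slice of the dot product
    # (assumed work split; the article does not specify it).
    partials = [0.0] * WARP_SIZE
    for lane in range(WARP_SIZE):
        for i in range(lane, len(activations), WARP_SIZE):
            partials[lane] += weight_row[i] * activations[i]
    # The lanes' partial sums are combined into the one output value.
    return sum(partials)


def layer_outputs(weights, activations):
    # One simulated warp per output neuron; warps share no mutable state.
    return [warp_dot(row, activations) for row in weights]


hidden = 64
activations = [0.5] * hidden
weights = [[1.0] * hidden, [2.0] * hidden, [-1.0] * hidden]
outputs = layer_outputs(weights, activations)
print(outputs)
```

Each entry of outputs depends only on one weight row and the shared, read-only activation vector, which is why no staging buffer or cross-warp synchronization is needed in this model.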

This shift eliminates cumbersome data staging, intermediate buffers, and cross-warp synchronization. The entire MoE compute layer is condensed into just two kernels.

Dual Kernels, Single Pass

The process involves a 'gate/up' kernel and a 'down' kernel. In the gate/up phase, each warp loads necessary weights and input activations, performing calculations directly into registers.

Crucially, the gate and up projections are fused into that single kernel, so the activation vector is read once and reused for both projections without shared memory staging.
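
A rough illustration of reading the activation vector once for both projections follows. It is a sketch under assumptions: the function names are hypothetical, and the SiLU-gated (SwiGLU-style) combination is assumed because the article does not name the activation function.

```python
import math


def silu(x):
    return x / (1.0 + math.exp(-x))


def fused_gate_up(gate_row, up_row, activations):
    """One pass over the activations feeds both projections."""
    gate_acc = 0.0
    up_acc = 0.0
    for i, a in enumerate(activations):  # each activation is read once
        gate_acc += gate_row[i] * a
        up_acc += up_row[i] * a
    # Assumed SwiGLU-style combination; not confirmed by the article.
    return silu(gate_acc) * up_acc


def unfused_gate_up(gate_row, up_row, activations):
    """Reference: two separate passes over the activations."""
    gate = sum(g * a for g, a in zip(gate_row, activations))
    up = sum(u * a for u, a in zip(up_row, activations))
    return silu(gate) * up


activations = [0.25, -0.5, 1.0, 2.0]
gate_row = [1.0, 0.5, -1.0, 0.25]
up_row = [0.5, 1.0, 2.0, -0.5]
fused = fused_gate_up(gate_row, up_row, activations)
reference = unfused_gate_up(gate_row, up_row, activations)
print(fused, reference)
```

Both versions produce the same value; the fused loop simply halves the number of times the activation vector is loaded, which matters when decode is limited by memory traffic.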

The down kernel then aggregates results from multiple experts into a single accumulator per warp. This aggregation needs neither shared memory nor explicit barriers; it relies on warp-level hardware primitives to combine values across threads.
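
The aggregation step can be modeled as follows. This is a simulation under stated assumptions: the article mentions only "hardware primitives", so the butterfly pattern here (modeled on XOR-style warp shuffles such as CUDA's __shfl_xor_sync) is a guess at the mechanism, and the routing weights and function names are hypothetical.

```python
WARP_SIZE = 32


def butterfly_reduce(lane_values):
    """Sum 32 lane values the way an XOR-shuffle reduction would."""
    vals = list(lane_values)
    offset = WARP_SIZE // 2
    while offset > 0:
        # Every lane adds the value held by its partner lane (lane XOR offset).
        vals = [vals[lane] + vals[lane ^ offset] for lane in range(WARP_SIZE)]
        offset //= 2
    return vals[0]  # after 5 steps every lane holds the full sum


def down_output(down_rows, intermediates, routing_weights):
    """One warp computes one output element across all routed experts."""
    lane_acc = [0.0] * WARP_SIZE  # per-lane accumulators, standing in for registers
    for row, inter, w in zip(down_rows, intermediates, routing_weights):
        for lane in range(WARP_SIZE):
            for i in range(lane, len(inter), WARP_SIZE):
                lane_acc[lane] += w * row[i] * inter[i]
    return butterfly_reduce(lane_acc)


inter_dim = 64
down_rows = [[1.0] * inter_dim, [2.0] * inter_dim]       # one row per expert
intermediates = [[0.5] * inter_dim, [0.25] * inter_dim]  # per-expert gate/up results
routing_weights = [0.75, 0.25]                           # assumed router weights
result = down_output(down_rows, intermediates, routing_weights)
print(result)
```

Because every expert's contribution lands in the same per-lane accumulators, no per-expert output buffer or separate combine pass appears in this model.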

Streamlined Pipeline, Enhanced Performance

Warp decode streamlines the inference pipeline by eliminating unnecessary stages and buffers. Padding, scattering, and combining steps are removed.

Intermediate buffers, like the activation gather buffer and per-expert output buffer, are also eliminated. This frees up valuable L2 cache capacity for critical weight data.

Embarrassingly Parallel Design

The core of warp decode's efficiency lies in its 'embarrassingly parallel' design. Each warp operates independently, with no shared mutable state between them.

This independence allows the GPU scheduler to dynamically allocate work, effectively hiding memory latency by switching to ready warps. The result is a linear scaling of performance with increased output dimensions or token batches.
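
The independence property can be checked with a toy model: if each output is computed from read-only inputs, the order in which units of work run cannot change the result. The sketch below uses hypothetical names and a seeded shuffle to stand in for an arbitrary scheduler order.

```python
import random


def compute_output(row, activations):
    # Reads only its own weight row and the shared, read-only activations.
    return sum(w * a for w, a in zip(row, activations))


activations = [1.0, 2.0, 3.0, 4.0]
weights = [[float(r + c) for c in range(4)] for r in range(8)]

# Baseline: outputs computed in index order.
in_order = [compute_output(row, activations) for row in weights]

# Same work in a shuffled order, as a scheduler might pick ready warps.
order = list(range(len(weights)))
random.Random(0).shuffle(order)
shuffled = [None] * len(weights)
for idx in order:
    shuffled[idx] = compute_output(weights[idx], activations)

print(shuffled == in_order)
```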

Tangible Results: Speed and Accuracy

Testing on NVIDIA B200 GPUs demonstrated a consistent throughput gain of 1.8x. This improvement holds across various context lengths, confirming it's a pure generation-time benefit.

Beyond speed, warp decode enhances accuracy. By keeping activations in BF16 and accumulators in FP32, it avoids quantization errors inherent in the traditional path. This results in outputs 1.4x closer to full FP32 precision.
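
To see why accumulator precision matters, the sketch below rounds inputs to BF16 and compares an FP32 accumulator with a BF16 one. It illustrates the general effect only; it does not reproduce the traditional path, whose quantization details the article does not give. The helpers emulate the formats with Python's struct module, and all names are hypothetical.

```python
import struct


def to_fp32(x):
    """Round a Python float to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x):
    """Round to BF16 (round-to-nearest-even on the top 16 bits of FP32)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def dot(weights, activations, round_acc):
    """Dot product with BF16 inputs and a caller-chosen accumulator format."""
    acc = 0.0
    for w, a in zip(weights, activations):
        acc = round_acc(acc + to_bf16(w) * to_bf16(a))
    return acc


n = 1024
weights = [0.01] * n
activations = [1.0] * n
# Reference sum of the same BF16 inputs, accumulated in double precision.
exact = sum(to_bf16(w) * to_bf16(a) for w, a in zip(weights, activations))
err_fp32 = abs(dot(weights, activations, to_fp32) - exact)
err_bf16 = abs(dot(weights, activations, to_bf16) - exact)
print(err_fp32, err_bf16)
```

In this toy case the BF16 accumulator stops growing once each addend is smaller than half its rounding step, while the FP32 accumulator stays close to the reference.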

Hardware Efficiency and Scalability

The technique pushes hardware limits, sustaining 58% of the B200's peak memory bandwidth. This efficiency is vital for high-performance AI workloads.
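
For a sense of scale, bandwidth utilization is achieved bytes per second divided by peak bytes per second. The sketch below is a worked example, not a measurement: the 8 TB/s peak is an assumed B200 figure that the article does not state, and the byte count and timing are hypothetical values chosen to land on 58%.

```python
def bandwidth_utilization(bytes_moved, seconds, peak_bytes_per_second):
    """Fraction of peak memory bandwidth actually sustained."""
    achieved = bytes_moved / seconds
    return achieved / peak_bytes_per_second


PEAK = 8e12  # assumed B200 peak of 8 TB/s; not stated in the article

# Hypothetical run: 4.64 GB of weights streamed in 1 ms.
util = bandwidth_utilization(4.64e9, 1e-3, PEAK)
print(f"{util:.0%}")
```

Under that assumed peak, 58% corresponds to roughly 4.6 TB/s of sustained reads.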

Outputs continue to match reference implementations closely across different batch sizes, so the speedup does not come at the cost of correctness.

Composer Training Benefits

Warp decode is specifically advantageous for MoE decode scenarios where shared work per expert is minimal. It complements, rather than replaces, expert-centric execution for tasks like prefill and large-batch inference.

This optimization accelerates the Composer research and training pipeline, enabling faster model improvements and more frequent updates for developers.

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.
