Cursor has unveiled "warp decode," a new inference optimization for Mixture-of-Experts (MoE) models. The technique promises substantially faster and more accurate token generation.
By changing how the computation is parallelized, warp decode achieves up to 1.84x faster inference on NVIDIA Blackwell GPUs. That speedup matters for the rapid iteration and deployment of models like Composer.
Flipping the Parallelism Axis
Traditional MoE inference organizes computation around specialized 'expert' networks: tokens are routed to their experts, processed there, and the results are combined. This works well for large batches, where each expert receives enough tokens to amortize the routing cost, but it is inefficient for autoregressive decoding, where one token is generated at a time.
The conventional pipeline carries significant overhead: five of its eight stages only manage data rather than perform computation. That overhead weighs most heavily during decode.
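To make the data-management overhead concrete, here is a minimal Python sketch of an expert-centric layer. It is an illustration under simplifying assumptions, not Cursor's pipeline: each expert is reduced to a single weight matrix, and the names `expert_centric_moe`, `routes`, and `matvec` are hypothetical. The bookkeeping steps marked in the comments are generic and are not meant to map one-to-one onto the eight stages mentioned above.

```python
from collections import defaultdict


def matvec(weight, x):
    # weight is a list of rows, one row of input weights per output neuron.
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def expert_centric_moe(tokens, routes, experts):
    """tokens: list of input vectors.
    routes[t]: list of (expert_id, gate_weight) pairs chosen for token t.
    experts: list of weight matrices (assumption: one linear layer each).
    """
    out_dim = len(experts[0])

    # Bookkeeping: bucket (token index, gate weight) pairs by expert.
    buckets = defaultdict(list)
    for t, picks in enumerate(routes):
        for expert_id, gate in picks:
            buckets[expert_id].append((t, gate))

    outputs = [[0.0] * out_dim for _ in tokens]
    for expert_id, assigned in sorted(buckets.items()):
        # Bookkeeping: gather this expert's tokens into a staging buffer.
        staged = [tokens[t] for t, _ in assigned]
        # Compute: the only step that multiplies inputs by expert weights.
        results = [matvec(experts[expert_id], x) for x in staged]
        # Bookkeeping: scatter results back, scaled by the gate weight.
        for (t, gate), y in zip(assigned, results):
            for j in range(out_dim):
                outputs[t][j] += gate * y[j]
    return outputs


if __name__ == "__main__":
    experts = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]]]
    tokens = [[1.0, 2.0]]           # a single decode-time token
    routes = [[(0, 0.5), (1, 0.5)]]  # routed to both experts equally
    print(expert_centric_moe(tokens, routes, experts))
```

With a single token, every bucket holds one entry, so the grouping, gathering, and scattering steps run in full while the compute step does very little work per expert.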
Warp Decode: A New Approach
Warp decode tackles this by reorienting parallelism around output neurons. Each GPU 'warp' (a group of 32 parallel threads) is assigned to compute a single output value.
This shift eliminates cumbersome data staging, intermediate buffers, and cross-warp synchronization. The entire MoE compute layer is condensed into just two kernels.
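The output-centric layout can be sketched in plain Python by simulating one warp per output value. This is a rough illustration under stated assumptions, not Cursor's kernels: each expert is reduced to a single weight matrix, the 32 lanes are simulated with a loop, the lane-sum step stands in for a hardware warp-level reduction, and the names `warp_output` and `output_centric_moe` are hypothetical. The sketch does not model how the real implementation splits the work across its two kernels.

```python
WARP_SIZE = 32  # threads per warp, as stated in the text


def warp_output(neuron, x, routes, experts):
    """Work done by one simulated warp: one output value for one token.

    routes: list of (expert_id, gate_weight) pairs chosen for this token.
    experts: list of weight matrices (assumption: one linear layer each).
    """
    total = 0.0
    for expert_id, gate in routes:
        row = experts[expert_id][neuron]
        # Each lane accumulates a strided slice of the dot product.
        lane_sums = [
            sum(row[i] * x[i] for i in range(lane, len(x), WARP_SIZE))
            for lane in range(WARP_SIZE)
        ]
        # Stand-in for a reduction across the 32 lanes of the warp.
        total += gate * sum(lane_sums)
    return total


def output_centric_moe(x, routes, experts):
    out_dim = len(experts[0])
    # One independent unit of work per output neuron: no token buckets,
    # no staging buffer, and no communication between simulated warps.
    return [warp_output(j, x, routes, experts) for j in range(out_dim)]


if __name__ == "__main__":
    experts = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]]]
    x = [1.0, 2.0]
    routes = [(0, 0.5), (1, 0.5)]
    print(output_centric_moe(x, routes, experts))
```

Each simulated warp reads the token and the relevant weight rows directly and accumulates its own result, which is why this layout needs no intermediate buffers or cross-warp synchronization.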
