Diffusion large language models (dLLMs) offer a compelling alternative to autoregressive (AR) models with potential for faster inference. However, their masked language modeling paradigm has historically precluded them from benefiting from speculative decoding, a critical acceleration technique for AR models. This paper introduces a solution to this disconnect.
Unlocking Speculative Decoding for dLLMs
The core challenge lies in the dLLM's masked language modeling formulation, which relies on bidirectional attention and mask tokens. In AR models, causal masking guarantees that each token is verified against a temporally valid context, meaning only the tokens that precede it. In dLLMs, the context seen by each position shifts across denoising steps, so standard token-level speculative verification does not apply. The proposed solution, SimSD, introduces a plug-and-play masking strategy. By incorporating reference tokens from a draft model and carefully designing an attention mask, SimSD equips dLLMs with temporally valid contexts. This lets them compute valid logits for multiple drafted tokens in a single forward pass, restoring the verification capability that speculative decoding depends on while retaining dLLMs' parallel decoding advantages.
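To make the verification step concrete, here is a minimal sketch of the generic greedy verification rule used in speculative decoding: accept drafted tokens as long as they match the verifier's top choice, and substitute the verifier's token at the first mismatch. It is not SimSD's attention-mask design, which this summary does not specify. The function name `greedy_verify` and the layout of `target_logits` (one row per drafted position plus one extra row, all assumed to come from a single forward pass) are assumptions made for illustration.

```python
from typing import List, Sequence


def argmax(row: Sequence[float]) -> int:
    """Index of the largest value in a row of logits (first one on ties)."""
    return max(range(len(row)), key=lambda i: row[i])


def greedy_verify(
    draft_tokens: List[int],
    target_logits: List[Sequence[float]],
) -> List[int]:
    """Greedy speculative verification (illustrative, hypothetical helper).

    target_logits[i] is assumed to hold the verifier's logits for drafted
    position i, conditioned on the prefix plus draft_tokens[:i]. One extra
    row gives the logits for the position after the last drafted token.
    """
    if len(target_logits) != len(draft_tokens) + 1:
        raise ValueError("expected one logits row per draft token, plus one")

    accepted: List[int] = []
    for i, drafted in enumerate(draft_tokens):
        verified = argmax(target_logits[i])
        if verified != drafted:
            # First mismatch: keep the verifier's token and stop.
            accepted.append(verified)
            return accepted
        accepted.append(drafted)

    # Every draft token matched, so the extra row yields a bonus token.
    accepted.append(argmax(target_logits[-1]))
    return accepted
```

Under this rule each verification pass emits at least one token, and up to `len(draft_tokens) + 1` when the whole draft is accepted, which is where the speedup comes from when the draft model agrees with the verifier often.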
Significant Throughput Gains with Quality Preservation
The SimSD speculative decoding algorithm is training-free and is compatible with other acceleration methods such as KV caching and blockwise decoding. In experiments on the SDAR family of dLLMs across four benchmarks, the researchers observed up to 7.46x higher decoding throughput. This acceleration was achieved while maintaining, and in some cases improving, the average generation quality. The results suggest that SimSD can make dLLM inference substantially more efficient without compromising output quality.