Bridging Diffusion LLMs and Speculative Decoding

SimSD, a new speculative decoding method, lets diffusion LLMs reach up to 7.46x higher decoding throughput without sacrificing generation quality.

Jun 2 at 8:00 PM · 5 min read

Figure: The SimSD framework enables speculative decoding for diffusion LLMs.

Visual TL;DR: dLLMs vs AR Models → (problem) Speculative Decoding Barrier → (solution) SimSD Method → (enables) Temporally Valid Contexts → (leads to) Throughput Gains, with Quality Preservation.

dLLMs vs AR Models: diffusion LLMs offer faster inference potential than autoregressive models
Speculative Decoding Barrier: dLLMs' masked modeling prevents standard token-level speculative verification
SimSD Method: plug-and-play masking strategy with reference tokens and attention mask
Temporally Valid Contexts: enables dLLMs to compute valid contexts for token verification
Throughput Gains: achieve up to 7.46x higher throughput
Quality Preservation: without sacrificing generation quality


Diffusion large language models (dLLMs) offer a compelling alternative to autoregressive (AR) models with potential for faster inference. However, their masked language modeling paradigm has historically precluded them from benefiting from speculative decoding, a critical acceleration technique for AR models. This paper introduces a solution to this disconnect.

Unlocking Speculative Decoding for dLLMs

The core challenge lies in the dLLM's masked language modeling formulation, which relies on bidirectional attention and mask tokens. In AR models, causal masking ensures that each position sees a temporally valid context for token verification. In a dLLM, by contrast, the context shifts across denoising steps, which prevents standard token-level speculative verification. The proposed solution, SimSD, introduces a plug-and-play masking strategy. By incorporating reference tokens from a draft model and carefully designing an attention mask, SimSD equips dLLMs with temporally valid contexts. This lets them compute valid logits for multiple drafted tokens in a single forward pass, restoring the verification capability that speculative decoding depends on while retaining dLLMs' parallel decoding advantages.
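To make "temporally valid context" concrete, here is a minimal sketch of the standard greedy verification step used in AR speculative decoding. It is not SimSD's masking scheme, which this article does not spell out. The names `verify_draft`, `target_argmax`, and `toy_target` are hypothetical, and `target_argmax` stands in for a single forward pass of the target model. The key assumption is that the prediction at position i depends only on tokens up to i. Causal masking guarantees this for AR models, and it is the property SimSD is described as restoring for dLLMs.

```python
from typing import Callable, List, Sequence


def verify_draft(
    prefix: Sequence[int],
    draft: Sequence[int],
    target_argmax: Callable[[Sequence[int]], List[int]],
) -> List[int]:
    """Greedy speculative verification (generic AR form, not SimSD itself).

    target_argmax(tokens) is a hypothetical stand-in for ONE forward pass:
    entry i is the target's greedy choice for the token after tokens[:i + 1].
    Assumes a non-empty prefix.
    """
    tokens = list(prefix) + list(draft)
    preds = target_argmax(tokens)  # a single pass scores every drafted position

    accepted: List[int] = []
    for j, drafted in enumerate(draft):
        # Prediction computed from the context that precedes draft[j].
        expected = preds[len(prefix) - 1 + j]
        if drafted != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(drafted)

    # Every drafted token matched, so the same pass yields one extra token.
    accepted.append(preds[len(tokens) - 1])
    return accepted


def toy_target(tokens: Sequence[int]) -> List[int]:
    """Toy 'model' (assumption for illustration): next token = (current + 1) % 10."""
    return [(t + 1) % 10 for t in tokens]


if __name__ == "__main__":
    print(verify_draft([1, 2], [3, 4, 9], toy_target))  # third draft token is rejected
    print(verify_draft([1, 2], [3, 4, 5], toy_target))  # all accepted, plus a bonus token
```

If the prediction at a drafted position could also see later drafted tokens, as it can under plain bidirectional attention, the comparison inside the loop would no longer test what the target would have generated on its own. That is the barrier the paragraph above describes.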

Significant Throughput Gains with Quality Preservation

The SimSD speculative decoding algorithm is training-free and integrates seamlessly with other acceleration methods like KV caching and blockwise decoding. Experiments on the SDAR-family dLLMs across four benchmarks demonstrate substantial performance improvements. The researchers observed up to 7.46x higher decoding throughput. Critically, this acceleration was achieved while maintaining, and in some cases even improving, the average generation quality. This suggests that SimSD offers a robust path to significantly enhance the efficiency of dLLM inference without compromising output quality.

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.
