DMax: Parallel Decoding for Diffusion LLMs

DMax accelerates diffusion language models with Soft Parallel Decoding, more than doubling tokens per forward pass (TPF) while preserving accuracy and reaching 1,338 tokens per second (TPS).

The quest for efficient large language models (LLMs) often centers on optimizing inference. Diffusion language models (dLLMs), while promising, have grappled with error accumulation during parallel decoding: the more tokens are committed per step, the more early mistakes persist and compound across subsequent steps. A new approach, DMax, rethinks the decoding process itself to address this limitation, enabling aggressive parallelism without sacrificing generation quality.

Progressive Self-Refinement Over Mask Embeddings

Traditional dLLMs rely on a binary mask-to-token transition. DMax reframes this process as a progressive self-refinement. Instead of a direct transition, the model iteratively refines mask embeddings into token embeddings. This core innovation allows for a more nuanced and robust decoding process, directly tackling the error accumulation problem inherent in parallel generation.
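To make the idea concrete, here is a minimal sketch of what progressive refinement could look like. The `model` interface, function names, and shapes are illustrative assumptions, not DMax's actual API:

```python
import torch

def progressive_refine(model, mask_emb, seq_len, num_steps):
    """Iteratively refine mask embeddings toward token embeddings.

    Hypothetical sketch: `model` is assumed to map a sequence of
    embeddings (seq_len, d) to refined embeddings of the same shape.
    """
    # Every position starts from the shared mask embedding.
    x = mask_emb.expand(seq_len, -1).clone()
    for _ in range(num_steps):
        # Each pass nudges the embeddings toward token embeddings
        # instead of committing to a hard mask-to-token transition.
        x = model(x)
    return x
```

The key contrast with a standard masked dLLM is that no position is irrevocably "unmasked" at any single step; the whole sequence drifts toward token embeddings together.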

On-Policy Uniform Training for Robustness

Central to DMax's success is its novel training strategy: On-Policy Uniform Training. This method effectively unifies masked and uniform dLLMs, equipping the model with the ability to recover from both masked inputs and its own erroneous predictions during generation. This is crucial for maintaining accuracy when pushing the boundaries of decoding parallelism.
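The article does not spell out the exact recipe, but one plausible reading is a training step that corrupts sequences with both mask tokens and the model's own sampled outputs. The sketch below illustrates that reading in PyTorch; `model`, `mask_id`, and the 50/50 corruption split are assumptions for illustration only:

```python
import torch
import torch.nn.functional as F

def on_policy_uniform_step(model, tokens, mask_id, corrupt_ratio=0.5):
    """Hypothetical training step mixing masked positions with
    on-policy corruptions drawn from the model's own predictions."""
    corrupted = tokens.clone()
    noise = torch.rand(tokens.shape, device=tokens.device) < corrupt_ratio
    # Some corrupted positions become masks (masked-dLLM behavior)...
    as_mask = noise & (torch.rand(tokens.shape, device=tokens.device) < 0.5)
    corrupted[as_mask] = mask_id
    # ...the rest take the model's own greedy predictions, so it learns
    # to recover from its own erroneous outputs (uniform-dLLM behavior).
    with torch.no_grad():
        self_preds = model(corrupted).argmax(dim=-1)
    as_self = noise & ~as_mask
    corrupted[as_self] = self_preds[as_self]
    # Supervise the corrupted positions against the clean targets.
    logits = model(corrupted)
    return F.cross_entropy(logits[noise], tokens[noise])
```

Training on the model's own mistakes, rather than only on artificially masked inputs, is what makes recovery during aggressive parallel decoding plausible.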

Soft Parallel Decoding for Extreme Efficiency

Building on the refined training, DMax introduces Soft Parallel Decoding. This technique represents intermediate decoding states as interpolations between predicted token embeddings and mask embeddings. This allows for iterative self-revision directly in the embedding space, facilitating significant speedups. Experiments show DMax improving TPF on GSM8K from 2.04 to 5.47 and on MBPP from 2.71 to 5.86, while maintaining accuracy. On two H200 GPUs, the model achieves an average of 1,338 TPS at batch size 1, demonstrating a substantial leap in inference efficiency for diffusion language models.
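As a rough illustration of the interpolation idea, one decoding step might look like the following. The model interface and the per-position confidence signal are assumptions, not DMax's published implementation:

```python
import torch

def soft_parallel_step(model, state, mask_emb):
    """One hypothetical soft decoding step over the embedding sequence.

    Assumes `model` returns predicted token embeddings of shape (L, d)
    and per-position confidences in [0, 1] of shape (L,).
    """
    pred_emb, conf = model(state)
    alpha = conf.unsqueeze(-1)  # (L, 1), broadcasts over embedding dim
    # Confident positions move toward their predicted token embedding;
    # uncertain positions stay near the mask embedding, so later passes
    # can still revise them instead of locking in an early error.
    return alpha * pred_emb + (1.0 - alpha) * mask_emb
```

Keeping uncertain positions "soft" in this way is what lets many tokens be advanced per forward pass without the error accumulation that hard, irreversible commitments cause.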
