DMax: Parallel Decoding for Diffusion LLMs

Progressive Self-Refinement Over Mask Embeddings

Conventional masked diffusion LLMs (dLLMs) decode through a binary mask-to-token transition: each position is either still masked or already committed to a token. DMax reframes decoding as progressive self-refinement, in which the model iteratively refines mask embeddings into token embeddings instead of switching in a single step. Because a position is not locked in by its first prediction, the model can revise earlier mistakes, which directly targets the error accumulation inherent in parallel generation.

On-Policy Uniform Training for Robustness

DMax's training strategy, On-Policy Uniform Training, unifies masked and uniform dLLMs: the model learns to recover the correct tokens both from masked inputs and from its own erroneous predictions made during generation. This robustness is what preserves accuracy as decoding parallelism increases.
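
The idea of training on both masked inputs and the model's own predictions can be illustrated with a minimal sketch. Everything here is an assumption for illustration rather than DMax's actual recipe: the `MASK_ID` value, the `model_predict` callable, and the `corrupt_rate` and `keep_mask_prob` parameters are hypothetical names, and the way masks and predictions are mixed is a guess at one plausible scheme.

```python
import random

MASK_ID = -1  # hypothetical mask token id, chosen only for this sketch


def build_on_policy_input(clean_tokens, model_predict, corrupt_rate,
                          keep_mask_prob, rng):
    """Build a noisy training input from a clean sequence.

    Assumed scheme (not taken from the DMax paper):
      1. Pick positions to corrupt and mask them.
      2. Ask the current model for predictions on the masked sequence.
      3. At each corrupted position, either keep the mask or substitute
         the model's own (possibly wrong) prediction.
    The training target would be the clean token at every corrupted position.
    """
    corrupted = [i for i in range(len(clean_tokens))
                 if rng.random() < corrupt_rate]

    masked = list(clean_tokens)
    for i in corrupted:
        masked[i] = MASK_ID

    # On-policy step: predictions come from the model being trained.
    predictions = model_predict(masked)

    noisy = list(masked)
    for i in corrupted:
        if rng.random() >= keep_mask_prob:
            noisy[i] = predictions[i]
    return noisy, corrupted


def dummy_model(tokens):
    """Stand-in model that always predicts token id 7 at every position."""
    return [7 for _ in tokens]


if __name__ == "__main__":
    rng = random.Random(0)
    noisy, positions = build_on_policy_input(
        [1, 2, 3, 4, 5], dummy_model, corrupt_rate=0.6,
        keep_mask_prob=0.5, rng=rng)
    print(noisy, positions)
```

With `keep_mask_prob=1.0` the sketch reduces to ordinary masked training, and with `keep_mask_prob=0.0` every corrupted position carries a model prediction, which is the sense in which the two input types are treated as ends of one spectrum here.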

Soft Parallel Decoding for Extreme Efficiency

Building on this training, DMax introduces Soft Parallel Decoding, which represents intermediate decoding states as interpolations between predicted token embeddings and mask embeddings. The model can therefore revise its own predictions iteratively, directly in embedding space, which is what enables the speedups. In the reported experiments, DMax raises tokens per forward pass (TPF) on GSM8K from 2.04 to 5.47 (about 2.7x) and on MBPP from 2.71 to 5.86 (about 2.2x) while maintaining accuracy. On two H200 GPUs, the model averages 1,338 tokens per second (TPS) at batch size 1.
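
The interpolation described above can be sketched in a few lines. This is an illustrative toy, not the DMax implementation: using the maximum softmax probability as the blend weight is an assumption, and `soft_state`, the embedding table, and the mask embedding are hypothetical names and values.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_state(logits, token_embeddings, mask_embedding):
    """Blend the predicted token's embedding with the mask embedding.

    Assumption for this sketch: the blend weight is the model's confidence
    (the largest softmax probability). A confident prediction yields a state
    close to the token embedding; an uncertain one stays close to the mask.
    """
    probs = softmax(logits)
    token_id = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[token_id]
    token_vec = token_embeddings[token_id]
    blended = [confidence * t + (1.0 - confidence) * m
               for t, m in zip(token_vec, mask_embedding)]
    return token_id, confidence, blended


if __name__ == "__main__":
    # Toy vocabulary of three tokens with 2-dimensional embeddings.
    embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    mask = [0.0, 0.0]
    print(soft_state([0.0, 0.0, 0.0], embeddings, mask))   # uncertain
    print(soft_state([10.0, 0.0, 0.0], embeddings, mask))  # confident
```

Feeding the blended vector back as the next step's input, instead of a hard token or a hard mask, is what would let a later step move a position toward a different token if the first guess was wrong.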
