The quest for efficient large language models (LLMs) often centers on optimizing inference. Diffusion language models (dLLMs) are promising because they can decode many tokens in parallel, but they suffer from error accumulation when they do so. A new approach, DMax, addresses this limitation, enabling aggressive parallelism without sacrificing generation quality.
Progressive Self-Refinement Over Mask Embeddings
Traditional dLLMs rely on a binary mask-to-token transition: a position is either still masked or already committed to a token. DMax reframes this process as progressive self-refinement. Instead of a direct transition, the model iteratively refines mask embeddings into token embeddings. Because a position is not locked in by a single step, an early misprediction can still be revised in later iterations, which is how the approach tackles the error accumulation inherent in parallel generation.
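To make the idea of refining a mask embedding into a token embedding more concrete, here is a minimal toy sketch for a single position. It is not DMax's actual update rule, which the text above does not spell out. The linear blending schedule, the tiny three-token vocabulary, and the names `MASK_EMB`, `TOKEN_EMB`, `blend`, `refine`, and `toy_predict_probs` are all assumptions made purely for illustration.

```python
# Toy sketch only: the blending rule below is an illustrative assumption,
# not the published DMax update.

MASK_EMB = [0.0, 0.0]  # hypothetical mask embedding
TOKEN_EMB = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical 3-token vocabulary


def expected_embedding(probs):
    """Probability-weighted average of the token embeddings."""
    dim = len(MASK_EMB)
    return [sum(p * emb[d] for p, emb in zip(probs, TOKEN_EMB)) for d in range(dim)]


def blend(mask_emb, soft_emb, alpha):
    """Interpolate from the mask embedding (alpha=0) to the soft token embedding (alpha=1)."""
    return [(1.0 - alpha) * m + alpha * s for m, s in zip(mask_emb, soft_emb)]


def refine(predict_probs, steps):
    """Move one position from the mask embedding toward a token embedding."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    state = list(MASK_EMB)
    trajectory = [state]
    probs = None
    for t in range(1, steps + 1):
        probs = predict_probs(state)  # stand-in for a model forward pass
        alpha = t / steps  # assumed linear schedule
        state = blend(MASK_EMB, expected_embedding(probs), alpha)
        trajectory.append(state)
    # Commit to the most probable token only after the final refinement step.
    token_id = max(range(len(probs)), key=probs.__getitem__)
    return token_id, trajectory


def toy_predict_probs(state):
    """Hypothetical predictor with fixed output; a real model would condition on `state`."""
    return [0.1, 0.2, 0.7]


token_id, trajectory = refine(toy_predict_probs, steps=4)
```

In this sketch the intermediate states in `trajectory` sit between the mask embedding and a token embedding, in contrast with a binary scheme where the position would jump from one to the other in a single step. A real model would recompute its prediction from the refined state at every iteration, which is where a revision of an early mistake could happen.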