Diffusion Language Models (DLMs) hold the promise of parallel token generation, a key advantage for low-latency inference. In practice, however, many implementations still revert to left-to-right, autoregressive (AR) decoding, which leaves parallel hardware underutilized and reintroduces the very latency bottleneck DLMs were meant to remove. This research probes the root cause of this AR-like behavior and proposes a novel solution. The original research can be found on arXiv.
The authors argue that DLMs fall back on AR-like decoding primarily because of a fundamental misalignment between their training objectives and the inherently sequential nature of common training data, including standard pretraining corpora and even the long chain-of-thought supervision used for complex reasoning tasks. To address this, they introduce NAP (Non-Autoregressive Parallel DLMs), a proof-of-concept focused on data curation. NAP restructures each training example into multiple independent reasoning paths and couples this with a parallel-forced decoding strategy, encouraging the model to update several tokens simultaneously rather than generate them one at a time. A sketch of what such a decoding loop might look like is given below.
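To make the parallel-forced idea concrete, here is a minimal, hypothetical sketch of a decoding loop that commits the `k` most confident masked positions at every step instead of unmasking a single token. This is not the paper's implementation: the mask-token id, the confidence criterion, and the `parallel_forced_decode` helper are all assumptions for illustration.

```python
import torch

MASK_ID = 0  # hypothetical mask-token id


def parallel_forced_decode(model, ids, k=4, steps=8):
    """Illustrative parallel-forced decoding loop (not the paper's exact algorithm).

    At every step, the k most confident masked positions are committed
    simultaneously, instead of filling in one token per step as an
    AR-like sampler would.
    """
    ids = ids.clone()
    for _ in range(steps):
        masked = (ids == MASK_ID).nonzero(as_tuple=True)[0]
        if masked.numel() == 0:
            break  # every position has been filled in
        logits = model(ids.unsqueeze(0)).squeeze(0)   # (seq_len, vocab)
        probs = logits.softmax(dim=-1)
        conf, tokens = probs.max(dim=-1)              # per-position confidence and argmax token
        # Rank only the still-masked positions by confidence and take the top k.
        order = conf[masked].argsort(descending=True)
        chosen = masked[order[: min(k, masked.numel())]]
        ids[chosen] = tokens[chosen]                  # commit k tokens in one parallel update
    return ids


# Toy usage with a stand-in "model" that returns random logits.
vocab, seq_len = 100, 16
dummy_model = lambda x: torch.randn(x.shape[0], x.shape[1], vocab)
out = parallel_forced_decode(dummy_model, torch.full((seq_len,), MASK_ID), k=4)
```

With `k > 1` the sequence is completed in far fewer forward passes than tokens, which is the behavior the curated multi-path training data is meant to make reliable.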