The pursuit of competitive performance in diffusion large language models (dLLMs) has historically necessitated massive parameter counts. While existing distillation techniques focus on reducing inference steps within a single architecture, they fail to address the crucial challenge of cross-architecture knowledge transfer. This limitation hinders the efficient scaling of dLLMs by preventing the transfer of insights from larger, more complex models to smaller, more agile ones with fundamentally different internal structures.
Bridging Architectural Divides in dLLMs
Researchers have introduced TIDE, the first framework designed for cross-architecture dLLM distillation. TIDE employs three novel components to facilitate knowledge transfer between teacher and student models that may differ in their architecture, attention mechanisms, and tokenizers. This breakthrough moves beyond intra-architecture distillation, enabling a more flexible and efficient path to deploying high-performing dLLMs.
TIDE: Modular Innovations for Knowledge Transfer
The TIDE framework comprises three key innovations: TIDAL, CompDemo, and Reverse CALM. TIDAL modulates distillation strength based on training progress and diffusion timestep, accounting for the fact that the teacher's outputs become less reliable at higher noise levels. CompDemo strengthens the teacher's contextual understanding through complementary mask splitting, which is particularly effective under heavy masking. Finally, Reverse CALM introduces a cross-tokenizer objective that inverts chunk-level likelihood matching, providing bounded gradients and dual-end noise filtering. Together, these modular components enable robust knowledge distillation across heterogeneous dLLM architectures.
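To make two of these ideas concrete, the sketch below illustrates (a) a timestep- and progress-dependent distillation weight in the spirit of TIDAL, and (b) a complementary split of masked positions in the spirit of CompDemo. The exact schedules and splitting rules used in TIDE are not reproduced here; the function names, the linear warm-up, and the even/odd partition are illustrative assumptions, not the paper's implementation.

```python
def tidal_weight(timestep, train_step, total_steps, max_t=1000):
    """Hypothetical TIDAL-style schedule (sketch, not the paper's formula).

    Scales distillation strength by (a) the diffusion timestep -- the
    teacher's predictions are assumed less reliable at high noise levels --
    and (b) training progress, ramping up over an assumed 10% warm-up.
    """
    noise_reliability = 1.0 - timestep / max_t          # low noise -> trust teacher more
    progress = min(1.0, train_step / (0.1 * total_steps))  # linear warm-up (assumption)
    return noise_reliability * progress


def complementary_mask_split(mask_positions):
    """CompDemo-style complementary splitting (sketch).

    Partitions the set of masked positions into two disjoint halves, so each
    teacher view keeps the other half visible as context -- one plausible way
    to preserve contextual signal under heavy masking.
    """
    ordered = sorted(mask_positions)
    return set(ordered[0::2]), set(ordered[1::2])
```

For example, `complementary_mask_split({3, 7, 12, 20})` yields two disjoint halves whose union is the original mask set, and `tidal_weight` approaches its maximum only when the timestep is low and warm-up is complete.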
Unlocking Efficiency and Performance Gains
The efficacy of TIDE is demonstrated by distilling large teachers (an 8B dense model and a 16B MoE model) into a significantly smaller 0.6B student. Across eight benchmarks, this approach outperforms baselines by an average of 1.53 points. The gains are most pronounced in code generation: the student's HumanEval score reaches 48.78, up from 32.3 for the autoregressive baseline. These results highlight TIDE's strategic advantage in producing highly capable yet computationally efficient diffusion large language models.