The relentless pursuit of scale in Large Language Models (LLMs) has primarily focused on increasing depth. This architectural trend, however, runs into a fundamental challenge: signal degradation. As information propagates through the many residual updates of a deep network, features learned in shallower layers become diluted, degrading model quality. To address this directly, researchers have introduced Mixture-of-Depths Attention (MoDA), a mechanism designed to preserve and exploit information across different depths within the model.
Revitalizing Deep Layer Information Flow
MoDA fundamentally redefines how attention heads operate. Rather than attending only to key-value (KV) pairs within the current layer, each head can access sequence KV pairs from its own layer alongside depth KV pairs drawn from preceding layers. This cross-depth attention directly counters signal dilution: by letting deeper layers reference and integrate shallow-layer representations, MoDA ensures that valuable learned features are reinforced and reused, rather than lost, as computation proceeds through the network's depth.
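The idea of a head attending over its own layer's KV pairs together with KV pairs retained from earlier layers can be sketched as follows. This is a minimal single-head illustration under stated assumptions, not the paper's implementation: the function name `cross_depth_attention`, the flat concatenation of depth KV pairs along the sequence axis, and the absence of any causal mask or per-depth gating are all simplifications for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_depth_attention(q, kv_current, kv_depth):
    """Single-head attention over the current layer's (sequence) KV pairs
    concatenated with KV pairs carried over from preceding layers.

    q          : (T, d) queries for the current layer
    kv_current : tuple (K, V), each (T, d), from the current layer
    kv_depth   : list of (K, V) tuples from earlier layers (may be empty)
    """
    keys = [kv_current[0]] + [k for k, _ in kv_depth]
    vals = [kv_current[1]] + [v for _, v in kv_depth]
    K = np.concatenate(keys, axis=0)  # (T * (1 + num_prev_layers), d)
    V = np.concatenate(vals, axis=0)
    # Scaled dot-product attention over the combined sequence+depth keys.
    scores = (q @ K.T) / np.sqrt(q.shape[-1])
    return softmax(scores) @ V

# Usage: one earlier layer contributes depth KV pairs.
rng = np.random.default_rng(0)
T, d = 4, 8
q = rng.normal(size=(T, d))
kv_now = (rng.normal(size=(T, d)), rng.normal(size=(T, d)))
kv_prev = [(rng.normal(size=(T, d)), rng.normal(size=(T, d)))]
out = cross_depth_attention(q, kv_now, kv_prev)
```

With an empty `kv_depth` list this reduces to ordinary single-layer attention, which makes the cross-depth extension easy to see: the only change is the widened key/value set.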
Hardware-Efficient Scaling with Near-Optimal Performance
A critical aspect of MoDA's innovation lies in its practical implementation. The researchers developed a hardware-efficient algorithm that manages the non-contiguous memory access patterns inherent to cross-depth attention. This optimization matters for real-world deployment: MoDA reaches 97.3% of FlashAttention-2's efficiency at a sequence length of 64K, with only a modest 3.7% increase in FLOPs. Experiments with 1.5B-parameter models validate its efficacy: MoDA consistently outperforms strong baselines, with a 0.2 average perplexity improvement across 10 validation benchmarks and a 2.11% gain in average performance on 10 downstream tasks.