The promise of Weight-Decomposed Low-Rank Adaptation (DoRA) for efficient large-model adaptation is hampered by its substantial memory overhead. The standard implementation materializes the dense low-rank product BA as an intermediate when computing the per-column norms of the adapted weight, leading to prohibitive VRAM usage, especially at high ranks and across many adapted modules. This bottleneck has limited the practical application of DoRA, particularly on single-GPU consumer hardware.
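To make the bottleneck concrete, here is a minimal PyTorch sketch of the straightforward norm computation; the shapes and variable names are illustrative, not taken from the authors' code:

```python
import torch

d_out, d_in, r = 4096, 4096, 64
W0 = torch.randn(d_out, d_in)   # frozen base weight
B = torch.randn(d_out, r)       # low-rank adapter factors
A = torch.randn(r, d_in)

# Naive DoRA norm: the full d_out x d_in product B @ A, and the adapted
# matrix W0 + B @ A, are materialized just to read off per-column norms.
# At 4096 x 4096 in float32 that is roughly 64 MiB of intermediate per
# adapted module, and in a training step it would also be held for autograd.
adapted = W0 + B @ A            # dense d_out x d_in intermediate
col_norms = adapted.norm(dim=0) # shape (d_in,)
```

This dense intermediate is the cost the factored norm described next removes.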
Eliminating the DoRA Memory Wall
The authors introduce a novel 'factored norm' approach that decomposes the squared column-norm calculation into terms computable with much smaller intermediates, avoiding the costly dense BA product. This innovation, coupled with fused Triton kernels that consolidate multiple DoRA operations into a single pass, slashes memory traffic by approximately 4x. The result is a numerically stable implementation that maintains precision even in challenging near-unity rescaling scenarios, a critical property for DoRA's weight decomposition.
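One plausible reading of the factored-norm idea, sketched below in plain PyTorch rather than Triton: expand each squared column norm as ||w_j + (BA)_j||^2 = ||w_j||^2 + 2·w_j^T(BA)_j + ||(BA)_j||^2, so the only intermediates beyond per-column accumulators are an r x d_in matrix and an r x r Gram matrix. The function name and exact factoring are illustrative, not the authors' kernel code.

```python
import torch

def factored_col_norms(W0, B, A, eps=1e-8):
    """Per-column norms of W0 + B @ A without forming the dense product."""
    base = (W0 * W0).sum(dim=0)          # ||W0[:, j]||^2, shape (d_in,)
    BtW0 = B.T @ W0                      # (r, d_in) intermediate, not (d_out, d_in)
    cross = 2.0 * (BtW0 * A).sum(dim=0)  # 2 * W0[:, j]^T B A[:, j]
    gram = B.T @ B                       # (r, r)
    quad = ((gram @ A) * A).sum(dim=0)   # A[:, j]^T (B^T B) A[:, j]
    return torch.sqrt((base + cross + quad).clamp_min(eps))

# Sanity check against the dense reference
d_out, d_in, r = 512, 256, 16
W0, B, A = torch.randn(d_out, d_in), torch.randn(d_out, r), torch.randn(r, d_in)
dense = (W0 + B @ A).norm(dim=0)
assert torch.allclose(factored_col_norms(W0, B, A), dense, rtol=1e-3, atol=1e-4)
```

Expanding the square this way trades the large intermediate for summed terms that can partially cancel, which is presumably where the careful handling of near-unity rescaling mentioned above comes in; the fused Triton kernels would then carry out these reductions and the subsequent magnitude rescaling in a single pass over the data rather than several.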