The promise of Weight-Decomposed Low-Rank Adaptation (DoRA) for efficient large-model adaptation is hampered by its substantial memory overhead. The standard implementation materializes the dense low-rank product BA as an intermediate when computing the per-column norms of the adapted weight, leading to prohibitive VRAM usage, especially at high ranks and across many adapted modules. This bottleneck has limited the practical application of DoRA, particularly on single-GPU consumer hardware.
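To make the bottleneck concrete, here is a minimal PyTorch sketch of the straightforward norm computation; the shapes and variable names are illustrative, not taken from the authors' code:

```python
import torch

d_out, d_in, r = 4096, 4096, 64
W0 = torch.randn(d_out, d_in)   # frozen base weight
B = torch.randn(d_out, r)       # low-rank adapter factors
A = torch.randn(r, d_in)

# Naive DoRA norm: the full d_out x d_in product B @ A, and the adapted
# matrix W0 + B @ A, are materialized just to read off per-column norms.
# At 4096 x 4096 in float32 that is roughly 64 MiB of intermediate per
# adapted module, and in a training step it would also be held for autograd.
adapted = W0 + B @ A            # dense d_out x d_in intermediate
col_norms = adapted.norm(dim=0) # shape (d_in,)
```

This dense intermediate is the cost the factored norm described next removes.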
Eliminating the DoRA Memory Wall
The authors introduce a novel 'factored norm' approach that decomposes the squared column-norm calculation into terms computable with much smaller intermediates, avoiding the costly dense BA product. This innovation, coupled with fused Triton kernels that consolidate multiple DoRA operations into a single pass, slashes memory traffic by approximately 4x. The result is a numerically stable implementation that maintains precision even in challenging near-unity rescaling scenarios, a critical property for DoRA's weight decomposition.
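One plausible reading of the factored-norm idea, sketched below in plain PyTorch rather than Triton: expand each squared column norm as ||w_j + (BA)_j||^2 = ||w_j||^2 + 2·w_j^T(BA)_j + ||(BA)_j||^2, so the only intermediates beyond per-column accumulators are an r x d_in matrix and an r x r Gram matrix. The function name and exact factoring are illustrative, not the authors' kernel code.

```python
import torch

def factored_col_norms(W0, B, A, eps=1e-8):
    """Per-column norms of W0 + B @ A without forming the dense product."""
    base = (W0 * W0).sum(dim=0)          # ||W0[:, j]||^2, shape (d_in,)
    BtW0 = B.T @ W0                      # (r, d_in) intermediate, not (d_out, d_in)
    cross = 2.0 * (BtW0 * A).sum(dim=0)  # 2 * W0[:, j]^T B A[:, j]
    gram = B.T @ B                       # (r, r)
    quad = ((gram @ A) * A).sum(dim=0)   # A[:, j]^T (B^T B) A[:, j]
    return torch.sqrt((base + cross + quad).clamp_min(eps))

# Sanity check against the dense reference
d_out, d_in, r = 512, 256, 16
W0, B, A = torch.randn(d_out, d_in), torch.randn(d_out, r), torch.randn(r, d_in)
dense = (W0 + B @ A).norm(dim=0)
assert torch.allclose(factored_col_norms(W0, B, A), dense, rtol=1e-3, atol=1e-4)
```

Expanding the square this way trades the large intermediate for summed terms that can partially cancel, which is presumably where the careful handling of near-unity rescaling mentioned above comes in; the fused Triton kernels would then carry out these reductions and the subsequent magnitude rescaling in a single pass over the data rather than several.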