Vision-language models (VLMs) face a fundamental scaling challenge: projecting an image into thousands of visual tokens creates significant compute and memory overhead during decoder inference. Existing approaches to VLM token reduction rely primarily on a rigid "rank-and-remove" strategy that permanently discards the tokens ranked least important at an early layer. This irreversible step is fragile, because the relevance of a visual token can shift dramatically across decoder layers, particularly for queries that require precise spatial grounding. Reroute, a new training-free plug-in method, addresses this limitation by replacing removal with recoverable routing.
From Pruning to Dynamic Routing
Reroute reframes VLM token reduction by replacing permanent discarding with a dynamic routing mechanism. At each stage of the decoder, the selected visual tokens pass through that stage's computational blocks while the rest are deferred. Deferred tokens are not lost: they re-enter the candidate pool at the next routing decision point. This recoverable approach reuses existing attention-score ranking rules and stage-wise schedules. Crucially, Reroute stays within the same theoretical TFLOPs and KV-cache memory budget class as the pruning method it augments, so the enhancement preserves that method's efficiency.
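The difference between removal and recoverable routing can be illustrated with a toy sketch. This is not the Reroute implementation: the names `prune`, `route`, `top_k`, `score_fn`, and `budgets` are hypothetical, the scores are made-up numbers standing in for an attention-score ranking rule, and the sketch assumes every token can be scored at every stage. It only shows how the candidate pool behaves under each strategy.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for an attention-score ranking rule:
# maps a stage index to one relevance score per visual token.
ScoreFn = Callable[[int], Sequence[float]]


def top_k(candidates: Sequence[int], scores: Sequence[float], k: int) -> List[int]:
    """Return the k highest-scoring candidate token indices, in index order."""
    ranked = sorted(candidates, key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:k])


def prune(num_tokens: int, budgets: Sequence[int], score_fn: ScoreFn) -> List[List[int]]:
    """Rank-and-remove baseline: the pool shrinks, so dropped tokens never return."""
    pool = list(range(num_tokens))
    active = []
    for stage, k in enumerate(budgets):
        pool = top_k(pool, score_fn(stage), k)  # survivors become the next pool
        active.append(pool)
    return active


def route(num_tokens: int, budgets: Sequence[int], score_fn: ScoreFn) -> List[List[int]]:
    """Recoverable routing: deferred tokens stay in the pool for the next stage."""
    pool = list(range(num_tokens))  # never shrinks
    active = []
    for stage, k in enumerate(budgets):
        active.append(top_k(pool, score_fn(stage), k))
    return active


# Toy scores (illustrative only): token 2 is irrelevant at stage 0
# but becomes important at stage 1.
stage_scores = [
    [0.9, 0.8, 0.1, 0.2],
    [0.1, 0.9, 0.8, 0.2],
]
budgets = [2, 2]  # hypothetical stage-wise schedule: tokens kept per stage


def score_fn(stage: int) -> Sequence[float]:
    return stage_scores[stage]


pruned = prune(4, budgets, score_fn)
routed = route(4, budgets, score_fn)
print(pruned)  # [[0, 1], [0, 1]]
print(routed)  # [[0, 1], [1, 2]]
```

In this toy run both strategies process two tokens per stage, so the per-stage token count is identical, but only routing lets token 2 return at stage 1 once its score rises.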