Vision-language models (VLMs) face a fundamental scaling challenge: projecting an image into thousands of visual tokens creates significant computational and memory overhead during decoder inference. Existing approaches to VLM token reduction rely primarily on a rigid "rank-and-remove" strategy that permanently discards tokens deemed less important early on. This irreversible step is fragile, because the relevance of a visual token can shift dramatically across decoder layers, particularly for queries requiring precise spatial grounding. Reroute, a new training-free plug-in method, addresses this limitation by shifting from removal to recoverable routing.
From Pruning to Dynamic Routing
Reroute redefines VLM token reduction by replacing permanent discarding with a dynamic routing mechanism. At each stage of the decoder, the selected visual tokens proceed through the computational blocks, while the others are deferred. Deferred tokens are not lost; they re-enter the candidate pool at the next routing decision point. This recoverable approach reuses existing attention-score ranking rules and stage-wise schedules. Crucially, Reroute stays within the same theoretical TFLOPs and KV-cache memory budget class as the pruning method it augments, so the improvement does not come at the cost of efficiency.
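The difference between the two strategies can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names (top_k, prune_schedule, route_schedule), the toy scores, and the assumption that every token has a ranking score at every stage are all hypothetical, chosen only to show how the candidate pool behaves.

```python
from typing import List, Sequence


def top_k(candidates: Sequence[int], scores: Sequence[float], k: int) -> List[int]:
    """Return the k highest-scoring candidate indices, sorted by index."""
    ranked = sorted(candidates, key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def prune_schedule(stage_scores: List[List[float]], keep_counts: List[int]) -> List[List[int]]:
    """Rank-and-remove: a token dropped at one stage can never return."""
    pool = list(range(len(stage_scores[0])))
    active_per_stage = []
    for scores, k in zip(stage_scores, keep_counts):
        kept = top_k(pool, scores, k)
        active_per_stage.append(kept)
        pool = kept  # the pool shrinks to the survivors
    return active_per_stage


def route_schedule(stage_scores: List[List[float]], keep_counts: List[int]) -> List[List[int]]:
    """Recoverable routing (illustrative): deferred tokens stay in the pool."""
    pool = list(range(len(stage_scores[0])))
    active_per_stage = []
    for scores, k in zip(stage_scores, keep_counts):
        selected = top_k(pool, scores, k)
        active_per_stage.append(selected)
        # The pool is left unchanged, so deferred tokens are reconsidered
        # at the next routing decision point.
    return active_per_stage


# Hypothetical scores for 4 visual tokens at 2 stages; token 2 only
# becomes relevant at the second stage.
toy_scores = [
    [0.9, 0.8, 0.1, 0.2],
    [0.1, 0.9, 0.8, 0.2],
]
toy_keep = [2, 2]  # same per-stage token budget for both strategies

print(prune_schedule(toy_scores, toy_keep))  # [[0, 1], [0, 1]]
print(route_schedule(toy_scores, toy_keep))  # [[0, 1], [1, 2]]
```

Both schedules process two tokens per stage, which mirrors the claim that the budget class is unchanged. Under pruning, token 2 is gone after the first stage. Under routing, it is picked up once its score rises.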
Enhanced Grounding Without Performance Sacrifice
The practical implications of Reroute are significant. Applied to token-reduction methods such as FastV, PDrop, and Nüwa on LLaVA-1.5 and Qwen backbones, the Reroute plug-in markedly improves grounding under aggressive token reduction while maintaining general Visual Question Answering (VQA) performance. The findings from the arXiv paper suggest that the future of efficient VLM operation lies not in irreversible pruning, but in intelligent, recoverable routing of visual tokens.