Rethinking VLM Token Reduction

Vision-language models (VLMs) face a fundamental scaling challenge: projecting an image into thousands of visual tokens creates significant computational and memory overhead during decoder inference. Existing approaches to VLM token reduction rely primarily on a rigid "rank-and-remove" strategy that permanently discards tokens deemed less important early on. This irreversible step is fragile, because the relevance of a visual token can shift dramatically across decoder layers, particularly for queries that require precise spatial grounding. Reroute, a new training-free plug-in method, addresses this limitation by shifting from removal to recoverable routing.

Visual TL;DR. VLM Token Overhead motivates Rigid Pruning. Rigid Pruning is fragile under Token Relevance Shifts. Token Relevance Shifts are addressed by the Reroute Method. The Reroute Method introduces Dynamic Routing. Dynamic Routing enables Recoverable Routing. Recoverable Routing results in Improved Grounding. Improved Grounding comes with No Performance Sacrifice.

VLM Token Overhead: projecting images into thousands of visual tokens creates significant overhead
Rigid Pruning: permanently discarding tokens deemed less important early on
Token Relevance Shifts: relevance of visual tokens can shift dramatically across different decoder layers
Reroute Method: training-free plug-in method offering a paradigm shift from removal
Dynamic Routing: replaces permanent discarding with a dynamic routing mechanism
Recoverable Routing: transforms token reduction from irreversible pruning to recoverable routing
Improved Grounding: improving grounding performance without sacrificing efficiency
No Performance Sacrifice: maintaining efficiency while enhancing grounding capabilities


From Pruning to Dynamic Routing

Reroute redefines VLM token reduction by replacing permanent discarding with a dynamic routing mechanism. At each stage of the decoder, the selected visual tokens proceed through that stage's computational blocks, while the rest are deferred. Deferred tokens are not lost; they re-enter the candidate pool and are considered again at the next routing decision point. This recoverable approach reuses the attention-score ranking rules and stage-wise schedules of existing pruning methods. Crucially, Reroute stays within the same theoretical TFLOPs and KV-cache memory budget class as the pruning methods it augments, so the improvement comes without giving up their efficiency.
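The contrast between rank-and-remove and recoverable routing can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the Reroute implementation: the names `route_stages`, `toy_scores`, and `RELEVANCE` are hypothetical, the per-stage relevance table stands in for real attention scores, and the sketch only tracks which token indices are selected at each stage. It does not model how hidden states of deferred tokens are handled.

```python
from typing import Callable, List, Sequence

# Hypothetical per-stage relevance scores for 6 visual tokens.
# Relevance shifts between stage 0 and stage 1, as the text describes.
RELEVANCE = [
    [0.9, 0.8, 0.7, 0.1, 0.2, 0.3],  # stage 0 favours tokens 0, 1, 2
    [0.1, 0.2, 0.3, 0.9, 0.8, 0.7],  # stage 1 favours tokens 3, 4, 5
]


def toy_scores(stage: int, candidates: Sequence[int]) -> List[float]:
    """Stand-in for an attention-score ranking rule (assumed, not from the paper)."""
    return [RELEVANCE[stage][i] for i in candidates]


def route_stages(
    num_tokens: int,
    keep_schedule: Sequence[int],
    score_fn: Callable[[int, Sequence[int]], List[float]],
    recoverable: bool = True,
) -> List[List[int]]:
    """Return the token indices that proceed through each decoder stage.

    recoverable=False: rank-and-remove, unselected tokens are dropped for good.
    recoverable=True:  deferred tokens return to the candidate pool.
    """
    candidates = list(range(num_tokens))
    processed: List[List[int]] = []
    for stage, keep in enumerate(keep_schedule):
        scores = score_fn(stage, candidates)
        # Rank candidates by score, highest first, and keep the top `keep`.
        ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
        selected = sorted(idx for idx, _ in ranked[:keep])
        processed.append(selected)
        if not recoverable:
            # Pruning: later stages can only choose among the survivors.
            candidates = selected
        # Recoverable routing: `candidates` is left unchanged, so deferred
        # tokens are considered again at the next routing decision.
    return processed


if __name__ == "__main__":
    schedule = [3, 3]  # tokens processed per stage (same for both modes)
    print("prune  :", route_stages(6, schedule, toy_scores, recoverable=False))
    print("reroute:", route_stages(6, schedule, toy_scores, recoverable=True))
```

Both modes process three tokens per stage, so the per-stage compute is the same in this toy setting. The difference is that the pruning run is stuck with tokens 0, 1, 2 at the second stage, while the recoverable run can pick up tokens 3, 4, 5 once they become relevant.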
