The escalating computational demands of Vision-Language Models (VLMs), driven by massive visual token processing, present a critical bottleneck for scalability. Existing training-aware pruning techniques often falter under aggressive compression due to their reliance on continuous approximations for an inherently discrete problem.
Unlocking Discrete Optimization with Reinforcement Learning
To circumvent a key limitation of gradient-based methods, which frequently trap optimization in poor local minima, the GRIP-VLM framework introduces a novel approach: it formulates visual token pruning as a Markov Decision Process and optimizes it with a Group Relative Policy Optimization (GRPO) paradigm. This RL-driven strategy, augmented by a supervised warm-up phase, navigates the discrete search space directly, enabling more effective and less constrained pruning decisions and marking a significant departure from prior approaches to VLM pruning.
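The defining trait of GRPO is that it replaces a learned value critic with a group-relative baseline: rewards for a group of sampled actions (here, candidate pruning masks for the same input) are normalized against the group's own mean and standard deviation. A minimal sketch of that advantage computation, with illustrative reward values (the function name and rewards are assumptions, not taken from the paper):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage estimate: normalize each sampled action's
    reward against its own group's mean and std, so no separate value
    critic needs to be trained."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Toy example: task rewards for G=4 pruning masks sampled for one image.
# Masks scoring above the group mean get positive advantage (reinforced),
# those below get negative advantage (discouraged).
adv = group_relative_advantages([0.82, 0.75, 0.90, 0.60])
```

Because the baseline is computed per group, the policy gradient depends only on how each discrete pruning decision ranks relative to its siblings, which is what lets the method operate in the discrete search space without continuous relaxations.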
Adaptive Pruning for Unprecedented Efficiency
GRIP-VLM's architecture features a lightweight agent equipped with a budget-aware scorer. The agent dynamically assesses the importance of each visual token and adapts to any compression ratio without a full retraining cycle. Extensive evaluations across diverse multimodal benchmarks confirm GRIP-VLM's superiority over heuristic and supervised baselines: the framework consistently achieves a more favorable accuracy-efficiency Pareto frontier, delivering up to a 15% inference speedup while maintaining accuracy.
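The reason a single trained scorer can serve any compression ratio is that the budget only determines how many top-scoring tokens survive; the per-token scores themselves do not change. A minimal sketch of that inference-time selection step, assuming the scores have already been produced by the agent (the function name and the toy data are illustrative, not from the paper):

```python
import numpy as np

def prune_tokens(tokens, scores, keep_ratio):
    """Keep only the top-scoring fraction of visual tokens.

    tokens:     (N, D) array of visual token embeddings
    scores:     (N,) per-token importance from the budget-aware scorer
    keep_ratio: fraction of tokens to retain; changing it needs no retraining,
                since it only changes the top-k cutoff on fixed scores.
    """
    n = tokens.shape[0]
    k = max(1, int(np.ceil(keep_ratio * n)))
    kept_idx = np.argsort(scores)[-k:]   # indices of the k highest scores
    kept_idx.sort()                      # preserve original token order
    return tokens[kept_idx], kept_idx

# 8 dummy tokens with hypothetical importance scores; keep half of them.
tokens = np.arange(8, dtype=float).reshape(8, 1)
scores = np.array([0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.05, 0.6])
kept, idx = prune_tokens(tokens, scores, keep_ratio=0.5)
```

With `keep_ratio=0.5` the four highest-scoring tokens (indices 1, 3, 5, 7) survive; lowering the ratio at deployment time simply tightens the same cutoff, which is the mechanism behind the speedup-versus-accuracy trade-off on the Pareto frontier.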