The inherent temporal redundancy in video, where adjacent frames largely overlap, presents a fundamental inefficiency for current video multimodal large language models (video MLLMs). These models typically process each sampled frame as an independent image, leading to redundant visual tokens and inflated computational costs. A new approach, detailed on arXiv, challenges this paradigm by proposing a more dynamic and efficient video interface.
Predictive Visual Coding for Reduced Redundancy
The core innovation is a 'predictive visual code' that controls which visual tokens are passed to the model. Instead of fully encoding every frame, the system, instantiated as AdaCodec, transmits a full reference frame only when scene prediction is unreliable. Otherwise, it encodes the inter-frame changes (motion and prediction residuals) as compact 'P-tokens'. This adaptive strategy reduces the number of visual tokens required for video understanding.
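The selection rule described above can be illustrated with a toy sketch. This is not AdaCodec's implementation: the error metric (mean absolute difference against the previous frame), the threshold, and the per-frame token counts are all assumptions chosen only to show the shape of the decision.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[float]  # toy stand-in for a flattened frame or feature map


def prediction_error(prev: Frame, cur: Frame) -> float:
    """Mean absolute difference, using the previous frame as the prediction."""
    return sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)


def plan_tokens(
    frames: List[Frame],
    threshold: float = 0.25,  # hypothetical reliability cutoff
    ref_tokens: int = 256,    # hypothetical cost of a full reference frame
    p_tokens: int = 16,       # hypothetical cost of a P-token frame
) -> List[Tuple[str, int]]:
    """Label each frame as a full reference ("I") or a change-only frame ("P")."""
    plan: List[Tuple[str, int]] = []
    prev = None
    for frame in frames:
        # Send a full frame at the start, or when prediction is unreliable.
        if prev is None or prediction_error(prev, frame) > threshold:
            plan.append(("I", ref_tokens))
        else:
            plan.append(("P", p_tokens))
        prev = frame
    return plan


# Two near-static frames, a scene cut, then another near-static frame.
frames = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.1],
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 0.9, 1.0],
]
plan = plan_tokens(frames)
kinds = [kind for kind, _ in plan]
adaptive_total = sum(cost for _, cost in plan)
dense_total = len(frames) * 256  # every frame encoded in full
print(kinds, adaptive_total, dense_total)
```

In this toy run only the first frame and the scene cut are sent in full, so the token count drops well below the encode-every-frame total. The real system's prediction and token formats are described in the paper, not here.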
Substantial Gains in Efficiency and Performance
AdaCodec improves on the baseline Qwen3-VL-8B model across eleven benchmarks. With a budget of 32k tokens, one-seventh of the baseline's 224k, AdaCodec outperforms the baseline on all long-video benchmarks. On general-video benchmarks, it raises average scores and cuts time-to-first-token from 9.26s to 1.62s. Latency at this level makes real-time video analysis and interaction far more feasible.
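The reported figures can be turned into ratios with a few lines of arithmetic. The inputs below are the numbers quoted above; the variable names are only illustrative.

```python
# Figures as reported in the article (token budgets in thousands of tokens).
baseline_tokens_k = 224
adacodec_tokens_k = 32
baseline_ttft_s = 9.26
adacodec_ttft_s = 1.62

# How many times smaller the token budget is.
token_ratio = baseline_tokens_k / adacodec_tokens_k

# How many times faster the first token arrives, and the relative latency cut.
ttft_speedup = baseline_ttft_s / adacodec_ttft_s
ttft_reduction_pct = 100 * (baseline_ttft_s - adacodec_ttft_s) / baseline_ttft_s

print(f"token budget: {token_ratio:.0f}x smaller")
print(f"time-to-first-token: {ttft_speedup:.1f}x faster "
      f"({ttft_reduction_pct:.1f}% lower)")
```

So the one-seventh token budget corresponds to a time-to-first-token that is roughly 5.7 times shorter, or about 82.5% lower, on the general-video benchmarks.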