AdaCodec: Efficient Video MLLM Encoding

AdaCodec applies predictive visual coding to video MLLMs, cutting visual token counts and time-to-first-token while outperforming the Qwen3-VL-8B baseline at one seventh of the token budget.

Jun 2 at 8:01 PM · 7 min read

Figure: AdaCodec's adaptive visual tokenization strategy for video MLLMs.

Visual TL;DR. Video MLLM Inefficiency is the problem AdaCodec addresses, and it shows up as Temporal Redundancy. AdaCodec uses Predictive Visual Coding, which leads to Selective Frame Encoding and generates Compact P-tokens. Both contribute to a Reduced Token Count, which in turn enables Efficiency Gains and Superior Performance.

Video MLLM Inefficiency: processing adjacent frames as independent images leads to redundant tokens
Temporal Redundancy: adjacent video frames largely overlap, causing inflated computational costs
AdaCodec Introduced: a new dynamic and efficient video interface for MLLMs
Predictive Visual Coding: intelligently manages visual token transmission based on scene prediction
Selective Frame Encoding: transmits full reference frames only when scene prediction is unreliable
Compact P-tokens: encodes inter-frame changes like motion and prediction residuals
Reduced Token Count: significantly minimizes visual tokens required for video understanding
Efficiency Gains: drastically cuts tokenization costs and latency for video MLLMs
Superior Performance: achieves better results at a fraction of the computational budget


The inherent temporal redundancy in video, where adjacent frames largely overlap, presents a fundamental inefficiency for current video multimodal large language models (video MLLMs). These models typically process each sampled frame as an independent image, leading to redundant visual tokens and inflated computational costs. A new approach, detailed on arXiv, challenges this paradigm by proposing a more dynamic and efficient video interface.

Predictive Visual Coding for Reduced Redundancy

The core innovation lies in a 'predictive visual code' that intelligently manages visual token transmission. Instead of encoding every frame fully, this system, instantiated as AdaCodec, selectively transmits a full reference frame only when scene prediction is unreliable. Otherwise, it encodes inter-frame changes, encompassing motion and prediction residuals, using compact 'P-tokens'. This adaptive strategy significantly minimizes the number of visual tokens required for video understanding.
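To make the reference-versus-P-token decision concrete, here is a minimal sketch of that kind of control flow. It is not AdaCodec's implementation: the predictor (simply reusing the last reference frame), the error metric, the threshold, and the per-frame token costs are all hypothetical values chosen for illustration, and frames are toy lists of pixel intensities rather than real video.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence

Frame = Sequence[float]  # toy frame: flat list of pixel intensities in [0, 1]

# All three constants are hypothetical, not taken from the AdaCodec paper.
FULL_FRAME_TOKENS = 256  # assumed cost of sending a full reference frame
P_FRAME_TOKENS = 16      # assumed cost of compact P-tokens for one frame
ERROR_THRESHOLD = 0.10   # assumed cutoff above which prediction is "unreliable"


@dataclass
class Packet:
    kind: str    # "ref" for a full reference frame, "p" for P-tokens
    tokens: int  # visual tokens spent on this frame


def prediction_error(predicted: Frame, actual: Frame) -> float:
    """Mean absolute difference between the predicted and the actual frame."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def plan_stream(frames: List[Frame]) -> List[Packet]:
    """Choose, frame by frame, between a full reference and compact P-tokens."""
    packets: List[Packet] = []
    reference: Optional[Frame] = None
    for frame in frames:
        # Crude predictor: assume the scene still looks like the last reference.
        if reference is None or prediction_error(reference, frame) > ERROR_THRESHOLD:
            packets.append(Packet("ref", FULL_FRAME_TOKENS))
            reference = frame  # later frames are predicted from this one
        else:
            packets.append(Packet("p", P_FRAME_TOKENS))
    return packets


# Three near-identical frames, then a scene cut, then one more static frame.
clip = [[0.2] * 4, [0.2] * 4, [0.2] * 4, [0.9] * 4, [0.9] * 4]
plan = plan_stream(clip)
print([p.kind for p in plan])          # ['ref', 'p', 'p', 'ref', 'p']
print(sum(p.tokens for p in plan))     # 560
print(len(clip) * FULL_FRAME_TOKENS)   # 1280 if every frame were sent in full
```

In this toy clip only the first frame and the scene cut pay the full cost. A real system would also have to produce the P-token content itself (motion and prediction residuals), which this sketch reduces to a fixed token count.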

Substantial Gains in Efficiency and Performance

AdaCodec improves on the baseline Qwen3-VL-8B model across eleven benchmarks. With a 32k-token budget, one seventh of the baseline's 224k, it outperforms the baseline on all long-video benchmarks. On general-video benchmarks, it raises average scores and cuts time-to-first-token from 9.26s to 1.62s, roughly a 5.7x reduction. That gap makes real-time video analysis and interaction far more feasible.
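The headline ratios can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in this article and assumes "32k" and "224k" mean 32,000 and 224,000 tokens (the ratio is the same if they are multiples of 1,024). Variable names are illustrative. The token budgets are reported for long-video benchmarks and the latency figures for general-video benchmarks, so the two ratios should not be read as describing the same run.

```python
# Figures as quoted in the article; the "k = 1,000" reading is an assumption.
baseline_tokens = 224_000
adacodec_tokens = 32_000

ttft_baseline_s = 9.26   # time-to-first-token, baseline
ttft_adacodec_s = 1.62   # time-to-first-token, AdaCodec

budget_ratio = adacodec_tokens / baseline_tokens
speedup = ttft_baseline_s / ttft_adacodec_s
latency_reduction_pct = 100 * (1 - ttft_adacodec_s / ttft_baseline_s)

print(f"token budget ratio: {budget_ratio:.3f}")             # 0.143, i.e. 1/7
print(f"TTFT speedup:       {speedup:.2f}x")                  # 5.72x
print(f"TTFT reduction:     {latency_reduction_pct:.1f}%")    # 82.5%
```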

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.

#AI Research #Video Understanding #LLM Efficiency #Computer Vision