AdaCodec: Efficient Video MLLM Encoding

AdaCodec applies predictive visual coding to video MLLMs, cutting visual token counts and latency while outperforming a baseline that uses seven times the token budget.

Figure: AdaCodec's adaptive approach to visual tokenization for video MLLMs.

The inherent temporal redundancy in video, where adjacent frames largely overlap, presents a fundamental inefficiency for current video multimodal large language models (video MLLMs). These models typically process each sampled frame as an independent image, leading to redundant visual tokens and inflated computational costs. A new approach, detailed on arXiv, challenges this paradigm by proposing a more dynamic and efficient video interface.

Visual TL;DR. Video MLLM inefficiency, rooted in temporal redundancy, is the problem AdaCodec addresses. AdaCodec uses predictive visual coding, which leads to selective frame encoding and generates compact P-tokens. Both reduce the token count, which in turn enables efficiency gains and superior performance.


  1. Video MLLM Inefficiency: processing adjacent frames as independent images leads to redundant tokens
  2. Temporal Redundancy: adjacent video frames largely overlap, causing inflated computational costs
  3. AdaCodec Introduced: a new dynamic and efficient video interface for MLLMs
  4. Predictive Visual Coding: intelligently manages visual token transmission based on scene prediction
  5. Selective Frame Encoding: transmits full reference frames only when scene prediction is unreliable
  6. Compact P-tokens: encodes inter-frame changes like motion and prediction residuals
  7. Reduced Token Count: significantly minimizes visual tokens required for video understanding
  8. Efficiency Gains: drastically cuts tokenization costs and latency for video MLLMs
  9. Superior Performance: achieves better results at a fraction of the computational budget

Predictive Visual Coding for Reduced Redundancy

The core idea is a 'predictive visual code' that decides, frame by frame, how much visual information to send to the model. Rather than fully encoding every frame, the system, instantiated as AdaCodec, transmits a full reference frame only when scene prediction is unreliable. Otherwise, it encodes the inter-frame changes (motion and prediction residuals) as compact 'P-tokens'. Because adjacent frames largely overlap, this adaptive strategy reduces the number of visual tokens required for video understanding.
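To make the selection rule concrete, here is a minimal Python sketch of the general idea, not AdaCodec's actual implementation. Everything in it is an assumption made for illustration: the `encode` function, the `Packet` type, the "previous frame equals next frame" predictor, the mean-absolute-difference reliability test, the threshold, and the token costs (256 per reference frame, 16 per P-token packet) do not come from the article.

```python
from dataclasses import dataclass
from typing import List, Optional

Frame = List[float]  # a frame flattened to pixel intensities in [0, 1]


@dataclass
class Packet:
    kind: str    # "I" = full reference frame, "P" = compact change code
    tokens: int  # visual tokens this packet costs (illustrative numbers)


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average per-pixel prediction error between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def encode(frames: List[Frame],
           full_tokens: int = 256,    # assumed cost of a reference frame
           p_tokens: int = 16,        # assumed cost of a P-token packet
           threshold: float = 0.1) -> List[Packet]:
    """Send a full frame only when the prediction from the previous frame is poor."""
    packets: List[Packet] = []
    previous: Optional[Frame] = None
    for frame in frames:
        # Hypothetical predictor: assume the scene is unchanged since the last frame.
        unreliable = previous is None or mean_abs_diff(frame, previous) > threshold
        if unreliable:
            packets.append(Packet("I", full_tokens))
        else:
            packets.append(Packet("P", p_tokens))
        previous = frame
    return packets


# Three near-identical frames, then a scene cut, then one more similar frame.
static = [0.2] * 64
cut = [0.9] * 64
frames = [static, static, static, cut, cut]

packets = encode(frames)
kinds = [p.kind for p in packets]
total = sum(p.tokens for p in packets)
per_frame_total = 256 * len(frames)  # cost if every frame were encoded in full

print(kinds)                   # ['I', 'P', 'P', 'I', 'P']
print(total, per_frame_total)  # 560 1280
```

In this toy run, only the first frame and the scene cut pay the full token cost. A real system would use a learned or motion-compensated predictor and would carry actual residual information in the P-tokens rather than a fixed-size placeholder.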

Substantial Gains in Efficiency and Performance

AdaCodec improves on the baseline Qwen3-VL-8B model across eleven benchmarks. At one-seventh of the token budget (32k tokens versus 224k), it outperforms the baseline on all long-video benchmarks. On general-video benchmarks, it raises average scores and cuts time-to-first-token from 9.26s to 1.62s. Latency at this level makes real-time video analysis and interaction considerably more feasible.
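The reported figures can be turned into ratios with a few lines of Python. This is plain arithmetic on the numbers quoted above, assuming "32k" and "224k" mean 32,000 and 224,000 tokens; note that the token-budget comparison is reported for long-video benchmarks and the latency figures for general-video benchmarks, so the two ratios may not describe the same configuration.

```python
# Token budgets quoted for the long-video comparison (assuming k = 1,000).
baseline_tokens = 224_000
adacodec_tokens = 32_000
budget_fraction = adacodec_tokens / baseline_tokens

# Time-to-first-token quoted for general-video benchmarks, in seconds.
ttft_baseline_s = 9.26
ttft_adacodec_s = 1.62
speedup = ttft_baseline_s / ttft_adacodec_s
reduction = 1 - ttft_adacodec_s / ttft_baseline_s

print(f"token budget fraction: {budget_fraction:.3f}")  # 0.143, i.e. 1/7
print(f"TTFT speedup: {speedup:.2f}x")                  # 5.72x
print(f"TTFT reduction: {reduction:.1%}")               # 82.5%
```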

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.