Current latent world models excel at short-term video prediction but falter at capturing long-horizon semantics because of their limited temporal context. Conversely, vision-language models (VLMs) offer rich semantic understanding but are hindered by the sparse frame sampling that compute budgets force on them and by a language-output bottleneck that compresses away fine-grained interactions. This leaves a gap for applications that need both detailed motion forecasting and broad semantic reasoning.
Dual-Pathway Fusion: Dense Dynamics Meets Semantic Guidance
To address this, researchers propose a VLM-guided, JEPA-style latent world modeling framework that combines dense-frame dynamics modeling with long-horizon semantic guidance through a dual-temporal-pathway design. A dense JEPA branch handles fine-grained motion and interaction cues, which are crucial for tracking immediate physical state. Concurrently, a uniformly sampled VLM 'thinker' branch operates at a larger temporal stride, injecting knowledge-rich semantic guidance into the longer-horizon predictions. A sketch of this two-branch layout appears below.
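To make the two-branch structure concrete, here is a minimal PyTorch sketch under stated assumptions: the module names (`DenseJEPABranch`, `VLMThinkerBranch`, `DualPathwayWorldModel`), dimensions, and strides are all illustrative rather than taken from the paper, the VLM is reduced to a small stand-in encoder, and the JEPA objective is simplified to a detached next-latent MSE instead of the EMA target encoder a full JEPA setup would use.

```python
import torch
import torch.nn as nn

# Hypothetical sizes; the actual dimensions are not specified in the summary.
LATENT_DIM = 256      # JEPA latent size (assumption)
SEM_DIM = 512         # VLM semantic embedding size (assumption)
SPARSE_STRIDE = 8     # VLM branch samples uniformly at a larger stride (assumption)


class DenseJEPABranch(nn.Module):
    """Encodes every frame into a latent and predicts the next latent,
    conditioned on the VLM guidance vector."""
    def __init__(self, frame_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(frame_dim, LATENT_DIM), nn.GELU())
        self.predictor = nn.GRU(LATENT_DIM + SEM_DIM, LATENT_DIM, batch_first=True)

    def forward(self, frames, guidance):
        # frames: (B, T, frame_dim); guidance: (B, SEM_DIM)
        z = self.encoder(frames)                               # (B, T, LATENT_DIM)
        g = guidance.unsqueeze(1).expand(-1, z.size(1), -1)    # broadcast over time
        pred, _ = self.predictor(torch.cat([z, g], dim=-1))    # (B, T, LATENT_DIM)
        return z, pred


class VLMThinkerBranch(nn.Module):
    """Stand-in for the VLM 'thinker': pools sparsely sampled frames into one
    semantic guidance vector. A real system would run a pretrained VLM here."""
    def __init__(self, frame_dim: int):
        super().__init__()
        self.proj = nn.Linear(frame_dim, SEM_DIM)

    def forward(self, frames):
        sparse = frames[:, ::SPARSE_STRIDE]      # uniform sampling, larger stride
        return self.proj(sparse).mean(dim=1)     # (B, SEM_DIM)


class DualPathwayWorldModel(nn.Module):
    def __init__(self, frame_dim: int = 1024):
        super().__init__()
        self.jepa = DenseJEPABranch(frame_dim)
        self.thinker = VLMThinkerBranch(frame_dim)

    def forward(self, frames):
        guidance = self.thinker(frames)            # long-horizon semantics
        z, pred = self.jepa(frames, guidance)      # dense per-frame dynamics
        # JEPA-style objective: the prediction at step t should match the
        # latent at t+1. Detaching the target is a simplification of the
        # EMA target encoder used in standard JEPA training.
        return nn.functional.mse_loss(pred[:, :-1], z[:, 1:].detach())


model = DualPathwayWorldModel()
frames = torch.randn(2, 32, 1024)   # (batch, time, flattened frame features)
print(model(frames))                # scalar latent-prediction loss
```

The key design point the sketch illustrates is the fusion mechanism: the guidance vector from the sparse branch conditions the dense predictor at every step, so long-horizon semantics can steer fine-grained latent rollouts without the VLM having to process every frame.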