Current latent world models excel at short-term video prediction but falter on long-horizon semantics because of their limited temporal context. Conversely, vision-language models (VLMs) offer rich semantic understanding but are hindered by sparse frame sampling (imposed by compute budgets) and a language-output bottleneck that compresses away fine-grained interactions. This leaves a gap for applications that require both detailed motion forecasting and broad semantic reasoning.
Dual-Pathway Fusion: Dense Dynamics Meets Semantic Guidance
To address this, researchers propose a VLM-guided, JEPA-style latent world modeling framework that combines dense-frame dynamics modeling with long-horizon semantic guidance. The design rests on a dual-temporal pathway, sketched in code below. A dense JEPA branch handles fine-grained motion and interaction cues, which are crucial for tracking immediate physical state. Concurrently, a uniformly sampled VLM 'thinker' branch operates at a larger temporal stride, providing knowledge-rich guidance that informs longer-term predictions.
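A minimal sketch of this division of labor is below. The module names, strides, encoder choices, and additive fusion are illustrative assumptions rather than the paper's released implementation: a dense branch encodes every frame, while the sparse VLM guidance (one feature per temporal stride) is projected into the latent space and broadcast back onto the dense timeline before latent prediction.

```python
# Hypothetical sketch of the dual-temporal pathway. All dimensions, the
# stride, and the additive fusion are assumptions for illustration.
import torch
import torch.nn as nn

class DualPathwayWorldModel(nn.Module):
    def __init__(self, latent_dim=256, vlm_dim=1024, vlm_stride=8):
        super().__init__()
        self.vlm_stride = vlm_stride  # assumed temporal stride of the VLM branch
        # Dense JEPA branch: per-frame encoder + latent predictor (stand-ins).
        self.frame_encoder = nn.Linear(3 * 64 * 64, latent_dim)
        self.predictor = nn.GRU(latent_dim, latent_dim, batch_first=True)
        # Projects VLM guidance features into the JEPA latent space.
        self.guidance_proj = nn.Linear(vlm_dim, latent_dim)

    def forward(self, frames, vlm_features):
        # frames: (B, T, C, H, W) dense video
        # vlm_features: (B, T // vlm_stride, vlm_dim) sparse semantic guidance
        B, T = frames.shape[:2]
        z = self.frame_encoder(frames.flatten(2))   # (B, T, latent_dim)
        # Upsample sparse guidance to the dense timeline by repetition.
        g = self.guidance_proj(vlm_features)        # (B, T // stride, latent_dim)
        g = g.repeat_interleave(self.vlm_stride, dim=1)[:, :T]
        # Fuse semantic guidance into dense dynamics before latent prediction.
        z_pred, _ = self.predictor(z + g)
        return z_pred                               # predicted next-step latents

model = DualPathwayWorldModel()
frames = torch.randn(2, 16, 3, 64, 64)
vlm_feats = torch.randn(2, 2, 1024)    # 16 frames / stride 8 = 2 VLM steps
print(model(frames, vlm_feats).shape)  # torch.Size([2, 16, 256])
```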
Hierarchical VLM Representation for Effective Guidance
A key innovation is a hierarchical pyramid representation extraction module, which aggregates multi-layer VLM representations into guidance features. This ensures that the VLM's progressive reasoning signals are transferred efficiently and remain compatible with the latent prediction objectives of the JEPA component; a sketch follows below. In experiments on hand-manipulation trajectory prediction reported in the arXiv paper, this integrated approach outperforms both VLM-only and JEPA-predictor baselines and exhibits superior long-horizon rollout robustness.
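To make the pyramid idea concrete, here is a minimal sketch of aggregating hidden states tapped from several VLM layers. The number of tapped layers, the per-layer projections, and the learned softmax weighting are assumptions for illustration, not the paper's exact module; the underlying idea is that shallow and deep VLM layers carry different stages of reasoning, and a learned weighting fuses them into one guidance feature sized for the JEPA latent space.

```python
# Hypothetical sketch of hierarchical pyramid aggregation over VLM layers.
# Layer count, pooling, and the learned weighting are illustrative assumptions.
import torch
import torch.nn as nn

class PyramidGuidanceExtractor(nn.Module):
    def __init__(self, num_layers=4, vlm_dim=1024, guide_dim=256):
        super().__init__()
        # One projection per tapped VLM layer, plus learned mixing weights,
        # so shallow (perceptual) and deep (reasoning) layers both contribute.
        self.projs = nn.ModuleList(
            nn.Linear(vlm_dim, guide_dim) for _ in range(num_layers)
        )
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states):
        # hidden_states: list of (B, T, vlm_dim) tensors from selected VLM layers.
        w = torch.softmax(self.layer_logits, dim=0)
        guided = [proj(h) for proj, h in zip(self.projs, hidden_states)]
        # Weighted sum yields a single guidance feature per VLM timestep.
        return sum(wi * gi for wi, gi in zip(w, guided))  # (B, T, guide_dim)

extractor = PyramidGuidanceExtractor()
states = [torch.randn(2, 2, 1024) for _ in range(4)]
print(extractor(states).shape)  # torch.Size([2, 2, 256])
```

The output of this extractor would play the role of `vlm_features` in the dual-pathway sketch above, closing the loop between the VLM 'thinker' branch and the dense JEPA predictor.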