Current unified world models struggle to bridge the gap between efficient real-time robotic action and high-fidelity 4D world synthesis. Existing approaches often remain confined to 2D pixel spaces, compromising either action execution speed or the quality of spatial understanding.
Unifying Action and 4D Synthesis with X-WAM
X-WAM is a unified 4D world model that addresses this dichotomy within a single framework, integrating real-time robotic action execution with high-fidelity 4D world synthesis spanning both video generation and 3D reconstruction. The architecture leverages the strong visual priors of pretrained video diffusion models to predict future world states. A key innovation is its efficient spatial information extraction via a lightweight structural adaptation: the final blocks of a pretrained Diffusion Transformer are replicated into a dedicated depth prediction branch. This design reconstructs future spatial information without the prohibitive computational overhead of full 3D modeling.
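The source gives no implementation details for this branching scheme, so the following is only a minimal, hypothetical sketch of the idea in NumPy: a stack of simplified stand-in "DiT blocks" is split into a shared trunk, and the final blocks are deep-copied to seed a separate depth branch. `Block`, `DiT`, and `make_depth_branch` are illustrative names, not X-WAM's API, and each block is reduced to a single linear layer with a GELU rather than a real transformer block.

```python
import copy
import numpy as np

rng = np.random.default_rng(0)

class Block:
    """Stand-in for one Diffusion Transformer block (hypothetical:
    a single linear layer + tanh-approximated GELU)."""
    def __init__(self, dim):
        self.w = rng.standard_normal((dim, dim)) / np.sqrt(dim)

    def __call__(self, x):
        h = x @ self.w
        # tanh approximation of GELU
        return 0.5 * h * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))

class DiT:
    """Pretrained backbone: a stack of blocks."""
    def __init__(self, dim=64, depth=8):
        self.blocks = [Block(dim) for _ in range(depth)]

def make_depth_branch(dit, n_last=2):
    """Seed a depth branch by deep-copying the final n_last backbone blocks,
    so the branch starts from the pretrained weights but can be tuned independently."""
    return [copy.deepcopy(b) for b in dit.blocks[-n_last:]]

dit = DiT()
depth_branch = make_depth_branch(dit, n_last=2)

x = rng.standard_normal((4, 64))

# Shared trunk: all but the final two blocks run once for both outputs.
h = x
for b in dit.blocks[:-2]:
    h = b(h)

# Original final blocks -> video features; replicated blocks -> depth features.
video = h
for b in dit.blocks[-2:]:
    video = b(video)
depth = h
for b in depth_branch:
    depth = b(depth)

# At initialization the copies share weights, so the two heads agree;
# after fine-tuning the depth branch they would diverge.
assert np.allclose(video, depth)

# The copies are independent objects: updating the branch leaves the backbone intact.
depth_branch[0].w += 0.1
assert not np.allclose(depth_branch[0].w, dit.blocks[-2].w)
```

The point of the deep copy is that the depth branch inherits the backbone's pretrained representations for free, while only the small replicated tail (plus a depth head) needs training, which is what keeps the adaptation lightweight.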