Scaling humanoid foundation models faces a critical bottleneck: the scarcity of robotic data. Vast amounts of egocentric human data exist, but the kinematic differences between humans and humanoids make it hard to transfer that data effectively. UniT (Unified Latent Action Tokenizer via Visual Anchoring) addresses this challenge with a framework designed to forge a universal physical language for human-to-humanoid transfer.
Forging a Universal Physical Language
UniT operates on the principle that disparate kinematic structures share universal visual consequences: for example, a human hand and a robot gripper closing on the same object differ in joint space but produce a similar visible change in the scene. The framework employs a tri-branch cross-reconstruction mechanism. In the first branch, actions are used to predict vision, which anchors kinematics to physical outcomes. In the second, vision is used to reconstruct actions, which filters out visual confounders unrelated to the action. A fusion branch then combines these two purified modalities into a shared, discrete latent space whose tokens represent embodiment-agnostic physical intents. This shared space provides a common ground for understanding and transferring actions across different embodiments.
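To make the tri-branch idea concrete, here is a minimal NumPy sketch of one forward pass. It is an illustration under stated assumptions, not the authors' implementation: the linear encoders and decoders, the nearest-neighbor codebook lookup, the mean-squared-error losses, and every name and dimension (ACTION_DIM, VISION_DIM, unit_losses, and so on) are hypothetical choices made for this example. The weights are random and untrained, and only a single embodiment is shown; a real system would presumably need one action encoder and decoder per embodiment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen only for illustration.
ACTION_DIM, VISION_DIM, LATENT_DIM, CODEBOOK_SIZE = 7, 32, 16, 64

def linear(in_dim, out_dim):
    """Random, untrained weight matrix standing in for a learned layer."""
    return rng.normal(scale=0.1, size=(in_dim, out_dim))

W_act_enc = linear(ACTION_DIM, LATENT_DIM)   # action branch encoder
W_vis_enc = linear(VISION_DIM, LATENT_DIM)   # vision branch encoder
W_fuse = linear(2 * LATENT_DIM, LATENT_DIM)  # fusion branch
W_vis_dec = linear(LATENT_DIM, VISION_DIM)   # latent -> visual change
W_act_dec = linear(LATENT_DIM, ACTION_DIM)   # latent -> action
codebook = rng.normal(size=(CODEBOOK_SIZE, LATENT_DIM))  # shared discrete space

def quantize(z):
    """Map each latent to its nearest codebook entry (assumed VQ-style lookup)."""
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)
    return codebook[idx], idx

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def unit_losses(action, vision):
    """Return fused tokens and the cross-reconstruction losses for one batch."""
    z_act = action @ W_act_enc
    z_vis = vision @ W_vis_enc
    z_fuse = np.concatenate([z_act, z_vis], axis=1) @ W_fuse

    q_act, _ = quantize(z_act)
    q_vis, _ = quantize(z_vis)
    q_fuse, tokens = quantize(z_fuse)

    losses = {
        # Branch 1: actions predict vision (anchors kinematics to outcomes).
        "action_to_vision": mse(q_act @ W_vis_dec, vision),
        # Branch 2: vision reconstructs actions (drops irrelevant visual detail).
        "vision_to_action": mse(q_vis @ W_act_dec, action),
        # Branch 3: the fused code must explain both modalities.
        "fusion_to_vision": mse(q_fuse @ W_vis_dec, vision),
        "fusion_to_action": mse(q_fuse @ W_act_dec, action),
    }
    return tokens, losses

# Toy batch: 4 action vectors and 4 visual-change feature vectors.
action = rng.normal(size=(4, ACTION_DIM))
vision = rng.normal(size=(4, VISION_DIM))
tokens, losses = unit_losses(action, vision)
print(tokens, losses)
```

In training, the four losses would be summed (possibly with weights) and minimized jointly, so that all three branches are pulled toward the same codebook. The sketch omits gradient handling for the non-differentiable lookup, which a trainable version would need to address, for instance with a straight-through estimator.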