A central bottleneck in scaling Vision-Language-Action (VLA) models is the cost of collecting expert demonstrations. This paper argues that the current approach conflates two distinct learning objectives: acquiring physical competence (how to move) and acquiring semantic alignment (what to do). According to the authors, only the latter requires language supervision.
Decomposing Embodied Learning: The TAP Framework
Building on this "Decomposition Hypothesis," the researchers propose Task-Agnostic Pretraining (TAP), a two-stage framework. The first stage learns transferable motor priors from abundant, unlabeled interaction data, including discarded off-task trajectories and autonomous robot play, using a self-supervised Inverse Dynamics objective. A second, lightweight stage then grounds these physical representations in language using a small amount of expert data.
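To make the idea of a self-supervised inverse dynamics objective concrete, here is a minimal toy sketch. It is not the authors' implementation. It assumes a hypothetical 2-D state, toy dynamics in which the action is simply the displacement between consecutive states, and a linear model trained with plain SGD. All function names (`make_play_data`, `predict`, `idm_loss`, `train_idm`) are illustrative. The point it shows is that the training target, the action taken between two observations, comes from the interaction data itself, so no language or task labels are needed.

```python
import random


def make_play_data(n, seed=0):
    """Generate unlabeled transitions (s_t, a_t, s_next) from random 'play'.

    Assumption: toy dynamics where s_next = s_t + a_t. No task or language labels.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        s = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
        a = [rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5)]
        s_next = [s[0] + a[0], s[1] + a[1]]
        data.append((s, a, s_next))
    return data


def predict(W, s, s_next):
    """Linear inverse dynamics model: predict the action from (s_t, s_next)."""
    x = s + s_next  # concatenated 4-D feature vector
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def idm_loss(W, data):
    """Mean squared error between predicted and actual actions."""
    total = 0.0
    for s, a, s_next in data:
        pred = predict(W, s, s_next)
        total += sum((p - t) ** 2 for p, t in zip(pred, a))
    return total / len(data)


def train_idm(data, epochs=200, lr=0.05):
    """Fit the 2x4 weight matrix with per-sample gradient descent."""
    W = [[0.0] * 4 for _ in range(2)]
    for _ in range(epochs):
        for s, a, s_next in data:
            x = s + s_next
            pred = predict(W, s, s_next)
            for i in range(2):
                err = pred[i] - a[i]
                for j in range(4):
                    W[i][j] -= lr * 2.0 * err * x[j]  # gradient of squared error
    return W


if __name__ == "__main__":
    play = make_play_data(200)
    W = train_idm(play)
    print("inverse dynamics loss after training:", idm_loss(W, play))
```

In this toy setting the learned weights approach the mapping `a = s_next - s`. In a framework like the one described, the linear model would be replaced by a learned visual encoder and action head, and the second stage would attach language conditioning on top of the pretrained representation; those details are not specified in this summary.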
Orders of Magnitude Efficiency and Robustness Gains
On the SIMPLER benchmark, TAP matches models trained on over 1 million expert trajectories while using orders of magnitude less labeled data, and it yields a 10% absolute performance gain over standard behavior cloning. On a real-world WidowX platform, TAP retains 25% success under camera perturbations that drive internet-scale baselines to 0% success. These results suggest that TAP produces robust, transferable physical representations and offers a scalable path forward for Embodied AI.