Embodied AI research is currently fragmented: specialized models handle individual tasks such as manipulation or navigation, which limits generalization across robot embodiments, environments, and task families. A unified embodied foundation model is aimed squarely at this bottleneck.
Unifying Embodied Decision-Making
The researchers introduce Qwen-VLA, a unified embodied foundation model designed to tackle heterogeneous embodied decision-making problems. It extends Qwen's vision-language capabilities to continuous action and trajectory generation through an action decoder based on a Diffusion Transformer (DiT), connecting perception, reasoning, and physical action in a single model. The architecture is trained on a large-scale, diverse dataset that combines robotics trajectories, human demonstrations, synthetic data, and vision-and-language navigation data, rather than on a single task family.
Embodiment-Aware Generalization
A key innovation is embodiment-aware prompt conditioning, which lets Qwen-VLA adapt to multiple robot platforms by describing the current embodiment and its control convention in text. Coupled with a unified action-and-trajectory prediction framework, this mechanism enables transferable visual grounding, spatial reasoning, and continuous action generation. Experiments show robust performance and out-of-distribution (OOD) generalization across variations in scene layout, lighting, object configuration, and, critically, robot embodiment. Reported results include LIBERO (97.9%), Simpler-WidowX (73.7%), RoboTwin (86.1%/87.2%), and real-world ALOHA experiments (76.9% average OOD success).
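To make the idea of embodiment-aware prompt conditioning more concrete, here is a minimal sketch of how a textual embodiment description might be prepended to a task instruction. It is not Qwen-VLA's actual prompt format, which this article does not specify. The `EmbodimentSpec` class, the `build_prompt` function, the prompt wording, and the action dimensions and control conventions shown for each robot are all hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbodimentSpec:
    """Hypothetical description of a robot platform (illustrative only)."""
    name: str        # platform name, e.g. "WidowX"
    num_arms: int    # number of manipulator arms
    action_dim: int  # length of one continuous action vector (assumed value)
    control: str     # control convention in plain words (assumed value)


def build_prompt(spec: EmbodimentSpec, instruction: str) -> str:
    """Prefix the task instruction with a textual embodiment description."""
    arms = "single-arm" if spec.num_arms == 1 else f"{spec.num_arms}-arm"
    return (
        f"Embodiment: {spec.name} ({arms}). "
        f"Control: {spec.control}, {spec.action_dim}-dim actions. "
        f"Task: {instruction}"
    )


# Placeholder specs: the numbers and conventions are assumptions, not paper values.
widowx = EmbodimentSpec("WidowX", 1, 7, "delta end-effector pose")
aloha = EmbodimentSpec("ALOHA", 2, 14, "absolute joint positions")

task = "put the spoon on the towel"
print(build_prompt(widowx, task))
print(build_prompt(aloha, task))
```

The same instruction yields a different prompt for each platform, so a single set of model weights can, in principle, be told which action space and control convention to produce without any architectural change per robot.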