The pursuit of truly intelligent embodied agents has long been hampered by the need to train disparate, specialized models for perception, reasoning, and action. This fragmentation leads to inefficiencies and limits the holistic capabilities of AI systems. The introduction of Pelican-Unified 1.0 marks a significant departure, presenting the first embodied foundation model built on the principle of unification.
Unifying Perception, Reasoning, and Imagination
Pelican-Unified 1.0 leverages a single Visual-Language Model (VLM) as a unified understanding and reasoning module. The VLM maps diverse inputs—scenes, instructions, visual contexts, and action histories—into a shared semantic space. Crucially, it also performs autoregressive chain-of-thought reasoning, generating task- and action-oriented sequences in a single pass. This design allows language, video, and action losses to backpropagate into the shared representation, so that understanding, reasoning, imagination, and action are optimized simultaneously rather than by isolated expert systems.
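To make the joint-training idea concrete, here is a minimal PyTorch sketch of the pattern described above: heterogeneous inputs projected into one shared token sequence, a single transformer backbone, and three task heads whose losses all backpropagate into the same shared representation. Every name, dimension, layer choice, and loss weight here (UnifiedVLM, d_model=1024, the 1.0/0.5/0.5 weighting, and so on) is an illustrative assumption for this sketch, not Pelican-Unified 1.0's actual architecture or code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class UnifiedVLM(nn.Module):
    """Illustrative unified model: one backbone, three task heads.

    All names, dimensions, and layer choices are assumptions made for
    this sketch; they are not Pelican-Unified 1.0's actual design.
    """

    def __init__(self, d_model=1024, vocab_size=32000, action_dim=7):
        super().__init__()
        # Modality projections into the shared semantic space.
        self.vision_proj = nn.Linear(768, d_model)   # e.g. ViT patch features
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.action_proj = nn.Linear(action_dim, d_model)
        # Single shared transformer over the fused token sequence
        # (causal masking for autoregressive decoding omitted for brevity).
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=12,
        )
        # Task heads: language tokens, video latents, continuous actions.
        self.lm_head = nn.Linear(d_model, vocab_size)
        self.video_head = nn.Linear(d_model, 768)
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, vision_feats, text_ids, action_hist):
        # Fuse all modalities into one sequence in the shared space.
        tokens = torch.cat(
            [
                self.vision_proj(vision_feats),
                self.text_embed(text_ids),
                self.action_proj(action_hist),
            ],
            dim=1,
        )
        h = self.backbone(tokens)  # shared representation
        return self.lm_head(h), self.video_head(h), self.action_head(h)


model = UnifiedVLM()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Dummy batch: 4 vision tokens, 6 text tokens, 3 past actions.
vision = torch.randn(2, 4, 768)
text = torch.randint(0, 32000, (2, 6))
actions = torch.randn(2, 3, 7)

lm_logits, video_pred, action_pred = model(vision, text, actions)

# Joint objective: all three losses flow into the one shared backbone,
# so a single update optimizes understanding, imagination (video
# prediction), and action together. Targets here are simple
# reconstructions (next-token shifting omitted); weights are assumptions.
lang_loss = F.cross_entropy(lm_logits[:, 4:10].reshape(-1, 32000), text.reshape(-1))
video_loss = F.mse_loss(video_pred[:, :4], vision)
action_loss = F.mse_loss(action_pred[:, 10:], actions)
loss = 1.0 * lang_loss + 0.5 * video_loss + 0.5 * action_loss

loss.backward()
optimizer.step()
```

The point of the sketch is the last few lines: because all three task heads sit on one backbone, a single optimizer step improves the shared representation for every capability at once, rather than training separate expert models.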
Specialist Strength Without Compromise
Contrary to the intuition that unification dilutes capability, Pelican-Unified 1.0 demonstrates that this paradigm can preserve, and even enhance, specialist performance. A single checkpoint of the model achieved an average score of 64.7 across eight VLM benchmarks (outperforming models of comparable scale), ranked first on WorldArena with a score of 66.03, and scored 93.5 on RoboTwin (second best among action methods). These results underscore the efficacy of the unified approach: complex AI capabilities are consolidated without sacrificing individual performance.