UniMotion: Unifying Motion, Vision, and Language

UniMotion establishes a unified framework for continuous motion, vision, and text, overcoming discrete tokenization limits and achieving SOTA cross-modal performance.


The current landscape of multi-modal AI often forces a compromise between discrete representations and limited modality integration. Existing unified models struggle with the inherent temporal continuity of human motion, typically resorting to tokenization that introduces quantization error and fragments sequences. This limitation hinders both generation and understanding across modalities.

Motion as a First-Class Continuous Modality

The UniMotion framework fundamentally shifts this paradigm by treating human motion as a continuous, first-class modality on par with RGB images and natural language. This core principle is realized through a novel Cross-Modal Aligned Motion VAE (CMA-VAE) and symmetric dual-path embedders. These components create parallel continuous pathways within a shared LLM backbone, enabling seamless integration and processing of motion data without the pitfalls of discrete tokenization.
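As a rough illustration of the continuous-pathway idea, the sketch below shows how dense motion latents from a motion VAE could be projected into an LLM's hidden space and interleaved with text embeddings, with no quantization step. All module names, dimensions, and the projection design are assumptions chosen for clarity, not UniMotion's actual architecture.

```python
# Minimal sketch (not the paper's code) of a continuous motion pathway
# into a shared LLM backbone. Shapes and the MLP design are illustrative.
import torch
import torch.nn as nn

class MotionEmbedder(nn.Module):
    """Projects continuous motion latents (e.g., from a motion VAE) into
    the LLM's hidden space, mirroring the text-embedding pathway."""
    def __init__(self, latent_dim: int = 256, hidden_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, motion_latents: torch.Tensor) -> torch.Tensor:
        # motion_latents: (batch, seq_len, latent_dim). Each timestep
        # remains a continuous vector; nothing is discretized.
        return self.proj(motion_latents)

# Usage: concatenate projected motion latents with text embeddings
# before feeding the shared backbone.
embedder = MotionEmbedder()
latents = torch.randn(2, 64, 256)        # dense motion latents
motion_tokens = embedder(latents)        # (2, 64, 4096)
text_tokens = torch.randn(2, 32, 4096)   # from the LLM's text embedder
sequence = torch.cat([text_tokens, motion_tokens], dim=1)
```

The key design point is symmetry: motion enters the backbone through the same kind of embedding pathway as text, rather than through a discrete codebook.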

Injecting Rich Priors and Overcoming Cold-Start Challenges

UniMotion addresses two critical challenges in multi-modal motion learning. First, to imbue motion representations with visual-semantic understanding without requiring images during inference, the Dual-Posterior KL Alignment (DPA) technique distills knowledge from a vision-fused encoder into a motion-only encoder. Second, to counteract the sparsity of text supervision for the newly introduced motion pathway, the Latent Reconstruction Alignment (LRA) self-supervised pre-training strategy leverages dense motion latents. This strategy co-calibrates the embedder, backbone, and flow head, establishing a robust, motion-aware foundation for all subsequent tasks.
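To make the two objectives concrete, here is a minimal PyTorch sketch under stated assumptions: the posteriors are diagonal Gaussians, DPA is written as a KL from the motion-only posterior to a frozen vision-fused teacher (the direction is an assumption), and LRA is written as a generic flow-matching reconstruction of dense motion latents, since the paper mentions a flow head. None of this is the authors' released code; `flow_head` and all shapes are hypothetical.

```python
# Hedged sketch of the DPA and LRA objectives as described in the text.
# Exact formulations, weights, and the flow head are assumptions.
import torch
import torch.nn.functional as F

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians, summed over the last dim."""
    return 0.5 * (
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
        - 1.0
    ).sum(dim=-1)

def dpa_loss(mu_motion, logvar_motion, mu_fused, logvar_fused):
    # DPA: pull the motion-only posterior toward the (detached)
    # vision-fused posterior, so visual semantics are available at
    # inference time when no image is present.
    return gaussian_kl(
        mu_motion, logvar_motion,
        mu_fused.detach(), logvar_fused.detach(),  # teacher is frozen
    ).mean()

def lra_loss(flow_head, backbone_features, target_latents):
    # LRA: self-supervised reconstruction of dense motion latents via the
    # flow head, shown here as a standard flow-matching objective; the
    # paper's actual loss may differ.
    noise = torch.randn_like(target_latents)
    t = torch.rand(target_latents.size(0), 1, 1)
    x_t = (1.0 - t) * noise + t * target_latents   # point on the path
    velocity_target = target_latents - noise       # constant-velocity target
    velocity_pred = flow_head(x_t, t, backbone_features)
    return F.mse_loss(velocity_pred, velocity_target)

# Toy usage with random tensors and a stand-in flow head.
B, T, D = 2, 64, 256
mu_m, lv_m = torch.randn(B, T, D), torch.randn(B, T, D)
mu_f, lv_f = torch.randn(B, T, D), torch.randn(B, T, D)
feats = torch.randn(B, T, D)
dummy_flow_head = lambda x_t, t, h: x_t + h  # placeholder for a real module
total = dpa_loss(mu_m, lv_m, mu_f, lv_f) \
      + lra_loss(dummy_flow_head, feats, torch.randn(B, T, D))
```

Because LRA needs only motion data, it can densely supervise the new pathway before any paired motion-text data is seen, which is how it addresses the cold-start problem the section describes.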

State-of-the-Art Cross-Modal Performance

The efficacy of the UniMotion framework is demonstrated through its state-of-the-art performance across seven diverse tasks. These tasks span any-to-any understanding, generation, and editing among human motion, natural language, and RGB images. The framework exhibits particular strength in complex cross-modal compositional tasks, signaling a significant leap forward in unified multi-modal AI capabilities.
