UniMotion: Unifying Motion, Vision, and Language

UniMotion establishes a unified framework for continuous motion, vision, and text, overcoming discrete tokenization limits and achieving SOTA cross-modal performance.


The current landscape of multi-modal AI often forces a compromise between discrete representations and limited modality integration. Existing unified models struggle with the inherent temporal continuity of human motion, typically resorting to tokenization that introduces quantization error and fragments sequences. This limitation hinders both generation and understanding across modalities.

Motion as a First-Class Continuous Modality

The UniMotion framework fundamentally shifts this paradigm by treating human motion as a continuous, first-class modality on par with RGB images and natural language. This core principle is realized through a novel Cross-Modal Aligned Motion VAE (CMA-VAE) and symmetric dual-path embedders. These components create parallel continuous pathways within a shared LLM backbone, enabling seamless integration and processing of motion data without the pitfalls of discrete tokenization.
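As a rough illustration of the continuous-pathway idea, the sketch below shows how dense motion latents from a motion VAE could be projected into an LLM's hidden space and interleaved with text embeddings, with no quantization step. All module names, dimensions, and the projection design are assumptions chosen for clarity, not UniMotion's actual architecture.

```python
# Minimal sketch (not the paper's code) of a continuous motion pathway
# into a shared LLM backbone. Shapes and the MLP design are illustrative.
import torch
import torch.nn as nn

class MotionEmbedder(nn.Module):
    """Projects continuous motion latents (e.g., from a motion VAE) into
    the LLM's hidden space, mirroring the text-embedding pathway."""
    def __init__(self, latent_dim: int = 256, hidden_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, motion_latents: torch.Tensor) -> torch.Tensor:
        # motion_latents: (batch, seq_len, latent_dim). Each timestep
        # remains a continuous vector; nothing is discretized.
        return self.proj(motion_latents)

# Usage: concatenate projected motion latents with text embeddings
# before feeding the shared backbone.
embedder = MotionEmbedder()
latents = torch.randn(2, 64, 256)        # dense motion latents
motion_tokens = embedder(latents)        # (2, 64, 4096)
text_tokens = torch.randn(2, 32, 4096)   # from the LLM's text embedder
sequence = torch.cat([text_tokens, motion_tokens], dim=1)
```

The key design point is symmetry: motion enters the backbone through the same kind of embedding pathway as text, rather than through a discrete codebook.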

Injecting Rich Priors and Overcoming Cold-Start Challenges

UniMotion addresses two critical challenges in multi-modal motion learning. First, to imbue motion representations with visual-semantic understanding without requiring images during inference, the Dual-Posterior KL Alignment (DPA) technique distills knowledge from a vision-fused encoder into a motion-only encoder. Second, to counteract the sparsity of text supervision for the newly introduced motion pathway, the Latent Reconstruction Alignment (LRA) self-supervised pre-training strategy leverages dense motion latents. This strategy co-calibrates the embedder, backbone, and flow head, establishing a robust, motion-aware foundation for all subsequent tasks.
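To make the two objectives concrete, here is a minimal PyTorch sketch under stated assumptions: the posteriors are diagonal Gaussians, DPA is written as a KL from the motion-only posterior to a frozen vision-fused teacher (the direction is an assumption), and LRA is written as a generic flow-matching reconstruction of dense motion latents, since the paper mentions a flow head. None of this is the authors' released code; `flow_head` and all shapes are hypothetical.

```python
# Hedged sketch of the DPA and LRA objectives as described in the text.
# Exact formulations, weights, and the flow head are assumptions.
import torch
import torch.nn.functional as F

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians, summed over the last dim."""
    return 0.5 * (
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
        - 1.0
    ).sum(dim=-1)

def dpa_loss(mu_motion, logvar_motion, mu_fused, logvar_fused):
    # DPA: pull the motion-only posterior toward the (detached)
    # vision-fused posterior, so visual semantics are available at
    # inference time when no image is present.
    return gaussian_kl(
        mu_motion, logvar_motion,
        mu_fused.detach(), logvar_fused.detach(),  # teacher is frozen
    ).mean()

def lra_loss(flow_head, backbone_features, target_latents):
    # LRA: self-supervised reconstruction of dense motion latents via the
    # flow head, shown here as a standard flow-matching objective; the
    # paper's actual loss may differ.
    noise = torch.randn_like(target_latents)
    t = torch.rand(target_latents.size(0), 1, 1)
    x_t = (1.0 - t) * noise + t * target_latents   # point on the path
    velocity_target = target_latents - noise       # constant-velocity target
    velocity_pred = flow_head(x_t, t, backbone_features)
    return F.mse_loss(velocity_pred, velocity_target)

# Toy usage with random tensors and a stand-in flow head.
B, T, D = 2, 64, 256
mu_m, lv_m = torch.randn(B, T, D), torch.randn(B, T, D)
mu_f, lv_f = torch.randn(B, T, D), torch.randn(B, T, D)
feats = torch.randn(B, T, D)
dummy_flow_head = lambda x_t, t, h: x_t + h  # placeholder for a real module
total = dpa_loss(mu_m, lv_m, mu_f, lv_f) \
      + lra_loss(dummy_flow_head, feats, torch.randn(B, T, D))
```

Because LRA needs only motion data, it can densely supervise the new pathway before any paired motion-text data is seen, which is how it addresses the cold-start problem the section describes.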

State-of-the-Art Cross-Modal Performance

The efficacy of the UniMotion framework is demonstrated through its state-of-the-art performance across seven diverse tasks. These tasks span any-to-any understanding, generation, and editing among human motion, natural language, and RGB images. The framework exhibits particular strength in complex cross-modal compositional tasks, signaling a significant leap forward in unified multi-modal AI capabilities.
