The current landscape of multi-modal AI often forces a compromise between discrete representations and limited modality integration. Existing unified models struggle with the inherent temporal continuity of human motion, typically resorting to discrete tokenization that introduces quantization error and fragments sequences. This limitation constrains both generation and understanding across modalities.
Motion as a First-Class Continuous Modality
The UniMotion framework shifts this paradigm by treating human motion as a continuous, first-class modality on par with RGB images and natural language. This principle is realized through a novel Cross-Modal Aligned Motion VAE (CMA-VAE) and symmetric dual-path embedders, which together create parallel continuous pathways within a shared LLM backbone. Motion data is thus integrated and processed alongside text and images without the quantization error and sequence fragmentation of discrete tokenization.
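To make the dual-path idea concrete, the sketch below shows one plausible way the motion pathway could mirror the text pathway: continuous per-frame latents from the CMA-VAE encoder are projected into the LLM's hidden dimension and concatenated with ordinary text-token embeddings before entering the shared backbone. All class names, dimensions, and the projection architecture here are illustrative assumptions, not the framework's published implementation.

```python
import torch
import torch.nn as nn

class MotionEmbedder(nn.Module):
    """Hypothetical motion-side embedder: maps continuous CMA-VAE
    latents into the LLM's embedding space, mirroring the text path."""

    def __init__(self, latent_dim: int, llm_dim: int):
        super().__init__()
        # Assumed architecture: a linear projection plus layer norm.
        # The actual embedder design is not specified in this section.
        self.proj = nn.Linear(latent_dim, llm_dim)
        self.norm = nn.LayerNorm(llm_dim)

    def forward(self, motion_latents: torch.Tensor) -> torch.Tensor:
        # motion_latents: (batch, num_frames, latent_dim) -- continuous
        # per-frame latents from the CMA-VAE encoder, never quantized.
        return self.norm(self.proj(motion_latents))


def build_multimodal_sequence(text_embeds: torch.Tensor,
                              motion_embeds: torch.Tensor) -> torch.Tensor:
    """Concatenate text and motion embeddings along the sequence axis
    so the shared LLM backbone attends over both modalities jointly."""
    return torch.cat([text_embeds, motion_embeds], dim=1)


# Example with assumed sizes: 4096-d LLM hidden states, 256-d motion latents.
embedder = MotionEmbedder(latent_dim=256, llm_dim=4096)
motion_latents = torch.randn(2, 60, 256)   # 2 clips, 60 frames each
text_embeds = torch.randn(2, 32, 4096)     # 32 text-token embeddings
seq = build_multimodal_sequence(text_embeds, embedder(motion_latents))
print(seq.shape)  # torch.Size([2, 92, 4096])
```

Because the motion stream stays continuous end to end, gradients flow from the backbone through the projection into the VAE latents, avoiding the information loss a discrete codebook lookup would impose.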