Motion-controlled video generation has been held back by two limitations: existing methods cannot disentangle object dynamics from the camera viewpoint, and they lack causal reasoning, so they produce animations that merely displace pixels rather than simulating realistic reactions. These limitations restrict user control and reduce the plausibility of generated scenes.
Unifying Disentangled Motion and Causal Dynamics
The MoRight framework addresses both challenges. It provides a unified approach that disentangles object motion control from arbitrary camera viewpoint adjustments. Object motion is specified in a canonical static view and then transferred to the target camera viewpoint using temporal cross-view attention. This lets users control scene dynamics and camera framing independently, whereas existing methods conflate the two signals.
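To make the cross-view transfer idea concrete, here is a minimal NumPy sketch of one plausible reading of temporal cross-view attention: tokens from the target camera view act as queries and attend to tokens from the canonical static view across all frames. The function name `cross_view_attention`, the tensor shapes, and the projection matrices `w_q`, `w_k`, `w_v` are illustrative assumptions and are not taken from MoRight itself.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_view_attention(target_tokens, canonical_tokens, w_q, w_k, w_v):
    """Hypothetical sketch: target-view tokens attend to canonical-view tokens.

    target_tokens:    (T, N, D) features for the target camera view
    canonical_tokens: (T, M, D) features for the canonical static view,
                      assumed to carry the user-specified object motion
    w_q, w_k, w_v:    (D, D) projection matrices (assumed learned elsewhere)
    """
    T, N, D = target_tokens.shape
    # Flatten time and space so each target token can attend to every
    # canonical token in every frame (the "temporal" part of this sketch).
    q = target_tokens.reshape(T * N, D) @ w_q
    k = canonical_tokens.reshape(-1, D) @ w_k
    v = canonical_tokens.reshape(-1, D) @ w_v
    # Scaled dot-product attention over all canonical tokens.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    out = weights @ v
    return out.reshape(T, N, -1), weights
```

In this sketch the camera viewpoint only affects the queries, while the motion signal enters only through the keys and values, which is one simple way to keep the two controls separate.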
Modeling Motion Causality for Realistic Interactions
Beyond control, MoRight tackles the challenge of motion causality. The framework decomposes motion into 'active' (user-driven) and 'passive' (consequential) components. Because the model learns these causal relationships from data, it generates videos in which actions trigger coherent, physically plausible reactions from other objects in the scene, rather than simple kinematic displacement. The same decomposition supports two inference modes: users can provide an active motion and observe the predicted consequences (forward reasoning), or specify a desired passive outcome and have MoRight infer the driving actions needed to produce it (inverse reasoning). Free camera control is maintained in both modes. Experiments across three benchmarks show state-of-the-art performance in generation quality, controllability, and interaction awareness.
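The forward and inverse reasoning modes described above can be illustrated with a small sketch of how the conditioning signal might be assembled. It assumes, purely for illustration, that motion is represented as 2D point tracks tagged with a role, and that the unspecified role is zeroed out and flagged so a model would have to predict it. The `Track` class and `build_conditioning` function are hypothetical names, not part of MoRight.

```python
from dataclasses import dataclass
from typing import List, Literal, Tuple

import numpy as np


@dataclass
class Track:
    """Hypothetical motion track: T points in image space plus a causal role."""
    points: np.ndarray  # shape (T, 2)
    role: Literal["active", "passive"]


def build_conditioning(
    tracks: List[Track], mode: Literal["forward", "inverse"]
) -> Tuple[np.ndarray, np.ndarray]:
    """Keep the tracks the user specifies and hide the ones to be inferred.

    forward: active tracks are given, passive tracks are left for the model.
    inverse: passive tracks are given, active tracks are left for the model.
    Returns (conditioning of shape (K, T, 2), boolean given-mask of shape (K,)).
    Expects at least one track.
    """
    if mode not in ("forward", "inverse"):
        raise ValueError(f"unknown mode: {mode!r}")
    given_role = "active" if mode == "forward" else "passive"
    # Zero out every track whose role is not the one supplied by the user.
    conditioning = np.stack(
        [t.points if t.role == given_role else np.zeros_like(t.points) for t in tracks]
    )
    given_mask = np.array([t.role == given_role for t in tracks])
    return conditioning, given_mask
```

Calling `build_conditioning(tracks, "forward")` keeps only the active tracks as input, while `"inverse"` keeps only the passive ones. Camera parameters are deliberately absent from this sketch, reflecting the point that viewpoint control stays independent of the reasoning mode.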