Existing text-to-music models struggle to generate music that aligns precisely with video events, mainly because they lack fine-grained temporal control. This limitation hinders the creation of truly immersive audiovisual experiences.
Event Curves: Unlocking Temporal Structure Independently
The core innovation behind V2M-Zero is a novel observation: temporal synchronization depends on matching the *timing* and *rate of change* of events, not their semantic content. Visual and musical events are semantically disparate, yet they share an underlying temporal structure. V2M-Zero captures this structure by computing 'event curves' from intra-modal similarity, using pre-trained music and video encoders. Because each curve measures temporal change within a single modality, curves from the two modalities are directly comparable, which enables a zero-pair training strategy.
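To make the idea of an event curve concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the V2M-Zero implementation: the function name `event_curve` is hypothetical, the curve is assumed to be one minus the cosine similarity between consecutive embeddings, and the toy arrays stand in for the outputs of pre-trained video and music encoders.

```python
import numpy as np


def event_curve(embeddings: np.ndarray) -> np.ndarray:
    """Return one change score per pair of consecutive time steps.

    embeddings: array of shape (T, D), one feature vector per time step.
    Assumption: change is measured as 1 - cosine similarity between
    neighbouring steps; the source does not specify the exact formula.
    """
    # L2-normalise each row so the dot product below is cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-8, None)
    # Cosine similarity between step t and step t + 1.
    similarity = np.sum(unit[:-1] * unit[1:], axis=1)
    # A high value means a large change, i.e. a likely event.
    return 1.0 - similarity


# Toy stand-ins for encoder outputs. The two modalities use different
# embedding sizes and unrelated feature spaces on purpose.
video_features = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
music_features = np.array(
    [[0, 0, 1], [0, 0, 1], [1, 0, 0], [1, 0, 0]], dtype=float
)

video_curve = event_curve(video_features)  # [0., 1., 0.]
music_curve = event_curve(music_features)  # [0., 1., 0.]
```

Although the two feature spaces have nothing in common, both curves peak at the same transition. This is the property the paragraph above relies on: each curve is computed within one modality, yet the results can be compared directly.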