Generating music that precisely aligns with video events has remained a significant hurdle for existing text-to-music models, largely because they lack fine-grained temporal control. This limitation hinders the creation of truly immersive audiovisual experiences.
Event Curves: Unlocking Temporal Structure Independently
The core innovation behind V2M-Zero rests on a key observation: temporal synchronization hinges on matching the *timing* and *rate of change* of events, not their semantic content. While visual and musical events are semantically disparate, they share an underlying temporal structure. V2M-Zero captures this structure by computing 'event curves' from intra-modal similarity using pre-trained music and video encoders. Because each curve measures temporal change within its own modality, curves from different modalities become directly comparable, enabling a zero-pair training strategy.
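The mechanics of such a curve can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: it assumes frame-level embeddings from a pre-trained encoder are already available, and it uses one minus the cosine similarity of consecutive frames as the measure of temporal change, which is only one plausible choice of intra-modal similarity.

```python
import numpy as np

def event_curve(embeddings: np.ndarray) -> np.ndarray:
    """Toy event curve from per-frame embeddings (shape: T x D).

    The embeddings could come from a pre-trained music or video encoder;
    because the curve only measures change WITHIN one modality, curves
    from different modalities live on a comparable scale.
    """
    # L2-normalize each frame embedding.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-8, None)
    # Cosine similarity between consecutive frames; low similarity
    # (i.e., rapid change) yields a high curve value.
    sim = np.sum(unit[:-1] * unit[1:], axis=1)
    curve = 1.0 - sim
    # Rescale to [0, 1] for cross-modal comparability (illustrative choice).
    rng = curve.max() - curve.min()
    return (curve - curve.min()) / rng if rng > 0 else np.zeros_like(curve)

# Example: six frames whose features flip abruptly at frame 3.
emb = np.array([[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3)
curve = event_curve(emb)
# The curve spikes exactly at the transition between frames 2 and 3.
```

The same function applied to video-frame embeddings and to music-frame embeddings yields two curves that can be compared purely on timing and rate of change, which is the property the zero-pair strategy relies on.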
Zero-Pair Training for Broad Applicability
This method allows for a remarkably simple training paradigm. A standard text-to-music model is fine-tuned on music-event curves. Crucially, at inference time, video-event curves are substituted without requiring any cross-modal training or paired data. This drastically lowers the barrier to entry for video-to-music generation, moving beyond the constraints of expensive, meticulously aligned datasets. The approach demonstrates substantial gains across multiple benchmarks (OES-Pub, MovieGenBench-Music, AIST++), outperforming paired-data baselines by significant margins in audio quality (5-21%), semantic alignment (13-15%), temporal synchronization (21-52%), and even beat alignment (28%) on dance videos. These findings were further validated through extensive crowd-sourced subjective listening tests, confirming that temporal alignment can be effectively achieved through within-modality features.
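The substitution step can be made concrete with a stub. Everything here is illustrative, not the paper's actual API: the point is only that the model's conditioning interface accepts an event curve regardless of which modality produced it, so a video curve can replace a music curve at inference with no cross-modal training.

```python
import numpy as np

class ToyTextToMusicModel:
    """Stub standing in for a fine-tuned text-to-music model.

    A real model would synthesize audio conditioned on the curve;
    this stub just records what it was conditioned on, to show that
    training-time (music) and inference-time (video) curves pass
    through the same interface unchanged.
    """
    def sample(self, text: str, event_curve: np.ndarray) -> dict:
        return {"text": text, "n_frames": len(event_curve)}

model = ToyTextToMusicModel()

# During fine-tuning, the model sees curves computed from the music itself.
music_curve = np.array([0.0, 0.1, 0.9, 0.1, 0.0])

# At inference, a curve computed from the video is dropped in unchanged:
# no paired video-music data is ever required.
video_curve = np.array([0.0, 0.8, 0.2, 0.7, 0.0])
out = model.sample("upbeat orchestral score", video_curve)
# out == {"text": "upbeat orchestral score", "n_frames": 5}
```

The design choice this highlights is that all cross-modal knowledge is pushed into the curve representation itself; the generative model never needs to see video at training time.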


