Generating music that precisely aligns with video events has remained a significant hurdle for existing text-to-music models, largely because they lack fine-grained temporal control. This limitation hinders the creation of truly immersive audiovisual experiences.
Event Curves: Unlocking Temporal Structure Independently
The core innovation behind V2M-Zero rests on a key observation: temporal synchronization hinges on matching the *timing* and *rate of change* of events, not their semantic content. While visual and musical events are semantically disparate, they share an underlying temporal structure. V2M-Zero captures this structure by computing 'event curves' from intra-modal similarity using pre-trained music and video encoders. Because each curve measures temporal change within its own modality, curves from different modalities become directly comparable, enabling a zero-pair training strategy.
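The mechanics of such a curve can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: it assumes frame-level embeddings from a pre-trained encoder are already available, and it uses one minus the cosine similarity of consecutive frames as the measure of temporal change, which is only one plausible choice of intra-modal similarity.

```python
import numpy as np

def event_curve(embeddings: np.ndarray) -> np.ndarray:
    """Toy event curve from per-frame embeddings (shape: T x D).

    The embeddings could come from a pre-trained music or video encoder;
    because the curve only measures change WITHIN one modality, curves
    from different modalities live on a comparable scale.
    """
    # L2-normalize each frame embedding.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-8, None)
    # Cosine similarity between consecutive frames; low similarity
    # (i.e., rapid change) yields a high curve value.
    sim = np.sum(unit[:-1] * unit[1:], axis=1)
    curve = 1.0 - sim
    # Rescale to [0, 1] for cross-modal comparability (illustrative choice).
    rng = curve.max() - curve.min()
    return (curve - curve.min()) / rng if rng > 0 else np.zeros_like(curve)

# Example: six frames whose features flip abruptly at frame 3.
emb = np.array([[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3)
curve = event_curve(emb)
# The curve spikes exactly at the transition between frames 2 and 3.
```

The same function applied to video-frame embeddings and to music-frame embeddings yields two curves that can be compared purely on timing and rate of change, which is the property the zero-pair strategy relies on.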
Zero-Pair Training for Broad Applicability
This method allows for a remarkably simple training paradigm. A standard text-to-music model is fine-tuned on music-event curves. Crucially, at inference time, video-event curves are substituted without requiring any cross-modal training or paired data. This drastically lowers the barrier to entry for video-to-music generation, moving beyond the constraints of expensive, meticulously aligned datasets. The approach demonstrates substantial gains across multiple benchmarks (OES-Pub, MovieGenBench-Music, AIST++), outperforming paired-data baselines by significant margins in audio quality (5-21%), semantic alignment (13-15%), temporal synchronization (21-52%), and even beat alignment (28%) on dance videos. These findings were further validated through extensive crowd-sourced subjective listening tests, confirming that temporal alignment can be effectively achieved through within-modality features.
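The substitution step can be made concrete with a stub. Everything here is illustrative, not the paper's actual API: the point is only that the model's conditioning interface accepts an event curve regardless of which modality produced it, so a video curve can replace a music curve at inference with no cross-modal training.

```python
import numpy as np

class ToyTextToMusicModel:
    """Stub standing in for a fine-tuned text-to-music model.

    A real model would synthesize audio conditioned on the curve;
    this stub just records what it was conditioned on, to show that
    training-time (music) and inference-time (video) curves pass
    through the same interface unchanged.
    """
    def sample(self, text: str, event_curve: np.ndarray) -> dict:
        return {"text": text, "n_frames": len(event_curve)}

model = ToyTextToMusicModel()

# During fine-tuning, the model sees curves computed from the music itself.
music_curve = np.array([0.0, 0.1, 0.9, 0.1, 0.0])

# At inference, a curve computed from the video is dropped in unchanged:
# no paired video-music data is ever required.
video_curve = np.array([0.0, 0.8, 0.2, 0.7, 0.0])
out = model.sample("upbeat orchestral score", video_curve)
# out == {"text": "upbeat orchestral score", "n_frames": 5}
```

The design choice this highlights is that all cross-modal knowledge is pushed into the curve representation itself; the generative model never needs to see video at training time.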


