Instance-Aware VLP: Beyond Global Understanding

Current vision-language pre-training (VLP) models capture the overall content of a scene well but struggle with precise, instance-level comprehension. This limitation stems from their reliance on global-only supervision. To bridge the gap, researchers have introduced InstAP, an Instance-Aware Pre-training framework that jointly optimizes global scene alignment and fine-grained, instance-level contrastive alignment. It does so by grounding textual mentions to specific spatial-temporal regions within images and videos.
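
To make the idea of a joint objective concrete, here is a minimal NumPy sketch. It is not InstAP's actual loss: it assumes a symmetric InfoNCE term at each granularity and a simple weighted sum, and the names info_nce, joint_loss, temperature, and instance_weight are all hypothetical choices made for illustration.

```python
import numpy as np

def _log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def info_nce(visual, text, temperature=0.07):
    """Symmetric InfoNCE over N matched pairs: row i of `visual` matches row i of `text`.

    The temperature value is an assumed default, not one taken from the source.
    """
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    logits = v @ t.T / temperature          # (N, N) cosine similarities, scaled
    idx = np.arange(len(v))
    visual_to_text = -_log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_visual = -_log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (visual_to_text + text_to_visual)

def joint_loss(scene_vis, scene_txt, inst_vis, inst_txt, instance_weight=1.0):
    """Global (scene, caption) term plus a weighted instance (region, mention) term.

    The weighted sum is an assumption; the source only says the two are optimized jointly.
    """
    global_term = info_nce(scene_vis, scene_txt)
    instance_term = info_nce(inst_vis, inst_txt)
    return global_term + instance_weight * instance_term
```

In this sketch the scene embeddings would come from whole images or clips paired with full captions, while the instance embeddings would come from grounded regions paired with the text that mentions them.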

Bridging the Granularity Chasm with InstVL

The foundation of this work is InstVL, a large-scale dataset comprising 2 million images and 50,000 videos. InstVL provides dual-granularity annotations: holistic captions that describe the whole scene and dense, grounded descriptions tied to specific instances. These paired annotations give InstAP the supervision it needs to move beyond diffuse, scene-level attention.
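
One way to picture a dual-granularity record is sketched below. The schema is purely illustrative: the class names, field names, box format, and example values are assumptions and do not reflect InstVL's actual file format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max), normalized to [0, 1].
Box = Tuple[float, float, float, float]

@dataclass
class InstanceAnnotation:
    description: str                 # grounded text describing one instance
    boxes_by_frame: Dict[int, Box]   # frame index -> box; a still image would use frame 0 only

    def frame_span(self) -> Tuple[int, int]:
        # First and last frame in which the instance is annotated.
        # Assumes at least one box is present.
        frames = sorted(self.boxes_by_frame)
        return frames[0], frames[-1]

@dataclass
class DualGranularitySample:
    media_id: str
    caption: str                     # holistic, scene-level caption
    instances: List[InstanceAnnotation] = field(default_factory=list)

    def instance_text_pairs(self) -> List[Tuple[str, Dict[int, Box]]]:
        # (description, region track) pairs that an instance-level objective could consume.
        return [(inst.description, inst.boxes_by_frame) for inst in self.instances]

# Made-up example record, for illustration only.
sample = DualGranularitySample(
    media_id="example_video",
    caption="A dog chases a ball across a lawn.",
    instances=[
        InstanceAnnotation("the brown dog", {0: (0.10, 0.40, 0.35, 0.80), 1: (0.20, 0.40, 0.45, 0.80)}),
        InstanceAnnotation("the red ball", {1: (0.60, 0.70, 0.66, 0.76)}),
    ],
)
```

The caption feeds scene-level alignment, while each (description, region track) pair supplies an instance-level positive.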

Superior Instance Retrieval and Global Competitiveness

On the InstVL benchmark, InstAP substantially outperforms existing VLP models on instance-level retrieval. Crucially, InstAP also outperforms a strong VLP baseline trained on the exact same data corpus, which attributes the gains to the instance-aware objective rather than to the additional data. Furthermore, this instance-centric pre-training does not come at the expense of global understanding: InstAP achieves competitive zero-shot performance on multiple established video benchmarks, including MSR-VTT and DiDeMo.
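
The summary above does not say which retrieval metric is reported. As an illustration only, the sketch below computes Recall@K, a metric commonly used for text-to-video retrieval, from a query-by-candidate similarity matrix. The function name recall_at_k and the convention that candidate i is the correct match for query i are assumptions.

```python
import numpy as np

def recall_at_k(similarity, k):
    """Fraction of queries whose correct candidate ranks in the top k.

    similarity[i, j] is the score between query i and candidate j, and
    candidate i is assumed to be the single correct match for query i.
    Ties are resolved in favor of the correct candidate.
    """
    similarity = np.asarray(similarity, dtype=float)
    correct_scores = np.diag(similarity)[:, None]
    # Number of candidates scored strictly higher than the correct one.
    rank_of_correct = (similarity > correct_scores).sum(axis=1)
    return float((rank_of_correct < k).mean())

# Toy scores: query 0 ranks its match first, query 1 second, query 2 third.
toy_similarity = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.5, 0.1],
    [0.2, 0.3, 0.1],
])
```

For instance-level retrieval the queries would be instance descriptions and the candidates grounded regions; for zero-shot video retrieval they would be captions and whole clips.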