Instance-Aware VLP: Beyond Global Understanding

InstAP introduces instance-aware pre-training for VLP, enhancing instance-level reasoning and global understanding with the InstVL dataset.

Current vision-language pre-training (VLP) models capture the overall content of a scene well but struggle with precise, instance-level comprehension. The limitation stems from their reliance on global-only supervision. To close this gap, researchers have introduced InstAP, an Instance-Aware Pre-training framework that jointly optimizes global vision-text alignment and fine-grained, instance-level contrastive alignment. It does so by grounding textual mentions to specific spatio-temporal regions within images and videos.
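
The joint objective described above can be sketched in a few lines. This is a minimal illustration, not InstAP's actual implementation: the article does not specify the loss, so a symmetric InfoNCE loss is assumed for both terms, and the names `info_nce`, `joint_loss`, and `instance_weight`, along with the temperature value, are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss; row i of `a` is the positive for row i of `b`."""
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)  # subtract row max for numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the a-to-b and b-to-a directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))


def joint_loss(global_vis, global_txt, region_vis, mention_txt, instance_weight=1.0):
    """Assumed form: global alignment term plus a weighted instance-level term."""
    global_term = info_nce(global_vis, global_txt)       # whole image/video vs. caption
    instance_term = info_nce(region_vis, mention_txt)    # grounded region vs. its mention
    return global_term + instance_weight * instance_term
```

Here `region_vis` would hold features pooled from the grounded regions and `mention_txt` the embeddings of the matching textual mentions. How InstAP actually extracts and weights these is not described in the article.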

Bridging the Granularity Chasm with InstVL

The framework is built on InstVL, a large-scale dataset of 2 million images and 50,000 videos. InstVL provides dual-granularity annotations: holistic captions for scene-level understanding and dense, grounded descriptions of specific instances. This supervision lets InstAP learn region-level alignment rather than relying on diffuse, scene-level attention alone.
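
To make the dual-granularity idea concrete, one sample might be represented roughly as below. The schema is an assumption for illustration only: the article does not describe InstVL's file format, so every class name, field name, path, and caption here is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class InstanceAnnotation:
    """One grounded description tied to a region (hypothetical fields)."""
    description: str
    box_xyxy: Tuple[float, float, float, float]   # assumed normalized to [0, 1]
    frame_span: Optional[Tuple[int, int]] = None  # (start, end) frames; None for images


@dataclass
class DualGranularitySample:
    """A holistic caption plus zero or more grounded instance descriptions."""
    media_path: str
    global_caption: str
    instances: List[InstanceAnnotation] = field(default_factory=list)


def to_training_pairs(sample):
    """Flatten a sample into (level, text, region) tuples, one per supervision signal."""
    pairs = [("global", sample.global_caption, None)]
    for inst in sample.instances:
        pairs.append(("instance", inst.description, (inst.box_xyxy, inst.frame_span)))
    return pairs


# Made-up example record for a short video clip.
sample = DualGranularitySample(
    media_path="clips/example.mp4",
    global_caption="A dog chases a ball across a park.",
    instances=[
        InstanceAnnotation("a brown dog running", (0.10, 0.40, 0.35, 0.80), (0, 48)),
        InstanceAnnotation("a red ball", (0.60, 0.70, 0.66, 0.78), (12, 48)),
    ],
)
pairs = to_training_pairs(sample)
```

Each sample yields one global pair and one pair per grounded instance, which is the shape of supervision a joint global and instance-level objective would consume.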

Superior Instance Retrieval and Global Competitiveness

On the InstVL benchmark, InstAP substantially outperforms existing VLP models on instance-level retrieval. It also outperforms a strong VLP baseline trained on the exact same data, which attributes the gain to the instance-aware objective rather than to the additional data. This instance-centric pre-training does not come at the expense of global understanding: InstAP achieves competitive zero-shot performance on several established video benchmarks, including MSR-VTT and DiDeMo.
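
Retrieval results of this kind are commonly reported as recall@K. The article does not state which metric the InstVL benchmark uses, so the sketch below is an assumption about how such a score could be computed; the function name `recall_at_k` and the toy similarity matrix are illustrative, not results from the paper.

```python
import numpy as np


def recall_at_k(similarity, k):
    """Fraction of queries whose correct candidate ranks in the top k.

    similarity[i, j] is the score of query i against candidate j, and the
    correct candidate for query i is assumed to be candidate i.
    """
    correct_scores = np.diag(similarity)[:, None]
    # Rank of the correct candidate = number of candidates scoring strictly higher.
    ranks = (similarity > correct_scores).sum(axis=1)
    return float(np.mean(ranks < k))


# Toy 3x3 similarity matrix (made-up numbers).
toy_similarity = np.array([
    [0.9, 0.1, 0.0],   # correct candidate ranked first
    [0.8, 0.5, 0.1],   # correct candidate ranked second
    [0.2, 0.3, 0.1],   # correct candidate ranked third
])
r1 = recall_at_k(toy_similarity, 1)
r2 = recall_at_k(toy_similarity, 2)
```

For instance-level retrieval, the queries would be instance descriptions and the candidates grounded regions (or the reverse); for zero-shot benchmarks such as MSR-VTT, they would be captions and whole videos.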

© 2026 StartupHub.ai. All rights reserved.