Large foundation models such as Segment Anything Model 3 (SAM 3) hold considerable promise for concept-prompted segmentation, yet their direct application to open-vocabulary semantic segmentation (OVSS) faces a critical bottleneck: computational inefficiency. Traditional methods demand full-resolution decoding across the entire dataset vocabulary for every image, even though each image contains only a sparse subset of the classes. ActiveSAM addresses this problem. It is a training-free, zero-shot inference framework designed to turn SAM 3 into an active-vocabulary segmenter.
Preview-Driven Active Vocabulary Selection
ActiveSAM tackles OVSS inefficiency by decoding only the classes that are likely to appear in a given image. The framework first canonicalizes and expands class prompts. It then estimates an image-conditioned active set from a low-resolution 'presence preview'. This preview stage uses only class-presence evidence, skipping unnecessary computation in the segmentation head. Only the classes identified as relevant in the preview are subsequently decoded at full resolution. This selective processing, combined with bucketed prompt multiplexing using the frozen SAM 3 decoder, dramatically reduces computational overhead without requiring any target-dataset training, weight updates, or oracle class-presence labels.
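The control flow described above can be illustrated with a minimal Python sketch. It is not the ActiveSAM implementation and does not call any SAM 3 API. The names `canonicalize`, `select_active_classes`, `make_buckets`, `segment_active_vocabulary`, `toy_preview`, and `toy_decode` are hypothetical, and the fixed score threshold and the bucket size are assumptions made for illustration, since the source does not say how the active set is derived from the preview. Prompt expansion is omitted.

```python
from typing import Any, Callable, Dict, List, Sequence


def canonicalize(name: str) -> str:
    """Normalize a class prompt (assumed rule: trim, lowercase, underscores to spaces)."""
    return name.strip().lower().replace("_", " ")


def select_active_classes(scores: Dict[str, float], threshold: float) -> List[str]:
    """Keep classes whose preview presence score reaches the threshold.

    Thresholding is an assumption of this sketch, not a documented ActiveSAM rule.
    """
    return [name for name, score in scores.items() if score >= threshold]


def make_buckets(items: Sequence[str], bucket_size: int) -> List[List[str]]:
    """Split the active classes into fixed-size groups of prompts."""
    if bucket_size < 1:
        raise ValueError("bucket_size must be at least 1")
    return [list(items[i:i + bucket_size]) for i in range(0, len(items), bucket_size)]


def segment_active_vocabulary(
    image: Any,
    vocabulary: Sequence[str],
    preview_fn: Callable[[Any, List[str]], Dict[str, float]],
    decode_fn: Callable[[Any, List[str]], Dict[str, Any]],
    threshold: float = 0.5,
    bucket_size: int = 4,
) -> Dict[str, Any]:
    """Run a cheap presence preview, then decode only the active classes."""
    prompts = [canonicalize(name) for name in vocabulary]
    scores = preview_fn(image, prompts)          # low-cost pass over all classes
    active = select_active_classes(scores, threshold)
    masks: Dict[str, Any] = {}
    for bucket in make_buckets(active, bucket_size):
        masks.update(decode_fn(image, bucket))   # full-cost pass, active classes only
    return masks


# Toy stand-ins for the preview and the decoder (hypothetical, for demonstration only).
def toy_preview(image: Dict[str, set], prompts: List[str]) -> Dict[str, float]:
    return {p: (0.9 if p in image["contains"] else 0.1) for p in prompts}


def toy_decode(image: Dict[str, set], bucket: List[str]) -> Dict[str, str]:
    return {p: f"mask<{p}>" for p in bucket}


image = {"contains": {"road", "car"}}
vocabulary = ["Road", "sky", "car", "boat", "traffic_light"]
masks = segment_active_vocabulary(image, vocabulary, toy_preview, toy_decode, bucket_size=1)
print(masks)
```

In this toy run the decoder is called for two of the five classes. That is the source of the savings: the full-resolution cost scales with the size of the active set rather than with the size of the whole vocabulary.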
Enhanced Speed-Accuracy and Robustness
Across eight OVSS benchmarks, ActiveSAM demonstrates a better speed-accuracy tradeoff than existing methods. It outperforms the current state-of-the-art SegEarth-OV3 by approximately 1.4 mIoU on average while running up to 5.5x faster on large-vocabulary datasets. Beyond raw performance, ActiveSAM is robust under image corruptions that simulate real-world distribution shifts. This resilience makes it well suited for deployment in noisy-input domains such as autonomous driving and embodied AI, where reliable segmentation is paramount.