ActiveSAM: Efficient Open-Vocabulary Segmentation

Large foundation models such as Segment Anything Model 3 (SAM 3) hold considerable promise for concept-prompted segmentation, but applying them directly to open-vocabulary semantic segmentation (OVSS) runs into a computational bottleneck. Existing approaches run full-resolution decoding over the entire dataset vocabulary for every image, even though each image contains only a sparse subset of those classes. ActiveSAM addresses this as a training-free, zero-shot inference framework that turns SAM 3 into an active-vocabulary segmenter.

Visual TL;DR: The SAM 3 Inefficiency problem motivates the ActiveSAM Framework. The ActiveSAM Framework introduces Preview-Driven Selection, which involves Canonicalize Class Prompts, enables Skip Unnecessary Computation, and leads to Boosted Speed & Accuracy and Enhanced Robustness.

SAM 3 Inefficiency: full-resolution decoding across entire dataset vocabulary for every image
ActiveSAM Framework: training-free, zero-shot inference framework for active-vocabulary segmentation
Preview-Driven Selection: estimates an image-conditioned active set from a low-resolution presence preview
Canonicalize Class Prompts: expands class prompts for more relevant and efficient identification
Skip Unnecessary Computation: intelligently skips computation in segmentation based on presence evidence
Boosted Speed & Accuracy: dynamically identifies relevant classes, significantly improving segmentation performance
Enhanced Robustness: better performance for real-world AI applications with diverse data

Preview-Driven Active Vocabulary Selection

ActiveSAM tackles this inefficiency in stages. The framework first canonicalizes and expands the class prompts. It then estimates an image-conditioned active set from a low-resolution 'presence preview'. The preview stage uses only class-presence evidence, so it can skip computation in the segmentation head that it does not need. Only the classes the preview flags as relevant are then decoded at full resolution. Combined with bucketed prompt multiplexing on the frozen SAM 3 decoder, this selective processing reduces computational overhead without any target-dataset training, weight updates, or oracle class-presence labels.
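
The selection-then-bucketing flow described above can be sketched in a few lines of Python. This is only an illustrative sketch, not ActiveSAM's actual implementation: the function names, the fixed threshold of 0.5, the bucket size of 3, and the presence scores are all hypothetical, and the real framework derives its presence evidence from SAM 3 rather than from a hand-written dictionary.

```python
from typing import Dict, List, Sequence


def select_active_set(presence: Dict[str, float],
                      threshold: float = 0.5,
                      min_keep: int = 1) -> List[str]:
    """Keep classes whose preview presence score reaches the threshold.

    `threshold` and `min_keep` are assumed knobs for illustration only.
    """
    # Rank classes by preview score, highest first.
    ranked = sorted(presence, key=presence.get, reverse=True)
    active = [name for name in ranked if presence[name] >= threshold]
    # Fall back to the top-scoring classes so the active set is never empty.
    if len(active) < min_keep:
        active = ranked[:min_keep]
    return active


def bucket_prompts(classes: Sequence[str], bucket_size: int) -> List[List[str]]:
    """Group class prompts into fixed-size buckets, one decoder pass per bucket."""
    if bucket_size < 1:
        raise ValueError("bucket_size must be at least 1")
    return [list(classes[i:i + bucket_size])
            for i in range(0, len(classes), bucket_size)]


# Hypothetical low-resolution presence scores for an 8-class vocabulary.
preview_scores = {
    "road": 0.92, "car": 0.81, "sky": 0.77, "zebra": 0.03,
    "boat": 0.08, "person": 0.55, "sofa": 0.01, "train": 0.12,
}

active = select_active_set(preview_scores, threshold=0.5)
full_buckets = bucket_prompts(list(preview_scores), bucket_size=3)
active_buckets = bucket_prompts(active, bucket_size=3)

print("active set:", active)
print("full-vocabulary decoder passes:", len(full_buckets))
print("active-set decoder passes:", len(active_buckets))
```

In this toy setting, only the classes that pass the preview are sent to full-resolution decoding, so the number of bucketed decoder passes shrinks along with the active set. The gap would grow with vocabulary size, which is consistent with the article's point that the savings matter most on large-vocabulary datasets.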

Enhanced Speed-Accuracy and Robustness

Across eight OVSS benchmarks, ActiveSAM shows a better speed-accuracy tradeoff than existing methods. It outperforms the current state-of-the-art SegEarth-OV3 by approximately 1.4 mIoU on average and runs up to 5.5x faster on large-vocabulary datasets. ActiveSAM is also robust to image corruptions that simulate real-world distribution shifts, which makes it well suited to noisy-input domains such as autonomous driving and embodied AI, where reliable segmentation is paramount.
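
For readers unfamiliar with the metric behind the accuracy figure, the following sketch shows how mean intersection-over-union (mIoU) is commonly computed from flattened label maps. It is a generic illustration under stated assumptions: the function name and the toy labels are hypothetical, and individual benchmarks may differ in details such as ignore labels or how absent classes are handled.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean IoU in percent over classes that appear in pred or target."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if p == t:
            # A correct pixel counts once toward both intersection and union.
            intersection[p] += 1
            union[p] += 1
        else:
            # A wrong pixel enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    if not ious:
        raise ValueError("no class is present in pred or target")
    return 100.0 * sum(ious) / len(ious)


# Toy example: six pixels, three classes, one pixel of class 1 mislabeled as 2.
target_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [0, 0, 1, 2, 2, 2]
score = mean_iou(pred_labels, target_labels, num_classes=3)
print(f"mIoU: {score:.2f}")
```

Here the per-class IoUs are 1.0, 0.5, and 2/3, so a single mislabeled pixel already moves the average noticeably. A difference of about 1.4 mIoU averaged over eight benchmarks is a shift on this same percentage scale.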

AI Daily Digest