Meta has unveiled Segment Anything Model 3 (SAM 3), a unified foundation model that finally brings robust, open-vocabulary language understanding to visual segmentation and tracking.
The original Segment Anything Model (SAM) was a watershed moment for computer vision, allowing users to instantly mask any object in an image using simple visual prompts like points or bounding boxes. Now, Meta is pushing the technology into the realm of true multimodal understanding with SAM 3, a unified system for detection, segmentation, and tracking that responds directly to complex text prompts.
This release is arguably the most significant advancement in the SAM lineage since its inception. SAM 3 overcomes the core limitation of its predecessors: the inability to handle open-vocabulary concepts defined by language. Closed-vocabulary models could segment a "person," but they struggled with nuanced requests like "the striped red umbrella" or "people sitting down but not holding a gift box in their hands." SAM 3 solves this by introducing promptable concept segmentation, accepting short noun phrases or image exemplars to define the target concept, as sketched below.
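Conceptually, a concept prompt pairs an image with either a short noun phrase or an exemplar region, and the model returns masks for every matching instance. The following is a minimal, hypothetical sketch of that prompting pattern; `segment_concept`, the prompt dictionary layout, and the return format are placeholders chosen for exposition, not Meta's published SAM 3 API.

```python
# Hypothetical sketch of concept-prompted segmentation.
# `segment_concept`, `ConceptPrompt`, and the return format below are
# illustrative placeholders, not Meta's actual SAM 3 interface.
from typing import TypedDict


class ConceptPrompt(TypedDict, total=False):
    text: str                                 # short noun phrase, e.g. "striped red umbrella"
    exemplar_box: tuple[int, int, int, int]   # (x0, y0, x1, y1) around one example instance


def segment_concept(image_path: str, prompt: ConceptPrompt) -> list[dict]:
    """Placeholder: a real model call would return one mask per matching instance."""
    print(f"segmenting {image_path!r} for concept {prompt}")
    return []  # e.g. [{"mask": ..., "score": 0.91}, ...]


# A text prompt defines the concept with a short noun phrase.
masks = segment_concept("street.jpg", {"text": "striped red umbrella"})

# An exemplar prompt defines the concept by pointing at one instance;
# the model is expected to find the other matching instances.
masks = segment_concept("street.jpg", {"exemplar_box": (120, 40, 260, 210)})
```

The key distinction is that a concept prompt refers to every instance matching the phrase or exemplar in the scene, whereas the original SAM's point and box prompts isolate a single object per interaction.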
