Segment Anything Model (SAM)
2025-08-07
This is my learning note for Segment Anything Model (SAM).
Main reference: Kirillov et al., "Segment Anything" (arXiv:2304.02643)
1. Introduction
The Segment Anything Model (SAM) represents a paradigm shift in image segmentation, positioning itself as a foundation model analogous to large language models in NLP. Its primary goal is to enable general-purpose, prompt-driven segmentation with zero-shot generalization, meaning it can segment novel objects without additional training.
SAM introduces the concept of promptable segmentation, where segmentation is treated as a task conditioned on flexible prompts — such as points, bounding boxes, or masks. It is trained on a uniquely large dataset (SA-1B), allowing it to generalize across domains and tasks.
2. Core Ideas and Design Principles
2.1 Promptable Segmentation
SAM redefines segmentation as a prompt-conditioned task. The model does not output segmentation maps unconditionally. Instead, it receives a prompt specifying what to segment, and it returns a corresponding mask.
Supported prompt types (see the sketch at the end of this subsection):
- Points: foreground/background clicks
- Bounding boxes
- Masks: optional prior masks for refinement
This formulation enables:
- Interactive segmentation in real time
- Composability for downstream systems
- Zero-shot transfer across segmentation use cases
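As a rough sketch of how these prompt types surface in the official `segment_anything` Python package (the checkpoint filename, image path, and coordinates below are illustrative placeholders):

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM checkpoint (filename is a placeholder for the released ViT-H weights).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# The image encoder runs once here; prompts are applied afterwards.
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Point prompt: one foreground click (label 1 = foreground, 0 = background).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,
)

# Box prompt: [x0, y0, x1, y1] in pixel coordinates.
box_masks, box_scores, _ = predictor.predict(
    box=np.array([425, 600, 700, 875]),
    multimask_output=False,
)
```

Mask prompts are passed the same way via the `mask_input` argument, typically reusing the low-resolution logits returned by a previous call (shown in Section 3.3).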
2.2 Foundation Model Perspective
SAM adopts the foundation model philosophy:
- Train once on a massive, diverse dataset
- Generalize to a wide range of tasks without retraining
- Modular architecture for extensibility
Like GPT for text, SAM is designed to be used as a general segmentation engine across tasks and domains.
2.3 Decoupled Design
SAM separates image and prompt processing:
- The image encoder embeds the image once.
- The prompt encoder transforms user inputs into learned embeddings.
- The mask decoder combines both to predict masks.
This architectural separation allows:
- Fast inference (the image embedding is computed once and reused; see the sketch below)
- Efficient processing of multiple prompts per image
- Modular use in systems with varying interaction requirements
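Concretely, this decoupling is visible in the predictor workflow: the heavy image encoder runs once inside `set_image`, and each additional prompt only exercises the lightweight prompt encoder and mask decoder. A minimal sketch (checkpoint name and click coordinates are placeholders):

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

predictor = SamPredictor(
    sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")  # placeholder checkpoint
)

# A dummy RGB image stands in for real data; set_image performs the single ViT forward pass.
image = (np.random.rand(480, 640, 3) * 255).astype(np.uint8)
predictor.set_image(image)

# Simulated interactive session: each click reuses the cached image embedding,
# so only the small prompt encoder + mask decoder run per prompt.
for x, y in [(120, 80), (340, 260), (510, 430)]:
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[x, y]]),
        point_labels=np.array([1]),
        multimask_output=True,
    )
```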
3. Architecture Overview
SAM comprises three primary components:
| Component | Role |
| --- | --- |
| Image Encoder | Produces a dense embedding of the input image |
| Prompt Encoder | Encodes spatial prompts (points, boxes, masks) |
| Mask Decoder | Combines image and prompt embeddings to produce masks |
3.1 Image Encoder
- Backbone: ViT-Huge (ViT-H) transformer; smaller ViT-L and ViT-B variants are also released
- Input resolution: 1024×1024
- Output: a 64×64 image embedding with 256 channels
The image encoder runs once per image, so every subsequent prompt reuses the cached embedding; this keeps interactive inference fast and scalable.
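A small sketch for checking the embedding shape with the public predictor API (random weights are used because only shapes matter here; `get_image_embedding` is assumed to behave as in the public repo):

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# checkpoint=None builds the model with random weights, which is enough for a shape check.
predictor = SamPredictor(sam_model_registry["vit_b"](checkpoint=None))

# Any RGB image is internally resized/padded to the 1024x1024 input resolution.
predictor.set_image((np.random.rand(768, 1024, 3) * 255).astype(np.uint8))

embedding = predictor.get_image_embedding()
print(embedding.shape)  # expected: torch.Size([1, 256, 64, 64])
```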
3.2 Prompt Encoder
Input: prompts given as spatial coordinates (points, boxes) or as coarse masks
Points and boxes are encoded with positional encodings combined with learned per-prompt-type embeddings; mask prompts are embedded densely via a small convolutional network
Supports:
- Foreground/background points (positive/negative)
- Bounding boxes
- Optional coarse masks (encoded as low-res spatial maps)
The encoded prompts are transformed into tokens that are fed to the decoder along with image features.
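A lower-level sketch of the prompt encoder on its own; the module and argument names follow the public `segment_anything` repo, but exact shapes and signatures may differ across versions, so treat this as an assumption-laden illustration:

```python
import torch
from segment_anything import sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint=None)  # random weights; we only inspect shapes

# One foreground point in the 1024x1024 model input frame: coords (B, N, 2), labels (B, N).
coords = torch.tensor([[[512.0, 384.0]]])
labels = torch.tensor([[1]])

sparse, dense = sam.prompt_encoder(points=(coords, labels), boxes=None, masks=None)
print(sparse.shape)  # sparse prompt tokens, e.g. (1, 2, 256): the point plus a padding token
print(dense.shape)   # dense embedding (1, 256, 64, 64); a learned "no mask" map when masks=None
```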
3.3 Mask Decoder
Architecture: Lightweight Transformer decoder
Inputs:
- Sparse prompt tokens
- Dense image embeddings (from ViT)
Outputs:
- 3 candidate masks per prompt (a single mask can be requested when the prompt is unambiguous)
- A predicted IoU (quality) score for each candidate mask
Each mask corresponds to a possible interpretation of the prompt. This multi-mask output accounts for segmentation ambiguity and supports interactive refinement.
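A hedged sketch of consuming the multi-mask output with the predictor API: pick the candidate with the highest predicted IoU score, and optionally feed its low-resolution logits back as a mask prompt for refinement (checkpoint, image, and coordinates are placeholders):

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

predictor = SamPredictor(sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth"))
predictor.set_image((np.random.rand(480, 640, 3) * 255).astype(np.uint8))  # placeholder image

point = np.array([[320, 240]])
label = np.array([1])

# One ambiguous click -> several candidate masks, each with a predicted quality score.
masks, scores, low_res_logits = predictor.predict(
    point_coords=point, point_labels=label, multimask_output=True
)
best = int(np.argmax(scores))
final_mask = masks[best]  # boolean array of shape (H, W)

# Refinement: reuse the chosen candidate's low-res logits as a mask prompt.
refined, _, _ = predictor.predict(
    point_coords=point,
    point_labels=label,
    mask_input=low_res_logits[best : best + 1],  # shape (1, 256, 256)
    multimask_output=False,
)
```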
3.4 Inference Pipeline
Given an input image and prompt:
Image  → Image Encoder  → Image Embedding
Prompt → Prompt Encoder → Prompt Embedding
(Image Embedding + Prompt Embedding) → Mask Decoder → Candidate Masks + IoU scores
4. Training Methodology
SAM is trained on Segment Anything 1-Billion (SA-1B), the largest known segmentation dataset.
4.1 Dataset: SA-1B
Over 1 billion masks
Spanning 11 million high-resolution images
Labeled via a model-in-the-loop "data engine" with three stages:
- Assisted-manual: annotators correct and refine masks proposed by the model, which is retrained as the data grows
- Semi-automatic: the model pre-fills confident masks; annotators label the remaining objects
- Fully automatic: the final SA-1B masks are generated by prompting the model with a regular grid of points
4.2 Training Objective
- Supervised learning using prompt-mask pairs
- Mask loss: a linear combination of focal loss and dice loss; an auxiliary head predicting each mask's IoU is trained with mean-squared-error loss (sketched below)
- Diverse prompt types included during training
- Random augmentations and prompt sampling used to simulate various use conditions
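To make the objective concrete, here is a hedged PyTorch sketch of the loss terms: focal plus dice loss for the mask (the paper reports a 20:1 focal-to-dice weighting) and a mean-squared-error loss for the predicted IoU. Function names, weights, and tensor shapes are illustrative, not the authors' code:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Sigmoid focal loss averaged over all pixels (targets are 0/1 float masks)."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def dice_loss(logits, targets, eps=1.0):
    """Soft dice loss computed per mask, then averaged over the batch."""
    p = torch.sigmoid(logits).flatten(1)
    t = targets.flatten(1)
    inter = (p * t).sum(-1)
    return (1 - (2 * inter + eps) / (p.sum(-1) + t.sum(-1) + eps)).mean()

def sam_style_loss(pred_logits, gt_masks, pred_iou, focal_weight=20.0):
    """Illustrative combination: focal + dice for the mask, MSE for the IoU head."""
    mask_loss = focal_weight * focal_loss(pred_logits, gt_masks) + dice_loss(pred_logits, gt_masks)
    with torch.no_grad():
        pred_bin = (pred_logits > 0).float()
        inter = (pred_bin * gt_masks).flatten(1).sum(-1)
        union = ((pred_bin + gt_masks) > 0).float().flatten(1).sum(-1)
        actual_iou = inter / union.clamp(min=1.0)  # IoU of predicted vs. ground-truth mask
    iou_loss = F.mse_loss(pred_iou, actual_iou)
    return mask_loss + iou_loss

# Toy usage with random tensors: (B, H, W) mask logits and a (B,) IoU prediction.
logits = torch.randn(2, 256, 256)
gt = (torch.rand(2, 256, 256) > 0.5).float()
print(sam_style_loss(logits, gt, torch.rand(2)))
```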
5. Generalization and Capabilities
SAM demonstrates strong zero-shot generalization to:
- Natural images, indoor/outdoor scenes
- Art, satellite imagery, historical photos
- Synthetic and scientific images (e.g., microscopy)
However, it does not produce class labels — SAM only returns masks, not semantic categories.
Advantages:
- Works with unseen objects
- Supports interactive segmentation
- Can be composed with other models (e.g., CLIP, GroundingDINO); see the sketch after the limitations list
Limitations:
- Coarse masks on fine structures or small objects
- Large model (ViT-H is resource-intensive)
- Requires post-processing for instance tracking or semantic labeling
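Because SAM returns class-agnostic masks, a common post-processing step is to hand each masked region to a separate classifier. The sketch below pairs SAM's boolean masks with a CLIP model from Hugging Face `transformers`; the checkpoint name, helper function, and label set are illustrative assumptions, not part of SAM itself:

```python
import numpy as np
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
candidate_labels = ["a dog", "a cat", "a car", "background clutter"]  # illustrative label set

def label_mask(image_rgb: np.ndarray, mask: np.ndarray) -> str:
    """Crop the mask's bounding box, zero out non-mask pixels, and classify the crop with CLIP."""
    ys, xs = np.where(mask)
    crop = image_rgb[ys.min() : ys.max() + 1, xs.min() : xs.max() + 1].copy()
    crop[~mask[ys.min() : ys.max() + 1, xs.min() : xs.max() + 1]] = 0  # suppress background
    inputs = processor(text=candidate_labels, images=crop, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = clip(**inputs).logits_per_image.softmax(dim=-1)[0]
    return candidate_labels[int(probs.argmax())]
```

Here `mask` is one of the boolean (H, W) masks returned by the predictor, and `image_rgb` is the original image passed to `set_image`.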
6. Theoretical Impact and Innovation
| Innovation | Implication |
| --- | --- |
| Promptable segmentation | Unifies multiple segmentation tasks under one model |
| Decoupled image/prompt encoders | Enables fast embedding reuse and interactivity |
| Ambiguity-aware multi-mask output | Captures uncertainty, supports user correction |
| Large-scale interactive data engine | Makes broad generalization feasible |
| Foundation model for vision | A step toward GPT-like general-purpose vision models |
7. Extensions and Ecosystem
SAM has sparked a wide ecosystem of extensions:
- Grounded-SAM: combines SAM with Grounding DINO for open-vocabulary, text-prompted segmentation
- MedSAM: fine-tuned SAM for medical imaging
- MobileSAM: efficient variant for edge devices
- SAM + DETR/Mask R-CNN: instance-aware adaptations
SAM is increasingly being used as a building block in broader CV pipelines.
8. Summary
Segment Anything Model (SAM) is a landmark foundation model in computer vision, enabling prompt-driven, general-purpose segmentation at scale. Its architecture and training methodology exemplify how prompt-based designs and massive pretraining can lead to strong zero-shot generalization.
SAM is not just a model — it is a framework for interactive, flexible, and extensible segmentation tasks.
Suggested Reading
- Kirillov et al., "Segment Anything" (arXiv:2304.02643)
- Meta AI’s official GitHub repository: https://github.com/facebookresearch/segment-anything
- SA-1B Dataset description and annotation strategy
- Related models: GroundingDINO, CLIP, Mask2Former