SAM
2025-08-08
These are my notes on the Segment Anything Model (SAM).
Reference: Kirillov et al., Segment Anything, arXiv:2304.02643
1. Introduction
The Segment Anything Model (SAM) is an important milestone in image segmentation. It was designed as a general-purpose foundation model for prompt-based image segmentation, with strong zero-shot generalization. In other words, it aims to segment previously unseen objects without task-specific retraining.
SAM introduces a key idea: promptable segmentation. Instead of directly outputting a full segmentation map for an image, the model receives a prompt, such as a click, a box, or an initial mask, and returns the mask corresponding to that prompt.
The model is trained on the large-scale SA-1B dataset (11 million images with over 1 billion masks), which is the main reason it generalizes across domains and downstream tasks.
2. Core ideas and design principles
2.1 Promptable segmentation
SAM reformulates segmentation as a conditional task.
Supported prompt types include:
- points: foreground or background clicks
- boxes
- masks: optional coarse masks
This design brings several practical advantages:
- it supports interactive segmentation;
- it is highly composable and easy to integrate into larger systems;
- it enables strong zero-shot transfer.
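To make the three prompt types concrete, here is a minimal sketch of how such prompts might be represented in code. The `Prompt` container and its field names are hypothetical illustrations, not SAM's actual API.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical container for the prompt types listed above.
@dataclass
class Prompt:
    # (x, y, label) clicks; label 1 = foreground, 0 = background
    points: List[Tuple[float, float, int]] = field(default_factory=list)
    # (x0, y0, x1, y1) bounding box, if given
    box: Optional[Tuple[float, float, float, float]] = None
    # optional coarse mask, e.g. a low-resolution 2D array
    mask: Optional[list] = None

# A click-only prompt: one foreground click plus one background click.
p = Prompt(points=[(120.0, 80.0, 1), (30.0, 200.0, 0)])
```

Because every prompt type collapses into one structure, the same model interface can serve clicks, boxes, and coarse masks interchangeably.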
2.2 Foundation-model perspective
SAM borrows the training philosophy behind models such as BERT and GPT:
- pretrain once on large and diverse data;
- generalize to many tasks without fine-tuning;
- keep the system modular and extensible.
In that sense, SAM is positioned as a general segmentation engine, not as a single-task model.
2.3 Decoupled design
SAM deliberately separates image processing from prompt processing:
- the image encoder extracts reusable image features;
- the prompt encoder handles the input prompt;
- the mask decoder fuses both sources of information and produces the output mask.
This decoupling matters because image features can be reused across many prompts, making interactive use much more efficient.
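The call pattern this decoupling enables can be sketched with stand-in functions (these stubs are illustrative, not SAM's real modules): the expensive image encoder runs once, and every subsequent prompt only pays for the cheap prompt encoder and mask decoder.

```python
# Stand-ins for SAM's three modules; the counter shows that the
# image encoder is invoked only once for many prompts.
encoder_calls = 0

def image_encoder(image):
    global encoder_calls
    encoder_calls += 1                      # expensive: runs once per image
    return ("features-of", image)           # stand-in for a dense feature grid

def prompt_encoder(prompt):
    return ("embedding-of", prompt)         # stand-in for sparse prompt tokens

def mask_decoder(image_features, prompt_embedding):
    return ("mask", image_features, prompt_embedding)

features = image_encoder("my_image.png")    # run once, then cache
masks = [mask_decoder(features, prompt_encoder(p))
         for p in ["click A", "click B", "box C"]]  # many prompts, no re-encoding
```

In an interactive tool, this is the difference between re-running a heavy ViT on every click and responding to each click in milliseconds.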
3. Model architecture
SAM consists of three major modules:
| Module | Role |
|---|---|
| Image Encoder | Produces dense image features |
| Prompt Encoder | Encodes points, boxes, masks, and other spatial prompts |
| Mask Decoder | Combines image and prompt features to generate masks |
3.1 Image encoder
SAM uses an MAE-pretrained ViT-Huge (ViT-H) as its image backbone.
- input resolution: 1024 x 1024
- output: a 64 x 64 grid of patch features
The image encoder only needs to run once per image, which is crucial for interactive scenarios.
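The shapes above follow from simple patch arithmetic, assuming the standard 16 x 16 ViT patch size used in SAM:

```python
# Patch-grid arithmetic for SAM's image encoder
# (assumes the standard 16x16 ViT patch size).
input_size = 1024
patch_size = 16

grid = input_size // patch_size   # patches per side
tokens = grid * grid              # patch tokens per image
```

A 1024 x 1024 input divided into 16-pixel patches yields the 64 x 64 feature grid, i.e. 4096 patch tokens, which is exactly what the decoder consumes.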
3.2 Prompt encoder
The prompt encoder converts user prompts into embeddings. It supports:
- positive and negative points;
- bounding boxes;
- low-resolution mask inputs.
The main ingredients are positional encoding and learnable prompt embeddings.
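A minimal sketch of Fourier-style positional encoding for a 2D click, in the spirit of SAM's random positional features. The matrix size, scale, and function name here are illustrative choices, not SAM's exact implementation:

```python
import numpy as np

def encode_point(x, y, img_w, img_h, freq_matrix):
    """Map a pixel coordinate to a Fourier positional embedding (sketch).

    Coordinates are normalized to [0, 1], projected through a random
    frequency matrix, then passed through sin/cos."""
    coords = np.array([x / img_w, y / img_h])            # normalize to [0, 1]
    proj = 2 * np.pi * coords @ freq_matrix              # (2,) @ (2, d) -> (d,)
    return np.concatenate([np.sin(proj), np.cos(proj)])  # (2d,)

rng = np.random.default_rng(0)
freqs = rng.normal(size=(2, 64))   # illustrative: 64 random frequencies
emb = encode_point(512, 256, 1024, 1024, freqs)
```

In SAM, such positional features are then combined with learnable embeddings that mark the prompt's type (e.g. foreground vs. background point).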
3.3 Mask decoder
The mask decoder is a lightweight Transformer-style decoder.
Inputs:
- dense image features;
- sparse prompt tokens.
Outputs:
- three candidate masks;
- a quality score for each mask.
Producing multiple candidate masks is useful because user prompts can be ambiguous.
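A simple disambiguation rule is to keep the candidate with the highest predicted quality score. The masks and scores below are made-up placeholders for illustration:

```python
# Three candidate masks with their predicted quality scores;
# a click on an object could plausibly mean the whole object,
# a part, or a subpart.
candidates = [
    ("mask_whole_object", 0.91),
    ("mask_part",         0.74),
    ("mask_subpart",      0.55),
]

# Keep the highest-scoring candidate.
best_mask, best_score = max(candidates, key=lambda m: m[1])
```

An interactive UI can instead surface all three candidates and let the user pick, which is exactly why the decoder exposes the scores rather than a single mask.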
4. Inference flow
Given an image and a prompt:
- the image encoder extracts image features;
- the prompt encoder turns the prompt into prompt embeddings;
- the mask decoder combines both and outputs candidate masks;
- the user or downstream system picks the most suitable result.
5. Final remarks
My main takeaway is that SAM succeeds not only because of scale, but also because of problem formulation:
Instead of asking the model to solve one fixed segmentation task, SAM asks it to respond to prompts.
That shift makes segmentation more interactive, more modular, and more reusable.
