Segment Anything Model (SAM)
2025-08-07
This is my learning note for Segment Anything Model (SAM).
Main reference: Kirillov et al., "Segment Anything" (arXiv:2304.02643)
1. Introduction
The Segment Anything Model (SAM) represents a paradigm shift in image segmentation, positioning itself as a foundation model analogous to large language models in NLP. Its primary goal is to enable general-purpose, prompt-driven segmentation with zero-shot generalization, meaning it can segment novel objects without additional training.
SAM introduces the concept of promptable segmentation, where segmentation is treated as a task conditioned on flexible prompts — such as points, bounding boxes, or masks. It is trained on a uniquely large dataset (SA-1B), allowing it to generalize across domains and tasks.
2. Core Ideas and Design Principles
2.1 Promptable Segmentation
SAM redefines segmentation as a prompt-conditioned task. The model does not output segmentation maps unconditionally. Instead, it receives a prompt specifying what to segment, and it returns a corresponding mask.
Supported prompt types (see the sketch at the end of this subsection):
- Points: foreground/background clicks
- Bounding boxes
- Masks: optional prior masks for refinement
This formulation enables:
- Interactive segmentation in real time
- Composability for downstream systems
- Zero-shot transfer across segmentation use cases
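As a rough sketch of how these prompt types surface in the official `segment_anything` Python package (the checkpoint filename, image path, and coordinates below are illustrative placeholders):

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM checkpoint (filename is a placeholder for the released ViT-H weights).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# The image encoder runs once here; prompts are applied afterwards.
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Point prompt: one foreground click (label 1 = foreground, 0 = background).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,
)

# Box prompt: [x0, y0, x1, y1] in pixel coordinates.
box_masks, box_scores, _ = predictor.predict(
    box=np.array([425, 600, 700, 875]),
    multimask_output=False,
)
```

Mask prompts are passed the same way via the `mask_input` argument, typically reusing the low-resolution logits returned by a previous call (shown in Section 3.3).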
2.2 Foundation Model Perspective
SAM adopts the foundation model philosophy:
- Train once on a massive, diverse dataset
- Generalize to a wide range of tasks without retraining
- Modular architecture for extensibility
Like GPT for text, SAM is designed to be used as a general segmentation engine across tasks and domains.
2.3 Decoupled Design
SAM separates image and prompt processing:
- The image encoder embeds the image once.
- The prompt encoder transforms user inputs into learned embeddings.
- The mask decoder combines both to predict masks.
This architectural separation allows:
- Fast inference (the image embedding is computed once and reused; see the sketch below)
- Efficient processing of multiple prompts per image
- Modular use in systems with varying interaction requirements
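Concretely, this decoupling is visible in the predictor workflow: the heavy image encoder runs once inside `set_image`, and each additional prompt only exercises the lightweight prompt encoder and mask decoder. A minimal sketch (checkpoint name and click coordinates are placeholders):

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

predictor = SamPredictor(
    sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")  # placeholder checkpoint
)

# A dummy RGB image stands in for real data; set_image performs the single ViT forward pass.
image = (np.random.rand(480, 640, 3) * 255).astype(np.uint8)
predictor.set_image(image)

# Simulated interactive session: each click reuses the cached image embedding,
# so only the small prompt encoder + mask decoder run per prompt.
for x, y in [(120, 80), (340, 260), (510, 430)]:
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[x, y]]),
        point_labels=np.array([1]),
        multimask_output=True,
    )
```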
3. Architecture Overview
SAM comprises three primary components:
| Component | Role |
| --- | --- |
| Image Encoder | Produces a dense embedding of the input image |
| Prompt Encoder | Encodes spatial prompts (points, boxes, masks) |
| Mask Decoder | Combines image and prompt embeddings to produce masks |
3.1 Image Encoder
- Backbone: ViT-Huge (ViT-H) transformer; smaller ViT-L and ViT-B variants are also released
- Input resolution: 1024×1024
- Output: a 64×64 image embedding with 256 channels
The image encoder runs once per image, so every subsequent prompt reuses the cached embedding; this keeps interactive inference fast and scalable.
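A small sketch for checking the embedding shape with the public predictor API (random weights are used because only shapes matter here; `get_image_embedding` is assumed to behave as in the public repo):

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# checkpoint=None builds the model with random weights, which is enough for a shape check.
predictor = SamPredictor(sam_model_registry["vit_b"](checkpoint=None))

# Any RGB image is internally resized/padded to the 1024x1024 input resolution.
predictor.set_image((np.random.rand(768, 1024, 3) * 255).astype(np.uint8))

embedding = predictor.get_image_embedding()
print(embedding.shape)  # expected: torch.Size([1, 256, 64, 64])
```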
3.2 Prompt Encoder
Input: prompts given as spatial coordinates (points, boxes) or as coarse masks
Points and boxes are encoded with positional encodings combined with learned per-prompt-type embeddings; mask prompts are embedded densely via a small convolutional network
Supports:
- Foreground/background points (positive/negative)
- Bounding boxes
- Optional coarse masks (encoded as low-res spatial maps)
The encoded prompts are transformed into tokens that are fed to the decoder along with image features.
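A lower-level sketch of the prompt encoder on its own; the module and argument names follow the public `segment_anything` repo, but exact shapes and signatures may differ across versions, so treat this as an assumption-laden illustration:

```python
import torch
from segment_anything import sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint=None)  # random weights; we only inspect shapes

# One foreground point in the 1024x1024 model input frame: coords (B, N, 2), labels (B, N).
coords = torch.tensor([[[512.0, 384.0]]])
labels = torch.tensor([[1]])

sparse, dense = sam.prompt_encoder(points=(coords, labels), boxes=None, masks=None)
print(sparse.shape)  # sparse prompt tokens, e.g. (1, 2, 256): the point plus a padding token
print(dense.shape)   # dense embedding (1, 256, 64, 64); a learned "no mask" map when masks=None
```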
3.3 Mask Decoder
Architecture: Lightweight Transformer decoder
Inputs:
- Sparse prompt tokens
- Dense image embeddings (from ViT)
Outputs:
- 3 candidate masks per prompt (a single mask can be requested when the prompt is unambiguous)
- A predicted IoU (quality) score for each candidate mask
Each mask corresponds to a possible interpretation of the prompt. This multi-mask output accounts for segmentation ambiguity and supports interactive refinement.
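A hedged sketch of consuming the multi-mask output with the predictor API: pick the candidate with the highest predicted IoU score, and optionally feed its low-resolution logits back as a mask prompt for refinement (checkpoint, image, and coordinates are placeholders):

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

predictor = SamPredictor(sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth"))
predictor.set_image((np.random.rand(480, 640, 3) * 255).astype(np.uint8))  # placeholder image

point = np.array([[320, 240]])
label = np.array([1])

# One ambiguous click -> several candidate masks, each with a predicted quality score.
masks, scores, low_res_logits = predictor.predict(
    point_coords=point, point_labels=label, multimask_output=True
)
best = int(np.argmax(scores))
final_mask = masks[best]  # boolean array of shape (H, W)

# Refinement: reuse the chosen candidate's low-res logits as a mask prompt.
refined, _, _ = predictor.predict(
    point_coords=point,
    point_labels=label,
    mask_input=low_res_logits[best : best + 1],  # shape (1, 256, 256)
    multimask_output=False,
)
```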
3.4 Inference Pipeline
Given an input image and prompt:
Image  → Image Encoder  → Image Embedding
Prompt → Prompt Encoder → Prompt Embedding
(Image Embedding + Prompt Embedding) → Mask Decoder → Candidate Masks + IoU scores
4. Training Methodology
SAM is trained on Segment Anything 1-Billion (SA-1B), the largest known segmentation dataset.
4.1 Dataset: SA-1B
Over 1 billion masks
Spanning 11 million high-resolution images
Labeled via a model-in-the-loop "data engine" with three stages:
- Assisted-manual: annotators correct and refine masks proposed by the model, which is retrained as the data grows
- Semi-automatic: the model pre-fills confident masks; annotators label the remaining objects
- Fully automatic: the final SA-1B masks are generated by prompting the model with a regular grid of points
4.2 Training Objective
- Supervised learning using prompt-mask pairs
- Mask loss: a linear combination of focal loss and dice loss; an auxiliary head predicting each mask's IoU is trained with mean-squared-error loss (sketched below)
- Diverse prompt types included during training
- Random augmentations and prompt sampling used to simulate various use conditions
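To make the objective concrete, here is a hedged PyTorch sketch of the loss terms: focal plus dice loss for the mask (the paper reports a 20:1 focal-to-dice weighting) and a mean-squared-error loss for the predicted IoU. Function names, weights, and tensor shapes are illustrative, not the authors' code:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Sigmoid focal loss averaged over all pixels (targets are 0/1 float masks)."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def dice_loss(logits, targets, eps=1.0):
    """Soft dice loss computed per mask, then averaged over the batch."""
    p = torch.sigmoid(logits).flatten(1)
    t = targets.flatten(1)
    inter = (p * t).sum(-1)
    return (1 - (2 * inter + eps) / (p.sum(-1) + t.sum(-1) + eps)).mean()

def sam_style_loss(pred_logits, gt_masks, pred_iou, focal_weight=20.0):
    """Illustrative combination: focal + dice for the mask, MSE for the IoU head."""
    mask_loss = focal_weight * focal_loss(pred_logits, gt_masks) + dice_loss(pred_logits, gt_masks)
    with torch.no_grad():
        pred_bin = (pred_logits > 0).float()
        inter = (pred_bin * gt_masks).flatten(1).sum(-1)
        union = ((pred_bin + gt_masks) > 0).float().flatten(1).sum(-1)
        actual_iou = inter / union.clamp(min=1.0)  # IoU of predicted vs. ground-truth mask
    iou_loss = F.mse_loss(pred_iou, actual_iou)
    return mask_loss + iou_loss

# Toy usage with random tensors: (B, H, W) mask logits and a (B,) IoU prediction.
logits = torch.randn(2, 256, 256)
gt = (torch.rand(2, 256, 256) > 0.5).float()
print(sam_style_loss(logits, gt, torch.rand(2)))
```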
5. Generalization and Capabilities
SAM demonstrates strong zero-shot generalization to:
- Natural images, indoor/outdoor scenes
- Art, satellite imagery, historical photos
- Synthetic and scientific images (e.g., microscopy)
However, it does not produce class labels — SAM only returns masks, not semantic categories.
Advantages:
- Works with unseen objects
- Supports interactive segmentation
- Can be composed with other models (e.g., CLIP, GroundingDINO); see the sketch after the limitations list
Limitations:
- Coarse masks on fine structures or small objects
- Large model (ViT-H is resource-intensive)
- Requires post-processing for instance tracking or semantic labeling
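Because SAM returns class-agnostic masks, a common post-processing step is to hand each masked region to a separate classifier. The sketch below pairs SAM's boolean masks with a CLIP model from Hugging Face `transformers`; the checkpoint name, helper function, and label set are illustrative assumptions, not part of SAM itself:

```python
import numpy as np
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
candidate_labels = ["a dog", "a cat", "a car", "background clutter"]  # illustrative label set

def label_mask(image_rgb: np.ndarray, mask: np.ndarray) -> str:
    """Crop the mask's bounding box, zero out non-mask pixels, and classify the crop with CLIP."""
    ys, xs = np.where(mask)
    crop = image_rgb[ys.min() : ys.max() + 1, xs.min() : xs.max() + 1].copy()
    crop[~mask[ys.min() : ys.max() + 1, xs.min() : xs.max() + 1]] = 0  # suppress background
    inputs = processor(text=candidate_labels, images=crop, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = clip(**inputs).logits_per_image.softmax(dim=-1)[0]
    return candidate_labels[int(probs.argmax())]
```

Here `mask` is one of the boolean (H, W) masks returned by the predictor, and `image_rgb` is the original image passed to `set_image`.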
6. Theoretical Impact and Innovation
| Innovation | Implication |
| --- | --- |
| Promptable segmentation | Unifies multiple segmentation tasks under one model |
| Decoupled image/prompt encoders | Enables fast embedding reuse and interactivity |
| Ambiguity-aware multi-mask output | Captures uncertainty, supports user correction |
| Large-scale interactive data engine | Makes broad generalization feasible |
| Foundation model for vision | A step toward GPT-like general-purpose vision models |
7. Extensions and Ecosystem
SAM has sparked a wide ecosystem of extensions:
- Grounded-SAM: combines SAM with Grounding DINO for open-vocabulary, text-prompted segmentation
- MedSAM: fine-tuned SAM for medical imaging
- MobileSAM: efficient variant for edge devices
- SAM + DETR/Mask R-CNN: instance-aware adaptations
SAM is increasingly being used as a building block in broader CV pipelines.
8. Summary
Segment Anything Model (SAM) is a landmark foundation model in computer vision, enabling prompt-driven, general-purpose segmentation at scale. Its architecture and training methodology exemplify how prompt-based designs and massive pretraining can lead to strong zero-shot generalization.
SAM is not just a model — it is a framework for interactive, flexible, and extensible segmentation tasks.
Suggested Reading
- Kirillov et al., "Segment Anything" (arXiv:2304.02643)
- Meta AI’s official GitHub repository: https://github.com/facebookresearch/segment-anything
- SA-1B Dataset description and annotation strategy
- Related models: GroundingDINO, CLIP, Mask2Former