Quick Start Guide

This guide will help you run your first segmentation with SAM 3 in just a few minutes. We’ll cover both image and video segmentation with text prompts.
Before starting, make sure you have installed SAM 3 and authenticated with Hugging Face to access the model checkpoints.

Image Segmentation

Let’s start with a simple image segmentation example using a text prompt.
Step 1: Import Dependencies

Import the required modules:
import torch
from PIL import Image
from sam3 import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor
Step 2: Enable GPU Optimizations

Enable TensorFloat-32 and automatic mixed precision for faster inference:
# Enable TF32 for Ampere GPUs (A100, RTX 30/40 series)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Use bfloat16 for automatic mixed precision
torch.autocast("cuda", dtype=torch.bfloat16).__enter__()
TF32 provides a good balance between performance and accuracy on Ampere and newer GPUs.
Step 3: Load the Model

Load the SAM 3 image model and create a processor:
# Build the model (downloads checkpoint on first run)
model = build_sam3_image_model()

# Create processor with default settings
processor = Sam3Processor(model)
The model will be automatically downloaded from Hugging Face on first use. This may take a few minutes depending on your internet connection.
Step 4: Load Your Image

Load an image using PIL:
# Load your image
image = Image.open("path/to/your/image.jpg")

# Set the image in the processor
inference_state = processor.set_image(image)
Step 5: Prompt with Text

Use a text prompt to segment objects:
# Segment using a text description
output = processor.set_text_prompt(
    state=inference_state,
    prompt="a person wearing a red shirt"
)

# Extract results
masks = output["masks"]     # Segmentation masks
boxes = output["boxes"]     # Bounding boxes
scores = output["scores"]   # Confidence scores
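Assuming masks, boxes, and scores are index-aligned (one entry per detected object), you can filter all three by confidence together. A minimal sketch (filter_by_score is a hypothetical helper, not part of the SAM 3 API):

```python
def filter_by_score(masks, boxes, scores, threshold=0.5):
    # Keep only detections whose confidence exceeds the threshold,
    # preserving the mask/box/score alignment.
    keep = [i for i, s in enumerate(scores) if s > threshold]
    return (
        [masks[i] for i in keep],
        [boxes[i] for i in keep],
        [scores[i] for i in keep],
    )
```

This assumes each score is a Python float or a 0-dim tensor that supports `>` comparison.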
Step 6: Visualize Results

Display the segmentation results:
import matplotlib.pyplot as plt
import numpy as np

# Show the original image
plt.figure(figsize=(10, 10))
plt.imshow(image)

# Overlay masks
for mask, score in zip(masks, scores):
    if score > 0.5:  # Filter by confidence
        # Convert mask to numpy and show as overlay
        mask_np = mask.cpu().numpy()
        plt.imshow(mask_np, alpha=0.5, cmap='jet')

plt.axis('off')
plt.show()
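To save masks to disk rather than just display them, you can first binarize each mask into an 8-bit array. A sketch (binarize_mask is a hypothetical helper; it assumes the mask has already been moved to NumPy via mask.cpu().numpy(), as in the loop above):

```python
import numpy as np

def binarize_mask(mask_np, threshold=0.5):
    # Threshold a float (H, W) mask into a 0/255 uint8 array, ready for
    # PIL: Image.fromarray(binarize_mask(m), mode="L").save("mask.png")
    return (np.asarray(mask_np, dtype=np.float32) > threshold).astype(np.uint8) * 255
```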

Complete Image Example

Here’s the complete code for image segmentation:
import torch
from PIL import Image
from sam3 import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor

# Enable optimizations
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
torch.autocast("cuda", dtype=torch.bfloat16).__enter__()

# Load model and processor
model = build_sam3_image_model()
processor = Sam3Processor(model)

# Load and process image
image = Image.open("your_image.jpg")
inference_state = processor.set_image(image)

# Run segmentation with text prompt
output = processor.set_text_prompt(
    state=inference_state,
    prompt="a cat"
)

# Get results
masks, boxes, scores = output["masks"], output["boxes"], output["scores"]
print(f"Found {len(masks)} objects with average score {scores.mean():.2f}")

Video Segmentation

SAM 3 also supports video segmentation with temporal tracking.
Step 1: Import the Video Predictor

Import the video predictor:
from sam3 import build_sam3_video_predictor
Step 2: Build the Video Predictor

Create the video predictor:
video_predictor = build_sam3_video_predictor()
Step 3: Start a Session

Start a video segmentation session:
# Path to video (MP4 file or directory of JPEG frames)
video_path = "path/to/your/video.mp4"

# Start session
response = video_predictor.handle_request(
    request=dict(
        type="start_session",
        resource_path=video_path,
    )
)

session_id = response["session_id"]
Step 4: Add a Text Prompt

Add a text prompt to segment objects across frames:
# Add prompt on a specific frame
response = video_predictor.handle_request(
    request=dict(
        type="add_prompt",
        session_id=session_id,
        frame_index=0,  # Frame to prompt on
        text="a dog running",
    )
)

# Get segmentation for all frames
output = response["outputs"]

Complete Video Example

from sam3 import build_sam3_video_predictor

# Build predictor
video_predictor = build_sam3_video_predictor()

# Start session
response = video_predictor.handle_request(
    request=dict(
        type="start_session",
        resource_path="video.mp4",
    )
)

# Add text prompt
response = video_predictor.handle_request(
    request=dict(
        type="add_prompt",
        session_id=response["session_id"],
        frame_index=0,
        text="person in blue jacket",
    )
)

# Get tracked segments across all frames
output = response["outputs"]
print(f"Segmented {len(output)} frames")

Adding Geometric Prompts

You can also use geometric prompts (boxes, points) in addition to or instead of text.

Box Prompts

# Define a box in [x, y, width, height] format (normalized 0-1)
box = [0.3, 0.3, 0.2, 0.4]  # x, y, width, height

# Add positive box prompt
output = processor.add_geometric_prompt(
    box=box,
    label=True,  # True for positive, False for negative
    state=inference_state
)
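The box above is given in normalized 0-1 coordinates. If you have a box in pixels, normalize it by the image size first. A sketch (pixel_box_to_normalized is a hypothetical helper, not part of the SAM 3 API):

```python
def pixel_box_to_normalized(box_px, img_w, img_h):
    # Convert a pixel-space [x, y, width, height] box into the
    # normalized 0-1 [x, y, width, height] form used above.
    x, y, w, h = box_px
    return [x / img_w, y / img_h, w / img_w, h / img_h]
```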

Combining Text and Geometric Prompts

# First set text prompt
output = processor.set_text_prompt(
    state=inference_state,
    prompt="a car"
)

# Then refine with a box prompt
box = [0.5, 0.5, 0.3, 0.2]
output = processor.add_geometric_prompt(
    box=box,
    label=True,
    state=inference_state
)

Batch Processing

Process multiple images efficiently in batch:
import PIL.Image

# Load multiple images
images = [
    PIL.Image.open("image1.jpg"),
    PIL.Image.open("image2.jpg"),
    PIL.Image.open("image3.jpg"),
]

# Set image batch
inference_state = processor.set_image_batch(images)

# Run batch inference with text prompts
prompts = ["a dog", "a cat", "a bird"]
outputs = processor.set_text_prompt_batch(
    prompts=prompts,
    state=inference_state
)
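When building the image list, it can help to force every image into RGB, since some JPEGs decode as grayscale or CMYK (load_images is a hypothetical convenience, not part of SAM 3):

```python
import PIL.Image

def load_images(paths):
    # Open each file and normalize it to 3-channel RGB.
    return [PIL.Image.open(p).convert("RGB") for p in paths]
```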

Configuration Options

Customize the processor for your use case:
# Custom processor settings
processor = Sam3Processor(
    model=model,
    resolution=1008,              # Input resolution (default: 1008)
    device="cuda",                # Device (cuda/cpu)
    confidence_threshold=0.5      # Minimum confidence score
)
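The device setting shown above can be chosen at runtime so the same script runs on machines with or without a GPU, using the standard PyTorch availability check:

```python
import torch

# Fall back to CPU when no CUDA device is present.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical usage with the settings above:
# processor = Sam3Processor(model=model, device=device)
```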

Tips for Best Results

Writing text prompts:
  • Be specific: “a person in a red jacket” works better than “person”
  • Use descriptive attributes: colors, positions, actions
  • To segment multiple objects, try “all dogs in the image” or simply “dogs”
  • Prompts with no matches return empty results

Speeding up inference:
  • Use batch processing for multiple images
  • Enable TF32 and mixed precision (bfloat16)
  • Lower the resolution (at some cost in quality)
  • Wrap inference in torch.inference_mode() or torch.no_grad()

If results look wrong or empty:
  • Adjust confidence_threshold in the processor settings
  • Try a more specific text prompt
  • Combine text with geometric prompts for better accuracy
  • Some objects may genuinely not be present in the image

Working with video:
  • Start prompts on frames where objects are clearly visible
  • Add prompts on multiple frames for better tracking
  • The input can be an MP4 file or a directory of JPEG frames
  • Sessions let you process multiple videos independently

Next Steps

Now that you’ve run your first segmentation, explore more advanced features:

  • API Reference: detailed API documentation for all SAM 3 components
  • Batched Inference: process multiple images efficiently
  • Video Tracking: a deep dive into video segmentation and tracking
  • Interactive Refinement: refine segmentations with points and boxes