This guide will help you run your first segmentation with SAM 3 in just a few minutes. We’ll cover both image and video segmentation with text prompts.
Before starting, make sure you have installed SAM 3 and authenticated with Hugging Face to access the model checkpoints.
Enable TensorFloat-32 and automatic mixed precision for faster inference:
import torch

# Enable TF32 for Ampere and newer GPUs (A100, RTX 30/40 series)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Use bfloat16 for automatic mixed precision
# (entered without a matching exit, so it stays active for the rest of the session)
torch.autocast("cuda", dtype=torch.bfloat16).__enter__()
TF32 provides a good balance between performance and accuracy on Ampere and newer GPUs.
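Because TF32 only takes effect on Ampere and newer GPUs, it can be useful to guard the flags behind a capability check. The following is a minimal sketch, not part of SAM 3: the helper names `supports_tf32` and `enable_tf32_if_supported` are hypothetical, and it assumes Ampere corresponds to CUDA compute capability 8.0 or higher.

```python
def supports_tf32(capability):
    """Return True when a CUDA compute capability is Ampere (8.x) or newer."""
    major, _minor = capability
    return major >= 8


def enable_tf32_if_supported():
    """Turn on the TF32 flags only when the current GPU can use them.

    Hypothetical helper for illustration; returns True if TF32 was enabled.
    """
    import torch  # imported here so supports_tf32 stays dependency-free

    if not torch.cuda.is_available():
        return False
    if not supports_tf32(torch.cuda.get_device_capability()):
        return False
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
    return True
```

On older GPUs or CPU-only machines `enable_tf32_if_supported()` simply returns `False`, so the same script can run unchanged across machines.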
3. Load the Model
Load the SAM 3 image model and create a processor:
# Build the model (downloads checkpoint on first run)
model = build_sam3_image_model()

# Create processor with default settings
processor = Sam3Processor(model)
The model will be automatically downloaded from Hugging Face on first use. This may take a few minutes depending on your internet connection.
4. Load Your Image
Load an image using PIL:
from PIL import Image

# Load your image
image = Image.open("path/to/your/image.jpg")

# Set the image in the processor
inference_state = processor.set_image(image)
5. Prompt with Text
Use a text prompt to segment objects:
# Segment using a text description
output = processor.set_text_prompt(
    state=inference_state,
    prompt="a person wearing a red shirt"
)

# Extract results
masks = output["masks"]    # Segmentation masks
boxes = output["boxes"]    # Bounding boxes
scores = output["scores"]  # Confidence scores
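A text prompt can match several objects, so it often helps to keep only the confident detections and rank them. The following is a minimal sketch rather than part of the SAM 3 API: `filter_detections` is a hypothetical helper, and it assumes `boxes` and `scores` are index-aligned sequences (Python lists or tensors) with one entry per detected object. The example values are stand-ins, not real model output.

```python
def filter_detections(boxes, scores, threshold=0.5):
    """Return (index, score, box) tuples above the threshold, best first."""
    kept = []
    for index, (box, score) in enumerate(zip(boxes, scores)):
        value = float(score)  # works for Python floats and single-element tensors
        if value > threshold:
            kept.append((index, value, [float(v) for v in box]))
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept


# Stand-in values for illustration only
example_boxes = [[10, 20, 110, 220], [5, 5, 50, 50], [30, 40, 90, 160]]
example_scores = [0.62, 0.31, 0.93]

for index, score, box in filter_detections(example_boxes, example_scores):
    print(f"object {index}: score={score:.2f}, box={box}")
```

The returned index can be used to look up the matching entry in `masks`.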
6. Visualize Results
Display the segmentation results:
import matplotlib.pyplot as plt
import numpy as np

# Show the original image
plt.figure(figsize=(10, 10))
plt.imshow(image)

# Overlay masks
for mask, score in zip(masks, scores):
    if score > 0.5:  # Filter by confidence
        # Convert mask to numpy, dropping any singleton dimension, and show as overlay
        mask_np = np.squeeze(mask.cpu().numpy())
        plt.imshow(mask_np, alpha=0.5, cmap='jet')

plt.axis('off')
plt.show()
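Drawing a whole mask array through a colormap also tints the pixels outside the object. One alternative is an RGBA overlay that is opaque only where the mask is set. The following is a minimal sketch with NumPy: `mask_to_rgba` is a hypothetical helper, and it assumes each mask is binary and can be squeezed to a 2D `(H, W)` array. The demo mask is a stand-in, not model output.

```python
import numpy as np


def mask_to_rgba(mask, color=(1.0, 0.0, 0.0), alpha=0.5):
    """Build an (H, W, 4) float overlay that is transparent outside the mask."""
    mask_2d = np.squeeze(np.asarray(mask)).astype(bool)
    if mask_2d.ndim != 2:
        raise ValueError(f"expected a 2D mask, got shape {mask_2d.shape}")
    overlay = np.zeros(mask_2d.shape + (4,), dtype=np.float32)
    overlay[mask_2d, :3] = color  # RGB for pixels inside the mask
    overlay[mask_2d, 3] = alpha   # alpha stays 0 everywhere else
    return overlay


# Stand-in mask: a 4x4 image with a 2x2 object in the middle
demo_mask = np.zeros((1, 4, 4), dtype=bool)
demo_mask[0, 1:3, 1:3] = True
demo_overlay = mask_to_rgba(demo_mask)
print(demo_overlay.shape)
```

After `plt.imshow(image)`, a call such as `plt.imshow(mask_to_rgba(mask.cpu().numpy()))` would draw the object in a flat color while leaving the rest of the image untouched.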
SAM 3 also supports video segmentation with temporal tracking.
1. Import Video Predictor
Import the video predictor:
from sam3 import build_sam3_video_predictor
2. Build Video Predictor
Create the video predictor:
video_predictor = build_sam3_video_predictor()
3. Start a Session
Start a video segmentation session:
# Path to video (MP4 file or directory of JPEG frames)
video_path = "path/to/your/video.mp4"

# Start session
response = video_predictor.handle_request(
    request=dict(
        type="start_session",
        resource_path=video_path,
    )
)
session_id = response["session_id"]
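Since the session accepts either an MP4 file or a directory of JPEG frames, checking the path before starting a session can surface typos early. The following is a minimal sketch using only the standard library: `describe_video_resource` is a hypothetical helper, not part of SAM 3, and the accepted suffixes (`.mp4`, `.jpg`, `.jpeg`) are assumptions based on the description above.

```python
from pathlib import Path

JPEG_SUFFIXES = {".jpg", ".jpeg"}


def describe_video_resource(path):
    """Classify a path as 'mp4' or 'jpeg_dir', or raise ValueError."""
    resource = Path(path)
    if resource.is_file() and resource.suffix.lower() == ".mp4":
        return "mp4"
    if resource.is_dir():
        has_frames = any(
            child.suffix.lower() in JPEG_SUFFIXES for child in resource.iterdir()
        )
        if has_frames:
            return "jpeg_dir"
    raise ValueError(
        f"{str(path)!r} is neither an MP4 file nor a directory of JPEG frames"
    )
```

Calling `describe_video_resource(video_path)` before the `start_session` request fails fast with a readable message instead of an error from inside the predictor.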
4. Add Text Prompt
Add a text prompt to segment objects across frames:
# Add prompt on a specific frame
response = video_predictor.handle_request(
    request=dict(
        type="add_prompt",
        session_id=session_id,
        frame_index=0,  # Frame to prompt on
        text="a dog running",
    )
)

# Get segmentation for all frames
output = response["outputs"]
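Both video steps send a plain dictionary to `handle_request`, so the request shapes can be wrapped in small builder functions that validate their inputs. The following is a minimal sketch: the function names are hypothetical, and only the dictionary keys shown in the steps above (`type`, `resource_path`, `session_id`, `frame_index`, `text`) are used.

```python
def start_session_request(resource_path):
    """Build the request dict for starting a video session."""
    return dict(type="start_session", resource_path=str(resource_path))


def add_text_prompt_request(session_id, text, frame_index=0):
    """Build the request dict for adding a text prompt on one frame."""
    if frame_index < 0:
        raise ValueError("frame_index must be non-negative")
    if not text.strip():
        raise ValueError("text prompt must not be empty")
    return dict(
        type="add_prompt",
        session_id=session_id,
        frame_index=frame_index,
        text=text,
    )


# Stand-in session id for illustration only
print(add_text_prompt_request("example-session", "a dog running"))
```

A call then reads `video_predictor.handle_request(request=add_text_prompt_request(session_id, "a dog running"))`.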
On images, a text prompt can also be refined with a box prompt:
# First set text prompt
output = processor.set_text_prompt(
    state=inference_state,
    prompt="a car"
)

# Then refine with a box prompt
box = [0.5, 0.5, 0.3, 0.2]
output = processor.add_geometric_prompt(
    box=box,
    label=True,
    state=inference_state
)
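The box in the example holds four values between 0 and 1. This guide does not state the box format, so the following sketch rests on an assumption: that the values are normalized center-x, center-y, width, and height. Under that assumption, a pixel-space corner box could be converted as below. `pixel_xyxy_to_normalized_cxcywh` is a hypothetical helper, not a SAM 3 function.

```python
def pixel_xyxy_to_normalized_cxcywh(box_xyxy, image_width, image_height):
    """Convert (x0, y0, x1, y1) in pixels to normalized [cx, cy, w, h].

    Assumes the target format is center/size relative to the image dimensions.
    """
    x0, y0, x1, y1 = box_xyxy
    if x1 <= x0 or y1 <= y0:
        raise ValueError("box must have positive width and height")
    cx = (x0 + x1) / 2 / image_width
    cy = (y0 + y1) / 2 / image_height
    w = (x1 - x0) / image_width
    h = (y1 - y0) / image_height
    return [cx, cy, w, h]


# Stand-in box on a 1000x1000 image
print(pixel_xyxy_to_normalized_cxcywh((350, 400, 650, 600), 1000, 1000))
```

With a PIL image, the dimensions are available as `image_width, image_height = image.size`.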