SAM 3 Agent enables complex segmentation queries by integrating Multi-modal Large Language Models (MLLMs) as a reasoning layer. The MLLM breaks down complex prompts into simpler queries that SAM 3 can process.
What is SAM 3 Agent?
SAM 3 Agent allows you to use natural, complex language to describe objects:
❌ Simple: “person”, “blue vest”
✅ Complex: “the leftmost child wearing blue vest”
✅ Relational: “the person standing behind the dog”
✅ Descriptive: “the tallest building in the background”
The agent workflow:
1. The MLLM analyzes the image and your complex query
2. The MLLM generates simpler prompts for SAM 3 (text/box)
3. SAM 3 performs the actual segmentation
4. Results are returned with visual overlays
Setup
Configure PyTorch
import torch
# Turn on tfloat32 for Ampere GPUs
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
# Use bfloat16 for the entire notebook
torch.autocast("cuda", dtype=torch.bfloat16).__enter__()
# Inference mode for the whole notebook
torch.inference_mode().__enter__()
Build SAM 3 model
import os
import sam3
from sam3 import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor
sam3_root = os.path.dirname(sam3.__file__)
bpe_path = f"{sam3_root}/assets/bpe_simple_vocab_16e6.txt.gz"
model = build_sam3_image_model(bpe_path=bpe_path)
processor = Sam3Processor(model, confidence_threshold=0.5)
MLLM Configuration
SAM 3 Agent supports various MLLMs. You can use either:
vLLM-served models (self-hosted)
External APIs (Gemini, GPT, Claude, etc.)
Option 1: vLLM (Self-Hosted)
Configuration
LLM_CONFIGS = {
    "qwen3_vl_8b_thinking": {
        "provider": "vllm",
        "model": "Qwen/Qwen3-VL-8B-Thinking",
    },
}
# Note: this rebinds `model` to the MLLM name; the SAM 3 model built earlier
# stays reachable through `processor`.
model = "qwen3_vl_8b_thinking"
LLM_API_KEY = "DUMMY_API_KEY"  # Not used for vLLM
LLM_SERVER_URL = "http://0.0.0.0:8001/v1"
llm_config = LLM_CONFIGS[model]
llm_config["api_key"] = LLM_API_KEY
llm_config["name"] = model
Installation
Install vLLM in a separate conda environment:
conda create -n vllm python=3.12
conda activate vllm
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu128
Start Server
Launch the vLLM server:
vllm serve Qwen/Qwen3-VL-8B-Thinking \
    --tensor-parallel-size 4 \
    --allowed-local-media-path / \
    --enforce-eager \
    --port 8001
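Before pointing the agent at the server, it can help to confirm that the expected model is actually being served. The sketch below queries the model-listing route of an OpenAI-compatible server using only the standard library. The helper names `models_endpoint` and `list_served_models` are hypothetical and not part of SAM 3; the sketch assumes the server URL ends in `/v1`, as in the configuration above.

```python
import json
import urllib.request


def models_endpoint(server_url: str) -> str:
    """Return the model-listing URL for an OpenAI-compatible base URL."""
    return server_url.rstrip("/") + "/models"


def list_served_models(server_url: str, timeout: float = 5.0) -> list:
    """Return the model ids the server reports, or [] if it cannot be reached."""
    try:
        with urllib.request.urlopen(models_endpoint(server_url), timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # OSError covers refused connections, timeouts, and HTTP errors;
        # ValueError covers a response body that is not valid JSON.
        return []
    if not isinstance(payload, dict):
        return []
    return [entry["id"] for entry in payload.get("data", []) if "id" in entry]


if __name__ == "__main__":
    served = list_served_models("http://0.0.0.0:8001/v1")
    print("served models:", served or "none (server unreachable?)")
```

An empty list usually means the server is still loading weights or is listening on a different port.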
Option 2: External API
LLM_CONFIGS = {
    "gemini_flash": {
        "provider": "google",
        "model": "gemini-2.0-flash-exp",
        "base_url": "https://generativelanguage.googleapis.com/v1beta/",
    },
}
model = "gemini_flash"
LLM_API_KEY = "your-api-key-here"  # Set your actual API key
LLM_SERVER_URL = LLM_CONFIGS[model]["base_url"]
llm_config = LLM_CONFIGS[model]
llm_config["api_key"] = LLM_API_KEY
llm_config["name"] = model
Never commit API keys to version control. Use environment variables:
import os
LLM_API_KEY = os.getenv("GEMINI_API_KEY")
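The two options differ only in where the server URL and API key come from, so the setup can be folded into one helper. The following is a minimal sketch under that assumption: `prepare_llm_config` is a hypothetical name, the default environment variable `LLM_API_KEY` is illustrative, and the function fails early when a key is missing instead of passing `None` to the API client.

```python
import os


def prepare_llm_config(name, configs, api_key_env="LLM_API_KEY",
                       vllm_url="http://0.0.0.0:8001/v1"):
    """Return (llm_config, server_url) for one entry of an LLM_CONFIGS-style dict."""
    config = dict(configs[name])  # copy so the shared table is not mutated
    config["name"] = name
    if config["provider"] == "vllm":
        # A local vLLM server does not use the key, so a placeholder is enough.
        config["api_key"] = "DUMMY_API_KEY"
        return config, vllm_url
    api_key = os.getenv(api_key_env)
    if not api_key:
        raise RuntimeError(
            f"Set the {api_key_env} environment variable before using {name!r}"
        )
    config["api_key"] = api_key
    return config, config["base_url"]
```

With the Gemini entry above, `llm_config, LLM_SERVER_URL = prepare_llm_config("gemini_flash", LLM_CONFIGS, api_key_env="GEMINI_API_KEY")` would produce the same two values as the manual setup.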
Running Agent Inference
from functools import partial
from sam3.agent.client_llm import send_generate_request as send_generate_request_orig
from sam3.agent.client_sam3 import call_sam_service as call_sam_service_orig
from sam3.agent.inference import run_single_image_inference
# Prepare input
image = "assets/images/test_image.jpg"
prompt = "the leftmost child wearing blue vest"
image = os.path.abspath(image)
# Create service clients
send_generate_request = partial(
    send_generate_request_orig,
    server_url=LLM_SERVER_URL,
    model=llm_config["model"],
    api_key=llm_config["api_key"],
)
call_sam_service = partial(call_sam_service_orig, sam3_processor=processor)
# Run inference
output_image_path = run_single_image_inference(
    image,
    prompt,
    llm_config,
    send_generate_request,
    call_sam_service,
    debug=True,
    output_dir="agent_output",
)
# Display result
if output_image_path is not None:
    from IPython.display import display, Image
    display(Image(filename=output_image_path))
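To try several queries against the same image, the single call above can be wrapped in a small loop. The sketch below is deliberately generic: `run_prompts` is a hypothetical helper that takes any one-argument callable, so it does not depend on SAM 3 itself, and it records `None` for a query that raises instead of aborting the whole batch.

```python
from typing import Callable, Dict, Iterable, Optional


def run_prompts(
    run_one: Callable[[str], Optional[str]],
    prompts: Iterable[str],
) -> Dict[str, Optional[str]]:
    """Run run_one on every prompt and map each prompt to its output path (or None)."""
    results: Dict[str, Optional[str]] = {}
    for prompt in prompts:
        try:
            results[prompt] = run_one(prompt)
        except Exception as exc:  # keep going if one query fails
            print(f"[skip] {prompt!r}: {exc}")
            results[prompt] = None
    return results


# Usage with the names defined above (not executed here):
# run_one = lambda p: run_single_image_inference(
#     image, p, llm_config, send_generate_request, call_sam_service,
#     debug=False, output_dir="agent_output",
# )
# paths = run_prompts(run_one, ["the leftmost child wearing blue vest",
#                               "the rightmost person"])
```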
How It Works
Query Understanding
The MLLM analyzes your complex prompt:
Identifies spatial relationships (“leftmost”, “behind”)
Extracts visual attributes (“blue vest”, “wearing”)
Understands context and object relationships
Prompt Decomposition
The MLLM generates structured prompts for SAM 3:
{
    "text_prompts": ["child", "blue vest"],
    "spatial_filter": "leftmost",
    "relationship": "wearing"
}
SAM 3 Segmentation
SAM 3 processes the simplified prompts:
Segments all children in the image
Segments all blue vests
Returns candidates with confidence scores
Result Filtering
The MLLM filters and ranks results:
Applies spatial constraints (“leftmost”)
Verifies relationships (“wearing”)
Returns the best match
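To make the four stages concrete, here is a toy geometric version of the last one for the query "the leftmost child wearing blue vest". It is an illustration only, not the agent's code: in the agent the MLLM makes these judgments itself, whereas this sketch assumes candidates arrive as pixel boxes and approximates "wearing" as "the vest's center lies inside the child's box". All names (`Candidate`, `leftmost_wearing`, and so on) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class Candidate:
    label: str
    box: Box
    score: float


def center(box: Box) -> Tuple[float, float]:
    x0, y0, x1, y1 = box
    return (x0 + x1) / 2, (y0 + y1) / 2


def contains_point(box: Box, point: Tuple[float, float]) -> bool:
    x0, y0, x1, y1 = box
    px, py = point
    return x0 <= px <= x1 and y0 <= py <= y1


def leftmost_wearing(children: List[Candidate],
                     vests: List[Candidate]) -> Optional[Candidate]:
    """Pick the leftmost child whose box contains the center of some vest."""
    wearing = [
        child for child in children
        if any(contains_point(child.box, center(vest.box)) for vest in vests)
    ]
    # "Leftmost" is taken as the smallest horizontal box center.
    return min(wearing, key=lambda c: center(c.box)[0], default=None)


children = [
    Candidate("child", (10, 50, 110, 300), 0.90),   # no vest
    Candidate("child", (200, 40, 300, 310), 0.80),  # wears a vest
    Candidate("child", (400, 60, 500, 300), 0.85),  # wears a vest
]
vests = [
    Candidate("blue vest", (210, 100, 290, 200), 0.70),
    Candidate("blue vest", (410, 110, 490, 210), 0.75),
]
print(leftmost_wearing(children, vests))
```

Note that the child at the far left is rejected because no vest overlaps it, which is why the relationship check has to run before the spatial filter.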
Debugging Output
Enable debug mode to see the agent’s reasoning:
output_image_path = run_single_image_inference(
    image, prompt, llm_config,
    send_generate_request, call_sam_service,
    debug=True,  # Enable debug output
    output_dir="agent_output",
)
Debug output shows:
MLLM’s interpretation of your query
Generated SAM 3 prompts
Intermediate segmentation results
Final filtering decisions
Example Queries
Spatial Relations
# Directional
"the rightmost person"
"the object in the top-left corner"
"the car furthest from the camera"
# Positional
"the person standing behind the table"
"the object between the two chairs"
"the animal closest to the door"
Visual Attributes
# Color and pattern
"the person wearing a red striped shirt"
"the blue car with white stripes"
# Size and shape
"the tallest building"
"the smallest cat"
"the round object on the table"
Actions and States
# Actions
"the person holding a phone"
"the dog running towards the camera"
# States
"the open door"
"the lit lamp"
"the broken window"
Complex Combinations
"the tallest person wearing a blue shirt standing on the left"
"the second car from the right with red tail lights"
"the child in the middle holding a yellow ball"
Supported MLLMs
Tested models (add your own to LLM_CONFIGS):
| Provider | Model | Best For |
| --- | --- | --- |
| vLLM | Qwen/Qwen3-VL-8B-Thinking | Self-hosted, good reasoning |
| Google | gemini-2.0-flash-exp | Fast, API-based |
| OpenAI | gpt-4-vision-preview | High accuracy |
| Anthropic | claude-3-opus-20240229 | Complex reasoning |
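Adding a model means adding one more entry to `LLM_CONFIGS`. The sketch below shows what such an entry might look like and checks it for the keys the snippets on this page read. The `"openai"` provider label and the `base_url` value are assumptions for illustration, as is the helper `missing_fields`; check the SAM 3 agent client for the fields it actually expects.

```python
LLM_CONFIGS = {
    "qwen3_vl_8b_thinking": {
        "provider": "vllm",
        "model": "Qwen/Qwen3-VL-8B-Thinking",
    },
    # Hypothetical entry for a model from the table above.
    "gpt4_vision": {
        "provider": "openai",  # assumed label, not confirmed by the source
        "model": "gpt-4-vision-preview",
        "base_url": "https://api.openai.com/v1",  # assumed endpoint
    },
}


def missing_fields(config: dict) -> list:
    """Return the keys this page's snippets would look up but the entry lacks."""
    required = ["provider", "model"]
    if config.get("provider") != "vllm":
        # The external-API setup reads LLM_CONFIGS[model]["base_url"].
        required.append("base_url")
    return [key for key in required if key not in config]


for name, config in LLM_CONFIGS.items():
    print(name, "missing:", missing_fields(config))
```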
Tips for Best Results
Be specific but natural:
✅ “the leftmost child wearing blue vest”
❌ “child blue vest left” (too terse)
❌ “I want to segment the child who is positioned on the left side and is currently wearing clothing that appears to be blue and vest-like” (too verbose)
Use relative positions:
✅ “the person on the right”
✅ “the second from the left”
❌ “the person at pixel coordinates (450, 230)” (use box prompts instead)
Combine multiple cues:
✅ “the red car behind the truck”
❌ “the thing” (too vague)
Troubleshooting
If the agent returns no result or a poor one:
Check if your query is too ambiguous
Try breaking complex queries into simpler parts
Verify the MLLM can see the image (check debug output)
If the wrong object is selected:
Add more specific attributes to your query
Use spatial relationships to disambiguate
Check SAM 3’s confidence threshold (lower if needed)
If the vLLM server cannot be reached:
Ensure the server is running: curl http://localhost:8001/health
Check GPU memory availability
Verify --allowed-local-media-path includes your image directory
If an external API rate-limits your requests:
Implement exponential backoff for retries
Use local vLLM for high-volume processing
Cache MLLM responses for repeated queries
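The retry and caching advice can be sketched in a few lines of standard-library Python. This is one possible shape, not something SAM 3 provides: `call_with_backoff` and `cached_call` are hypothetical helpers, the delays are illustrative defaults, and the broad `except Exception` should be narrowed to your API client's rate-limit error in real use.

```python
import random
import time


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call fn(), retrying with exponentially growing delays plus a little jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:  # narrow this to the client's rate-limit error in practice
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))


def cached_call(key, fn, cache):
    """Return cache[key], computing it once via call_with_backoff(fn) if absent."""
    if key not in cache:
        cache[key] = call_with_backoff(fn)
    return cache[key]


# Usage idea (not executed here): key on the inputs that determine the answer.
# cache = {}
# path = cached_call((image, prompt), lambda: run_one(prompt), cache)
```

Keying the cache on the `(image, prompt)` pair means a repeated query skips the MLLM entirely, which also reduces pressure on rate limits.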
Next Steps
Image Inference: Learn direct SAM 3 prompting without MLLMs
Interactive Refinement: Combine agent results with interactive refinement