SAM 3 Agent - SAM 3

SAM 3 Agent enables complex segmentation queries by integrating Multi-modal Large Language Models (MLLMs) as a reasoning layer. The MLLM breaks down complex prompts into simpler queries that SAM 3 can process.

What is SAM 3 Agent?

SAM 3 Agent allows you to use natural, complex language to describe objects:

❌ Simple: “person”, “blue vest”
✅ Complex: “the leftmost child wearing blue vest”
✅ Relational: “the person standing behind the dog”
✅ Descriptive: “the tallest building in the background”

The agent workflow:

MLLM analyzes the image and your complex query
MLLM generates simpler prompts for SAM 3 (text/box)
SAM 3 performs the actual segmentation
Results are returned with visual overlays

Setup

Install SAM 3

Follow the installation instructions in the repository.

Configure PyTorch

import torch

# Turn on tfloat32 for Ampere GPUs
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Use bfloat16 for the entire notebook
torch.autocast("cuda", dtype=torch.bfloat16).__enter__()

# Inference mode for the whole notebook
torch.inference_mode().__enter__()

Build SAM 3 model

import os
import sam3
from sam3 import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor

sam3_root = os.path.dirname(sam3.__file__)
bpe_path = f"{sam3_root}/assets/bpe_simple_vocab_16e6.txt.gz"
model = build_sam3_image_model(bpe_path=bpe_path)
processor = Sam3Processor(model, confidence_threshold=0.5)

MLLM Configuration

SAM 3 Agent supports various MLLMs. You can use either:

vLLM-served models (self-hosted)
External APIs (Gemini, GPT, Claude, etc.)

Option 1: vLLM (Self-Hosted)

Configuration
Installation
Start Server

LLM_CONFIGS = {
    "qwen3_vl_8b_thinking": {
        "provider": "vllm",
        "model": "Qwen/Qwen3-VL-8B-Thinking",
    },
}

model = "qwen3_vl_8b_thinking"
LLM_API_KEY = "DUMMY_API_KEY"  # Not used for vLLM
LLM_SERVER_URL = "http://0.0.0.0:8001/v1"

llm_config = LLM_CONFIGS[model]
llm_config["api_key"] = LLM_API_KEY
llm_config["name"] = model

Install vLLM in a separate conda environment:

conda create -n vllm python=3.12
conda activate vllm
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu128

Launch the vLLM server:

vllm serve Qwen/Qwen3-VL-8B-Thinking \
  --tensor-parallel-size 4 \
  --allowed-local-media-path / \
  --enforce-eager \
  --port 8001

Option 2: External API

LLM_CONFIGS = {
    "gemini_flash": {
        "provider": "google",
        "model": "gemini-2.0-flash-exp",
        "base_url": "https://generativelanguage.googleapis.com/v1beta/",
    },
}

model = "gemini_flash"
LLM_API_KEY = "your-api-key-here"  # Set your actual API key
LLM_SERVER_URL = LLM_CONFIGS[model]["base_url"]

llm_config = LLM_CONFIGS[model]
llm_config["api_key"] = LLM_API_KEY
llm_config["name"] = model

Never commit API keys to version control. Use environment variables:

import os
LLM_API_KEY = os.getenv("GEMINI_API_KEY")

Running Agent Inference

from functools import partial
from sam3.agent.client_llm import send_generate_request as send_generate_request_orig
from sam3.agent.client_sam3 import call_sam_service as call_sam_service_orig
from sam3.agent.inference import run_single_image_inference

# Prepare input
image = "assets/images/test_image.jpg"
prompt = "the leftmost child wearing blue vest"
image = os.path.abspath(image)

# Create service clients
send_generate_request = partial(
    send_generate_request_orig,
    server_url=LLM_SERVER_URL,
    model=llm_config["model"],
    api_key=llm_config["api_key"]
)
call_sam_service = partial(call_sam_service_orig, sam3_processor=processor)

# Run inference
output_image_path = run_single_image_inference(
    image,
    prompt,
    llm_config,
    send_generate_request,
    call_sam_service,
    debug=True,
    output_dir="agent_output"
)

# Display result
if output_image_path is not None:
    from IPython.display import display, Image
    display(Image(filename=output_image_path))

How It Works

Query Understanding

The MLLM analyzes your complex prompt:

Identifies spatial relationships (“leftmost”, “behind”)
Extracts visual attributes (“blue vest”, “wearing”)
Understands context and object relationships

Prompt Decomposition

The MLLM generates structured prompts for SAM 3:

{
  "text_prompts": ["child", "blue vest"],
  "spatial_filter": "leftmost",
  "relationship": "wearing"
}

SAM 3 Segmentation

SAM 3 processes the simplified prompts:

Segments all children in the image
Segments all blue vests
Returns candidates with confidence scores

Result Filtering

The MLLM filters and ranks results:

Applies spatial constraints (“leftmost”)
Verifies relationships (“wearing”)
Returns the best match

Debugging Output

Enable debug mode to see the agent’s reasoning:

output_image_path = run_single_image_inference(
    image, prompt, llm_config,
    send_generate_request, call_sam_service,
    debug=True,  # Enable debug output
    output_dir="agent_output"
)

Debug output shows:

MLLM’s interpretation of your query
Generated SAM 3 prompts
Intermediate segmentation results
Final filtering decisions

Example Queries

Spatial Relations
Visual Attributes
Actions and States
Complex Combinations

# Directional
"the rightmost person"
"the object in the top-left corner"
"the car furthest from the camera"

# Positional
"the person standing behind the table"
"the object between the two chairs"
"the animal closest to the door"

# Color and pattern
"the person wearing a red striped shirt"
"the blue car with white stripes"

# Size and shape
"the tallest building"
"the smallest cat"
"the round object on the table"

# Actions
"the person holding a phone"
"the dog running towards the camera"

# States
"the open door"
"the lit lamp"
"the broken window"

"the tallest person wearing a blue shirt standing on the left"
"the second car from the right with red tail lights"
"the child in the middle holding a yellow ball"

Supported MLLMs

Tested models (add your own to LLM_CONFIGS):

Provider	Model	Best For
vLLM	Qwen/Qwen3-VL-8B-Thinking	Self-hosted, good reasoning
Google	gemini-2.0-flash-exp	Fast, API-based
OpenAI	gpt-4-vision-preview	High accuracy
Anthropic	claude-3-opus-20240229	Complex reasoning

Tips for Best Results

Be specific but natural:

✅ “the leftmost child wearing blue vest”
❌ “child blue vest left” (too terse)
❌ “I want to segment the child who is positioned on the left side and is currently wearing clothing that appears to be blue and vest-like” (too verbose)

Use relative positions:

✅ “the person on the right”
✅ “the second from the left”
❌ “the person at pixel coordinates (450, 230)” (use box prompts instead)

Combine multiple cues:

✅ “the red car behind the truck”
❌ “the thing” (too vague)

Troubleshooting

MLLM returns no results

Check if your query is too ambiguous
Try breaking complex queries into simpler parts
Verify the MLLM can see the image (check debug output)

Wrong object segmented

Add more specific attributes to your query
Use spatial relationships to disambiguate
Check SAM 3’s confidence threshold (lower if needed)

vLLM server errors

Ensure server is running: curl http://localhost:8001/health
Check GPU memory availability
Verify --allowed-local-media-path includes your image directory

API rate limits

Implement exponential backoff for retries
Use local vLLM for high-volume processing
Cache MLLM responses for repeated queries

Next Steps

Image Inference

Learn direct SAM 3 prompting without MLLMs

Interactive Refinement

Combine agent results with interactive refinement

Documentation Index

​What is SAM 3 Agent?

​Setup

​MLLM Configuration

​Option 1: vLLM (Self-Hosted)

​Option 2: External API

​Running Agent Inference

​How It Works

​Debugging Output

​Example Queries

​Supported MLLMs

​Tips for Best Results

​Troubleshooting

​Next Steps

Image Inference

Interactive Refinement

What is SAM 3 Agent?

Setup

MLLM Configuration

Option 1: vLLM (Self-Hosted)

Option 2: External API

Running Agent Inference

How It Works

Debugging Output

Example Queries

Supported MLLMs

Tips for Best Results

Troubleshooting

Next Steps