Overview
The SAM 3 video API uses a request-response pattern for all operations. This page documents the request and response formats for each operation type.
Request Structure
All requests are Python dictionaries with a type field:
request = {
    "type": "request_type",
    # ... type-specific parameters
}
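Because every request is a plain dictionary, a small helper can assemble one and leave out optional fields that were not set. The following is a minimal sketch; `build_request` is a hypothetical convenience function and not part of the SAM 3 API.

```python
from typing import Any, Dict


def build_request(request_type: str, **params: Any) -> Dict[str, Any]:
    """Assemble a request dict, omitting parameters whose value is None.

    Hypothetical helper for illustration; not part of the SAM 3 API.
    """
    if not request_type:
        raise ValueError("request_type must be a non-empty string")
    request: Dict[str, Any] = {"type": request_type}
    request.update({key: value for key, value in params.items() if value is not None})
    return request


request = build_request(
    "start_session",
    resource_path="/path/to/video.mp4",
    session_id=None,  # None values are dropped from the final dict
)
```

The resulting dictionary can be passed to `predictor.handle_request` like any hand-written request.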
Session Management
start_session
Start a new inference session on a video or image.
Request:
{
    "type": "start_session",
    "resource_path": str,  # Path to video/image file or JPEG frame directory
    "session_id": Optional[str]  # Optional session ID (auto-generated if omitted)
}
Response:
{
    "session_id": str  # Session identifier for subsequent requests
}
Example:
response = predictor.handle_request({
    "type": "start_session",
    "resource_path": "/path/to/video.mp4"
})
session_id = response["session_id"]
reset_session
Reset a session to its initial state, removing all prompts and results.
Request:
{
    "type": "reset_session",
    "session_id": str
}
Response:
close_session
Close and clean up a session (frees GPU memory).
Request:
{
    "type": "close_session",
    "session_id": str
}
Response:
Prompting
add_prompt
Add a text, point, or box prompt on a specific video frame.
Request:
{
    "type": "add_prompt",
    "session_id": str,
    "frame_index": int,  # 0-based frame index
    # Optional: text prompt
    "text": Optional[str],
    # Optional: point prompts
    "points": Optional[List[List[float]]],  # [[x1, y1], [x2, y2], ...]
    "point_labels": Optional[List[int]],  # [1, 0, ...] (1=foreground, 0=background)
    # Optional: box prompts
    "bounding_boxes": Optional[List[List[float]]],  # [[cx, cy, w, h], ...] (normalized)
    "bounding_box_labels": Optional[List[int]],  # [1, 0, ...] (1=positive, 0=negative)
    # Optional: object assignment
    "obj_id": Optional[int]  # Assign prompt to existing object
}
Response:
{
    "frame_index": int,
    "outputs": Dict[int, Dict]  # Object ID -> segmentation result
}
Output Format:
outputs = {
    obj_id: {
        "mask": np.ndarray,  # Binary mask (H, W), dtype=bool
        "score": float,  # Confidence score
        "bbox": List[float],  # [x0, y0, x1, y1] in pixels
    }
}
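As an illustration of how the `outputs` mapping might be consumed, the sketch below keeps only objects above a confidence threshold and computes each remaining box's area. The helper names, the threshold, and the sample values are made up for this example, and the `"mask"` entry is omitted so the snippet runs without NumPy.

```python
from typing import Any, Dict, Sequence


def select_confident(outputs: Dict[int, Dict[str, Any]], min_score: float) -> Dict[int, Dict[str, Any]]:
    """Keep only objects whose confidence score is at least min_score."""
    return {obj_id: obj for obj_id, obj in outputs.items() if obj["score"] >= min_score}


def bbox_area(bbox: Sequence[float]) -> float:
    """Area in square pixels of an [x0, y0, x1, y1] box."""
    x0, y0, x1, y1 = bbox
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


# Made-up values in the documented output shape (mask omitted for brevity).
outputs = {
    1: {"score": 0.92, "bbox": [100, 150, 300, 400]},
    2: {"score": 0.35, "bbox": [10, 20, 50, 80]},
}
confident = select_confident(outputs, min_score=0.5)
areas = {obj_id: bbox_area(obj["bbox"]) for obj_id, obj in confident.items()}
```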
Examples:
Text prompt:
predictor.handle_request({
    "type": "add_prompt",
    "session_id": session_id,
    "frame_index": 0,
    "text": "person"
})
Point prompt:
predictor.handle_request({
    "type": "add_prompt",
    "session_id": session_id,
    "frame_index": 0,
    "points": [[640, 360], [700, 400]],
    "point_labels": [1, 1]  # Both foreground
})
Box prompt:
predictor.handle_request({
    "type": "add_prompt",
    "session_id": session_id,
    "frame_index": 0,
    "bounding_boxes": [[0.5, 0.5, 0.3, 0.4]],  # center_x, center_y, width, height (0-1)
    "bounding_box_labels": [1]  # Positive box
})
Combined prompts:
predictor.handle_request({
    "type": "add_prompt",
    "session_id": session_id,
    "frame_index": 0,
    "text": "dog",
    "points": [[500, 300]],
    "point_labels": [1],
    "bounding_boxes": [[0.4, 0.3, 0.2, 0.3]],
    "bounding_box_labels": [1]
})
remove_object
Remove an object from tracking.
Request:
{
    "type": "remove_object",
    "session_id": str,
    "obj_id": int,
    "is_user_action": bool  # Whether this is a user-initiated removal
}
Response:
Example:
predictor.handle_request({
    "type": "remove_object",
    "session_id": session_id,
    "obj_id": 1,
    "is_user_action": True
})
Propagation
propagate_in_video
Propagate prompts to get segmentation results across video frames.
Request:
{
    "type": "propagate_in_video",
    "session_id": str,
    "propagation_direction": str,  # "forward", "backward", or "both"
    "start_frame_index": Optional[int],  # Starting frame (default: first prompted frame)
    "max_frame_num_to_track": Optional[int]  # Max frames to track (default: all)
}
Response (streaming):
This is a streaming request. Call handle_stream_request instead of handle_request and iterate over the responses it yields:
for response in predictor.handle_stream_request(request):
    # response format:
    {
        "frame_index": int,
        "outputs": Dict[int, Dict]  # Same format as add_prompt outputs
    }
Examples:
Forward propagation:
for response in predictor.handle_stream_request({
    "type": "propagate_in_video",
    "session_id": session_id,
    "propagation_direction": "forward",
    "start_frame_index": 0
}):
    frame_idx = response["frame_index"]
    outputs = response["outputs"]
    # Process frame
Backward propagation:
for response in predictor.handle_stream_request({
    "type": "propagate_in_video",
    "session_id": session_id,
    "propagation_direction": "backward",
    "start_frame_index": 100
}):
    # Process frames 99, 98, 97, ...
    pass
Bidirectional:
for response in predictor.handle_stream_request({
    "type": "propagate_in_video",
    "session_id": session_id,
    "propagation_direction": "both",
    "start_frame_index": 50,
    "max_frame_num_to_track": 100
}):
    # Processes frames 50->149, then 49->0
    pass
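With bidirectional propagation the frames do not arrive in ascending order (forward frames first, then backward, per the comment above), so it can help to index the streamed responses by frame before further processing. The sketch below shows one way to do that; `collect_by_frame` is a hypothetical helper, and `fake_stream` is a stand-in for `predictor.handle_stream_request(...)` so the example runs without a model.

```python
from typing import Any, Dict, Iterable, Iterator


def collect_by_frame(responses: Iterable[Dict[str, Any]]) -> Dict[int, Dict[int, Dict[str, Any]]]:
    """Index streamed responses by frame_index so frames can be looked up in any order."""
    by_frame: Dict[int, Dict[int, Dict[str, Any]]] = {}
    for response in responses:
        by_frame[response["frame_index"]] = response["outputs"]
    return by_frame


def fake_stream() -> Iterator[Dict[str, Any]]:
    """Stand-in for predictor.handle_stream_request(...); yields made-up responses."""
    # Mimics a "both" propagation from frame 2: forward (2, 3), then backward (1, 0).
    for frame_index in [2, 3, 1, 0]:
        yield {"frame_index": frame_index, "outputs": {1: {"score": 0.9}}}


by_frame = collect_by_frame(fake_stream())
ordered_frames = sorted(by_frame)  # ascending frame order for downstream processing
```

Collecting everything holds all masks in memory at once; for long videos, processing each response inside the loop may be preferable.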
Coordinate Systems
Points
Points are in pixel coordinates (x, y):
- x: horizontal position (0 to image_width)
- y: vertical position (0 to image_height)
"points": [[320, 240]] # x=320 pixels, y=240 pixels
Bounding Boxes
Boxes in requests use normalized center-width-height format:
- center_x: horizontal center (0.0 to 1.0)
- center_y: vertical center (0.0 to 1.0)
- width: box width (0.0 to 1.0)
- height: box height (0.0 to 1.0)
"bounding_boxes": [[0.5, 0.5, 0.3, 0.4]] # Center at 50%, 50%, size 30%x40%
Boxes in responses use pixel XYXY format:
- [x0, y0, x1, y1]: top-left and bottom-right corners in pixels
outputs[obj_id]["bbox"] = [100, 150, 300, 400] # x0, y0, x1, y1
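Because request boxes and response boxes use different conventions, feeding a returned `bbox` back in as a box prompt requires a conversion. The sketch below shows the arithmetic in both directions; the function names are hypothetical, and the image size must be supplied by the caller.

```python
from typing import List, Sequence


def xyxy_pixels_to_cxcywh_norm(bbox: Sequence[float], image_width: int, image_height: int) -> List[float]:
    """Convert a pixel [x0, y0, x1, y1] box to normalized [cx, cy, w, h]."""
    x0, y0, x1, y1 = bbox
    return [
        (x0 + x1) / 2 / image_width,
        (y0 + y1) / 2 / image_height,
        (x1 - x0) / image_width,
        (y1 - y0) / image_height,
    ]


def cxcywh_norm_to_xyxy_pixels(box: Sequence[float], image_width: int, image_height: int) -> List[float]:
    """Convert a normalized [cx, cy, w, h] box to pixel [x0, y0, x1, y1]."""
    cx, cy, w, h = box
    return [
        (cx - w / 2) * image_width,
        (cy - h / 2) * image_height,
        (cx + w / 2) * image_width,
        (cy + h / 2) * image_height,
    ]


# Example with an assumed 1280x720 frame.
request_box = xyxy_pixels_to_cxcywh_norm([100, 150, 300, 400], image_width=1280, image_height=720)
round_trip = cxcywh_norm_to_xyxy_pixels(request_box, image_width=1280, image_height=720)
```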
Label Conventions
Point Labels
- 1: Foreground point (include this region)
- 0: Background point (exclude this region)
Box Labels
- 1: Positive box (include objects in this box)
- 0: Negative box (exclude objects in this box)
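A client-side check can catch malformed prompts before they reach the predictor. The sketch below assumes that each point or box needs exactly one label and that labels are limited to 0 and 1, as the conventions above suggest; `validate_labeled_prompts` is a hypothetical helper, and the server's own validation may differ.

```python
from typing import Sequence


def validate_labeled_prompts(
    items: Sequence[Sequence[float]],
    labels: Sequence[int],
    item_len: int,
    name: str,
) -> None:
    """Raise ValueError if items and labels are inconsistent (assumed rules, see lead-in)."""
    if len(items) != len(labels):
        raise ValueError(f"{name}: got {len(items)} items but {len(labels)} labels")
    for item in items:
        if len(item) != item_len:
            raise ValueError(f"{name}: each entry needs {item_len} values, got {len(item)}")
    for label in labels:
        if label not in (0, 1):
            raise ValueError(f"{name}: labels must be 0 or 1, got {label!r}")


# Valid prompts pass silently.
validate_labeled_prompts([[640, 360], [700, 400]], [1, 1], item_len=2, name="points")
validate_labeled_prompts([[0.5, 0.5, 0.3, 0.4]], [1], item_len=4, name="bounding_boxes")

# One point with two labels is rejected.
try:
    validate_labeled_prompts([[640, 360]], [1, 0], item_len=2, name="points")
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)
```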
Error Handling
Invalid requests raise RuntimeError:
try:
    response = predictor.handle_request(request)
except RuntimeError as e:
    print(f"Request failed: {e}")
Common errors:
- Session not found: Invalid or expired session_id
- Invalid frame index: frame_index out of range
- Missing prompts: Propagation before adding any prompts
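Since a failed request raises an exception, it is easy to leave a session open and its GPU memory allocated. One way to guard against that is to wrap the session in a context manager that always sends `close_session`. This is a sketch only: `open_session` is a hypothetical helper, and `FakePredictor` is a stand-in that simulates an error so the example runs without a model.

```python
from contextlib import contextmanager


@contextmanager
def open_session(predictor, resource_path):
    """Start a session and always close it, even if a request inside the block fails."""
    response = predictor.handle_request({"type": "start_session", "resource_path": resource_path})
    session_id = response["session_id"]
    try:
        yield session_id
    finally:
        predictor.handle_request({"type": "close_session", "session_id": session_id})


class FakePredictor:
    """Stand-in that records request types and fails on add_prompt; not the real predictor."""

    def __init__(self):
        self.calls = []

    def handle_request(self, request):
        self.calls.append(request["type"])
        if request["type"] == "add_prompt":
            raise RuntimeError("Invalid frame index")
        return {"session_id": "demo"}


predictor = FakePredictor()
error_message = None
try:
    with open_session(predictor, "video.mp4") as session_id:
        predictor.handle_request(
            {"type": "add_prompt", "session_id": session_id, "frame_index": 9999, "text": "person"}
        )
except RuntimeError as e:
    error_message = str(e)
```

Here the simulated error still propagates to the caller, but `close_session` is sent first.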
Complete Workflow Example
from sam3.model.sam3_video_predictor import Sam3VideoPredictor
predictor = Sam3VideoPredictor()
# 1. Start session
response = predictor.handle_request({
    "type": "start_session",
    "resource_path": "video.mp4"
})
session_id = response["session_id"]
# 2. Add prompt on first frame
response = predictor.handle_request({
    "type": "add_prompt",
    "session_id": session_id,
    "frame_index": 0,
    "text": "person",
    "points": [[640, 360]],
    "point_labels": [1]
})
# 3. Propagate through video
results = []
for response in predictor.handle_stream_request({
    "type": "propagate_in_video",
    "session_id": session_id,
    "propagation_direction": "forward"
}):
    results.append(response)
# 4. Process results
for result in results:
    frame_idx = result["frame_index"]
    for obj_id, obj_data in result["outputs"].items():
        mask = obj_data["mask"]
        score = obj_data["score"]
        bbox = obj_data["bbox"]
        # Save or visualize
# 5. Clean up
predictor.handle_request({
    "type": "close_session",
    "session_id": session_id
})
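As one possible follow-up to step 4, the collected `results` can be regrouped per object to see on which frames each object was tracked. The sketch below uses made-up results in the documented response shape; `frames_per_object` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Any, Dict, List


def frames_per_object(results: List[Dict[str, Any]]) -> Dict[int, List[int]]:
    """Map each object ID to the sorted list of frame indices where it appears."""
    tracks: Dict[int, List[int]] = defaultdict(list)
    for result in results:
        for obj_id in result["outputs"]:
            tracks[obj_id].append(result["frame_index"])
    return {obj_id: sorted(frames) for obj_id, frames in tracks.items()}


# Made-up propagation results; real entries also carry "mask" and "bbox".
results = [
    {"frame_index": 0, "outputs": {1: {"score": 0.9}, 2: {"score": 0.8}}},
    {"frame_index": 1, "outputs": {1: {"score": 0.9}}},
]
tracks = frames_per_object(results)
```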