COCO Evaluation

Overview

The CocoEvaluator class provides standard COCO evaluation metrics (AP, AR) for segmentation and detection tasks with distributed training support.

CocoEvaluator

Class Initialization

from sam3.eval.coco_eval import CocoEvaluator

evaluator = CocoEvaluator(
    coco_gt,
    iou_types=["segm"],
    useCats=False,
    dump_dir=None,
    postprocessor=None,
    average_by_rarity=False,
    use_normalized_areas=True,
    maxdets=[1, 10, 100],
    exhaustive_only=False,
    all_exhaustive_only=True
)

Parameters

coco_gt

COCO | list[COCO]

required

COCO API object(s) holding the ground-truth annotations. Pass a single COCO object, or a list of COCO objects for oracle evaluation.

iou_types

list[str]

required

Types of IoU to evaluate: ["segm"] for masks, ["bbox"] for boxes, or both.

useCats

bool

required

Whether to use categories for evaluation. Set False for open-vocabulary tasks.

dump_dir

str | None

required

Directory to dump predictions. If None, predictions are not saved.

postprocessor

object

required

Postprocessor module to convert model outputs to COCO format.

average_by_rarity

bool

default: False

Whether to compute AP separately for different object rarity buckets and average.

use_normalized_areas

bool

default: True

Whether object areas are normalized by image area. Affects size bucket definitions.

maxdets

list[int]

default: [1, 10, 100]

Caps on the number of detections considered per image. Metrics are evaluated at each cap in the list.

exhaustive_only

bool

default: False

Whether to restrict evaluation to exhaustively annotated images only.

all_exhaustive_only

bool

default: True

Whether to require all ground truth sources to be exhaustive (for oracle evaluation).
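
To make the effect of useCats more concrete, here is a minimal sketch of class-aware versus class-agnostic grouping. It assumes the pycocotools convention that, with categories disabled, all items in an image are matched against each other regardless of label. The helper name group_for_matching and the sample detections are hypothetical and are not part of CocoEvaluator.

```python
from collections import defaultdict


def group_for_matching(items, use_cats):
    """Group items into the sets that are compared against each other.

    With use_cats=True, items are compared only within the same
    (image_id, category_id) pair. With use_cats=False, category labels
    are ignored and every item in an image lands in one shared group.
    """
    groups = defaultdict(list)
    for item in items:
        # -1 is used here as the placeholder "any category" id.
        category = item["category_id"] if use_cats else -1
        groups[(item["image_id"], category)].append(item)
    return dict(groups)


# Hypothetical detections, for illustration only.
detections = [
    {"image_id": 1, "category_id": 3, "score": 0.9},
    {"image_id": 1, "category_id": 7, "score": 0.8},
    {"image_id": 2, "category_id": 3, "score": 0.7},
]

per_category = group_for_matching(detections, use_cats=True)
class_agnostic = group_for_matching(detections, use_cats=False)
```

With useCats=False, both detections in image 1 fall into one group, which is why this setting suits open-vocabulary tasks where category IDs are not comparable across prompts.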

Methods

update

Update evaluator with model outputs.

evaluator.update(
    model_outputs,
    targets,
    image_ids
)

synchronize_between_processes

Synchronize predictions across distributed processes.

evaluator.synchronize_between_processes()

accumulate

Accumulate evaluation results.

evaluator.accumulate(imgIds=None)

summarize

Compute and print summary metrics.

results = evaluator.summarize()

results

dict

Dictionary containing COCO metrics:

coco_eval_masks_AP: Mask AP (averaged over IoU thresholds)
coco_eval_masks_AP_50: Mask AP @ IoU=0.5
coco_eval_masks_AP_75: Mask AP @ IoU=0.75
coco_eval_masks_AP_{size}: AP by size (tiny/small/medium/large/huge)
coco_eval_masks_AR: Average Recall
Equivalent coco_eval_bbox_* metrics when "bbox" is included in iou_types
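
Since the returned dictionary is flat, it can help to split it by IoU type before logging. The sketch below assumes the key prefixes shown on this page (coco_eval_masks_ and coco_eval_bbox_). The helper split_by_iou_type is hypothetical, and the values are placeholders rather than real results.

```python
def split_by_iou_type(results):
    """Split a flat metrics dict into per-IoU-type dicts keyed by short names."""
    prefixes = {"masks": "coco_eval_masks_", "bbox": "coco_eval_bbox_"}
    grouped = {name: {} for name in prefixes}
    for key, value in results.items():
        for name, prefix in prefixes.items():
            if key.startswith(prefix):
                # Strip the prefix so "coco_eval_masks_AP_50" becomes "AP_50".
                grouped[name][key[len(prefix):]] = value
    return grouped


# Placeholder values only; these are not measured metrics.
example_results = {
    "coco_eval_masks_AP": 0.0,
    "coco_eval_masks_AP_50": 0.0,
    "coco_eval_bbox_AP": 0.0,
}

grouped = split_by_iou_type(example_results)
```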

compute_synced

Run full evaluation pipeline (sync + accumulate + summarize).

results = evaluator.compute_synced()
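
The description above suggests that compute_synced is equivalent to calling the three methods in sequence. The sketch below spells out that order against a hypothetical stand-in object (RecordingEvaluator), so it runs without sam3 installed. It illustrates the documented order, not the library's actual implementation.

```python
def run_full_pipeline(evaluator):
    """Call the three documented steps in order and return the summary."""
    evaluator.synchronize_between_processes()
    evaluator.accumulate()
    return evaluator.summarize()


class RecordingEvaluator:
    """Hypothetical stand-in that only records which methods were called."""

    def __init__(self):
        self.calls = []

    def synchronize_between_processes(self):
        self.calls.append("synchronize_between_processes")

    def accumulate(self, imgIds=None):
        self.calls.append("accumulate")

    def summarize(self):
        self.calls.append("summarize")
        return {}


stub = RecordingEvaluator()
results = run_full_pipeline(stub)
```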

Example Usage

Basic Evaluation

from pycocotools.coco import COCO
from sam3.eval.coco_eval import CocoEvaluator

# Load ground truth
coco_gt = COCO("annotations.json")

# Initialize evaluator
evaluator = CocoEvaluator(
    coco_gt=coco_gt,
    iou_types=["segm"],
    useCats=False,  # Open-vocabulary
    dump_dir="./predictions",
    postprocessor=my_postprocessor
)

# During evaluation loop
for batch in dataloader:
    outputs = model(batch)
    evaluator.update(outputs, batch["targets"], batch["image_ids"])

# Compute final metrics
results = evaluator.compute_synced()

print(f"Mask AP: {results['coco_eval_masks_AP']:.3f}")
print(f"Mask AP50: {results['coco_eval_masks_AP_50']:.3f}")
print(f"Mask AP75: {results['coco_eval_masks_AP_75']:.3f}")

Distributed Training

import torch.distributed as dist

# Initialize evaluator on all ranks
evaluator = CocoEvaluator(
    coco_gt=coco_gt,
    iou_types=["segm"],
    useCats=True,
    dump_dir="./predictions",
    postprocessor=postprocessor
)

# Each rank processes its data
for batch in dataloader:
    outputs = model(batch)
    evaluator.update(outputs, batch["targets"], batch["image_ids"])

# Synchronize across ranks
evaluator.synchronize_between_processes()

# Only rank 0 accumulates and prints metrics
if dist.get_rank() == 0:
    evaluator.accumulate()
    results = evaluator.summarize()

Box and Mask Evaluation

# Evaluate both boxes and masks
evaluator = CocoEvaluator(
    coco_gt=coco_gt,
    iou_types=["bbox", "segm"],
    useCats=True,
    dump_dir=None,
    postprocessor=postprocessor
)

# ... run evaluation ...

results = evaluator.compute_synced()

print(f"Box AP: {results['coco_eval_bbox_AP']:.3f}")
print(f"Mask AP: {results['coco_eval_masks_AP']:.3f}")

Custom Max Detections

# Evaluate with different max detection thresholds
evaluator = CocoEvaluator(
    coco_gt=coco_gt,
    iou_types=["segm"],
    useCats=False,
    dump_dir=None,
    postprocessor=postprocessor,
    maxdets=[1, 10, 300]  # Custom thresholds
)

Normalized Areas

# When object areas are normalized by image area
evaluator = CocoEvaluator(
    coco_gt=coco_gt,
    iou_types=["segm"],
    useCats=False,
    dump_dir=None,
    postprocessor=postprocessor,
    use_normalized_areas=True  # Adjusts size buckets
)

# Size buckets become:
# - tiny: [0, 0.001]
# - small: [0.001, 0.01]
# - medium: [0.01, 0.1]
# - large: [0.1, 0.5]
# - huge: [0.5, 0.95]
# - whole_image: [0.95, inf]
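
As an illustration of the buckets listed above, the following sketch assigns a mask to a size bucket from its pixel area and image dimensions. The bucket boundaries come from this page. Treating each range as half-open (lower bound inclusive, upper bound exclusive) is an assumption, and size_bucket is a hypothetical helper, not part of the evaluator.

```python
import math

# Boundaries from the list above; the half-open convention is an assumption.
NORMALIZED_SIZE_BUCKETS = [
    ("tiny", 0.0, 0.001),
    ("small", 0.001, 0.01),
    ("medium", 0.01, 0.1),
    ("large", 0.1, 0.5),
    ("huge", 0.5, 0.95),
    ("whole_image", 0.95, math.inf),
]


def size_bucket(mask_area, image_width, image_height):
    """Return the bucket name for a mask area given in pixels."""
    normalized = mask_area / (image_width * image_height)
    for name, low, high in NORMALIZED_SIZE_BUCKETS:
        if low <= normalized < high:
            return name
    raise ValueError(f"normalized area {normalized} is outside all buckets")


# A 5000-pixel mask in a 640x480 image covers about 1.6% of the image.
bucket = size_bucket(5000, 640, 480)
```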

Metrics Explained

Average Precision (AP)

AP: Mean AP over IoU thresholds from 0.5 to 0.95 in steps of 0.05
AP_50: AP at IoU threshold 0.5 (loose localization)
AP_75: AP at IoU threshold 0.75 (strict localization)
AP_{size}: AP for a specific object size bucket:

tiny: Very small objects (area < 0.1% of image)
small: Small objects (0.1% - 1% of image)
medium: Medium objects (1% - 10% of image)
large: Large objects (10% - 50% of image)
huge: Very large objects (50% - 95% of image)
whole_image: Nearly entire image (> 95%)
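
To show why AP_50 is called loose and AP_75 strict, the sketch below computes the IoU of one predicted box against one ground-truth box and lists which of the ten COCO thresholds it clears. It uses boxes rather than masks to stay dependency-free. The function box_iou_xywh and the sample boxes are hypothetical.

```python
def box_iou_xywh(a, b):
    """Intersection over union of two boxes given as [x, y, width, height]."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


# The ten thresholds 0.50, 0.55, ..., 0.95 that the headline AP averages over.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

ground_truth = [0, 0, 100, 100]
prediction = [0, 0, 100, 72]  # covers 72% of the ground-truth box

iou = box_iou_xywh(ground_truth, prediction)
passed = [t for t in IOU_THRESHOLDS if iou >= t]
```

Here the prediction counts as a match at IoU 0.5 but not at 0.75, so it would contribute to AP_50 and not to AP_75.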

Average Recall (AR)

AR: Mean recall at the maximum-detections threshold
AR_50: AR at maxDets=50 (if maxdets includes 50)
AR_75: AR at maxDets=75 (if maxdets includes 75)
AR_{size}: Recall by object size
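
The role of the max-detections cap in recall can be sketched for a single image at a single IoU threshold. This is a simplification: the reported AR is additionally averaged over IoU thresholds and images. The helper recall_at_max_dets and the match flags are hypothetical.

```python
def recall_at_max_dets(matched_flags, num_ground_truth, max_dets):
    """Recall when only the top max_dets detections are kept.

    matched_flags holds one boolean per detection, sorted by descending
    score, and is True when that detection matched a distinct ground-truth
    object.
    """
    if num_ground_truth == 0:
        return 0.0
    hits = sum(matched_flags[:max_dets])
    return hits / num_ground_truth


# Five detections for an image with four ground-truth objects.
flags = [True, False, True, True, False]
recalls = {k: recall_at_max_dets(flags, 4, k) for k in [1, 10, 100]}
```

Raising the cap from 1 to 10 lets lower-ranked correct detections count, while going from 10 to 100 changes nothing here because the image has only five detections.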

Postprocessor Requirements

The postprocessor must implement:

class MyPostprocessor:
    def process_results(self, outputs, targets, image_ids):
        """
        Convert model outputs to COCO prediction format.
        
        Returns:
            dict: {image_id: {"masks": ..., "boxes": ..., "scores": ..., "labels": ...}}
        """
        predictions = {}
        for img_id, output in zip(image_ids, outputs):
            predictions[img_id] = {
                "masks": output["masks"],  # (N, H, W) binary masks
                "boxes": output["boxes"],  # (N, 4) boxes in XYXY format
                "scores": output["scores"],  # (N,) confidence scores
                "labels": output["labels"],  # (N,) category IDs
            }
        return predictions

COCO Format Requirements

Ground Truth

{
  "images": [
    {"id": 1, "width": 640, "height": 480, "file_name": "image.jpg"}
  ],
  "annotations": [
    {
      "id": 1,
      "image_id": 1,
      "category_id": 1,
      "segmentation": {"size": [480, 640], "counts": "..."},  // RLE
      "area": 5000,
      "bbox": [x, y, w, h],
      "iscrowd": 0
    }
  ],
  "categories": [
    {"id": 1, "name": "person", "supercategory": "person"}
  ]
}
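
Before handing a ground-truth file to the evaluator, a quick consistency check can catch dangling references. The sketch below takes its field list from the example above. Whether the evaluator strictly requires every one of these fields is an assumption, and find_ground_truth_problems is a hypothetical helper.

```python
REQUIRED_TOP_LEVEL = ("images", "annotations", "categories")
# Field list copied from the example annotation; strictness is an assumption.
REQUIRED_ANNOTATION_KEYS = (
    "id", "image_id", "category_id", "segmentation", "area", "bbox", "iscrowd",
)


def find_ground_truth_problems(gt):
    """Return a list of human-readable problems found in a COCO-style dict."""
    problems = [f"missing top-level key: {k}" for k in REQUIRED_TOP_LEVEL if k not in gt]
    image_ids = {img["id"] for img in gt.get("images", [])}
    category_ids = {cat["id"] for cat in gt.get("categories", [])}
    for ann in gt.get("annotations", []):
        missing = [k for k in REQUIRED_ANNOTATION_KEYS if k not in ann]
        if missing:
            problems.append(f"annotation {ann.get('id')} is missing {missing}")
        if ann.get("image_id") not in image_ids:
            problems.append(f"annotation {ann.get('id')} has unknown image_id")
        if ann.get("category_id") not in category_ids:
            problems.append(f"annotation {ann.get('id')} has unknown category_id")
    return problems


ground_truth = {
    "images": [{"id": 1, "width": 640, "height": 480, "file_name": "image.jpg"}],
    "annotations": [{
        "id": 1, "image_id": 1, "category_id": 1,
        "segmentation": {"size": [480, 640], "counts": "..."},
        "area": 5000, "bbox": [10, 20, 100, 50], "iscrowd": 0,
    }],
    "categories": [{"id": 1, "name": "person", "supercategory": "person"}],
}

problems = find_ground_truth_problems(ground_truth)
```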

Predictions

Predictions are automatically converted to:

[
  {
    "image_id": 1,
    "category_id": 1,
    "segmentation": {"size": [480, 640], "counts": "..."},
    "score": 0.95,
    "area": 5000
  }
]
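
For box evaluation, the conversion might look roughly like the sketch below, which flattens a per-image prediction dict into a list of COCO result entries and converts XYXY boxes to the [x, y, w, h] layout used in the ground truth. It uses plain Python lists instead of tensors, and to_coco_box_results is a hypothetical helper, not the evaluator's internal code.

```python
def xyxy_to_xywh(box):
    """Convert [x1, y1, x2, y2] to the COCO [x, y, width, height] layout."""
    x1, y1, x2, y2 = box
    return [x1, y1, x2 - x1, y2 - y1]


def to_coco_box_results(predictions):
    """Flatten {image_id: {"boxes", "scores", "labels"}} into COCO results."""
    results = []
    for image_id, pred in predictions.items():
        for box, score, label in zip(pred["boxes"], pred["scores"], pred["labels"]):
            results.append({
                "image_id": image_id,
                "category_id": label,
                "bbox": xyxy_to_xywh(box),
                "score": score,
            })
    return results


# One hypothetical detection for image 1, with its box in XYXY format.
predictions = {1: {"boxes": [[10, 20, 110, 70]], "scores": [0.95], "labels": [1]}}
coco_results = to_coco_box_results(predictions)
```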

Notes

Uses pycocotools internally
Supports distributed evaluation across multiple GPUs
Predictions can be dumped to disk for later analysis
Size buckets automatically adjusted for normalized areas
Compatible with COCO, LVIS, and custom datasets in COCO format
For open-vocabulary tasks, set useCats=False
