[P1] Fix prediction-GT matching in both model evaluators (TP double-count / dropped duplicate FPs); re-quantify gold-set P/R #9

Description

@jonfroehlich

Both model evaluators implement prediction→GT matching without enforcing one-to-one assignment, and the resulting errors bias the published numbers upward:

  • stage_two/evaluate.py:419-433: the inner GT loop has no break, so a single prediction within the matching radius of two ground-truth points appends two TP entries (inflating recall and the TP mass under the PR curve). Worse, a prediction whose only in-radius GTs are already claimed sets passes=True but appends nothing. Standard protocols (VOC/COCO) count duplicate detections as false positives; here they vanish, which inflates precision.
  • stage_one/crop_model/ps_and_manual_model/evaluate.py:365-385: same duplicate-drop pattern (duplicate matches are neither TP nor FP).
  • By contrast, stage_one/dataset_evaluation/evaluate.py:144-163 handles this correctly (it breaks after claiming the first unclaimed GT and buckets duplicates separately). The repo therefore contains both the right and the wrong pattern, and the Stage-2 evaluator (the one producing the paper's model PR curves) is the buggy variant.

Related metric issues to fix in the same pass:

  • AP is computed with raw, uninterpolated precision (stage_two/evaluate.py:217-250), so it is not directly comparable to VOC/COCO AP.
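  • The 0.022 "radius" is applied as 0.022 * heatmap_w against x scaled by w=1024 and y scaled by h=512 (evaluate.py:326,420-421). Because the same pixel threshold is used on both axes while y is scaled by half as much as x, the match region in normalized coordinates is an ellipse 2× more permissive vertically than horizontally, not a circle. This is consistent with Stage 1, but should be documented.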
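
For the AP point, here is a minimal sketch of all-point interpolated AP in the VOC style (precision replaced by its monotone envelope before integrating over recall). It is not the code in stage_two/evaluate.py; the function name `interpolated_ap` and the `(confidence, is_tp)` input shape are assumptions made for illustration.

```python
from typing import List, Tuple


def interpolated_ap(results: List[Tuple[float, bool]], num_gt: int) -> float:
    """All-point interpolated AP (hypothetical helper, not from the repo).

    results: one (confidence, is_tp) pair per prediction, after matching.
    num_gt:  total number of ground-truth points.
    """
    if num_gt == 0 or not results:
        return 0.0

    # Rank predictions from most to least confident.
    ranked = sorted(results, key=lambda r: r[0], reverse=True)

    tp = fp = 0
    recalls: List[float] = []
    precisions: List[float] = []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))

    # Interpolation step: each precision becomes the max precision at any
    # equal-or-higher recall (sweep right to left).
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])

    # Integrate the interpolated precision over recall.
    ap = 0.0
    prev_recall = 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap
```

With a ranked sequence FP, TP, TP and two GT points, the raw precisions are 0, 1/2, 2/3. The envelope lifts the first two to 2/3, so the interpolated value is higher than the one from summing raw precision over the same recall steps. That gap is why the two conventions should not be compared directly.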

Fix: greedy one-to-one matching (predictions sorted by confidence, each claims at most one unclaimed GT, duplicates counted as FP), optional interpolated AP. Then re-run the 1k-manual-gold-set evaluation with corrected matching and quantify the delta vs the published precision/recall; if material, add an erratum note to the README (and consider an arXiv v2 note).
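
A rough sketch of the proposed matching, under stated assumptions: predictions are `(x, y, confidence)` tuples and GT points are `(x, y)` tuples already in the same pixel space, and `radius` is a plain Euclidean threshold. The name `greedy_match` is hypothetical. Picking the nearest unclaimed GT is one possible tie-break; the Stage-1 dataset evaluator reportedly claims the first unclaimed GT instead, and either choice keeps the matching one-to-one.

```python
import math
from typing import List, Sequence, Tuple


def greedy_match(
    preds: Sequence[Tuple[float, float, float]],
    gts: Sequence[Tuple[float, float]],
    radius: float,
) -> Tuple[List[Tuple[float, bool]], int]:
    """Greedy one-to-one matching (hypothetical helper, not from the repo).

    Returns ((confidence, is_tp) per prediction in descending confidence,
    number of unmatched GT points, i.e. false negatives).
    """
    # Highest-confidence predictions get first pick of the GT points.
    order = sorted(range(len(preds)), key=lambda i: preds[i][2], reverse=True)
    claimed = [False] * len(gts)
    results: List[Tuple[float, bool]] = []

    for i in order:
        px, py, conf = preds[i]
        best_j = None
        best_d = radius
        for j, (gx, gy) in enumerate(gts):
            if claimed[j]:
                continue  # a GT can be claimed at most once
            d = math.hypot(px - gx, py - gy)
            if d <= best_d:
                best_j, best_d = j, d

        if best_j is None:
            # No unclaimed GT in range: this covers both ordinary misses and
            # duplicates of an already-claimed GT. Both count as FP.
            results.append((conf, False))
        else:
            claimed[best_j] = True
            results.append((conf, True))  # exactly one TP per prediction

    return results, claimed.count(False)
```

For example, with `gts = [(0, 0), (10, 10)]`, `radius = 1` and predictions `(0, 0, 0.9)`, `(0.1, 0, 0.8)`, `(5, 5, 0.7)`, the first prediction is a TP, the second is a duplicate counted as FP, the third is an ordinary FP, and one GT remains unmatched.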

From the post-release code review (July 2026).
