Both model evaluators implement prediction→GT matching without one-to-one discipline, and the errors bias the published numbers upward:
- `stage_two/evaluate.py:419-433`: the inner GT loop has no `break`, so a single prediction within the matching radius of two ground-truth points appends two TP entries (inflating recall and the TP mass under the PR curve). Worse, a prediction whose only in-radius GTs are already claimed sets `passes=True` but appends nothing. Standard protocols (VOC/COCO) count duplicate detections as false positives; here they vanish, inflating precision.
- `stage_one/crop_model/ps_and_manual_model/evaluate.py:365-385`: same duplicate-drop pattern (duplicate matches are neither TP nor FP).
- By contrast, `stage_one/dataset_evaluation/evaluate.py:144-163` does this correctly (it breaks after claiming the first unclaimed GT and buckets duplicates separately). The repo therefore contains both the right and the wrong pattern, and the Stage-2 evaluator (the one producing the paper's model PR curves) is the buggy variant.
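To make the two failure modes concrete, here is a toy reconstruction written only from the description above. It is not the repository's code: `naive_match`, the tuple layouts, and the sample points are all hypothetical, and the real loop may differ in detail.

```python
import math

def naive_match(preds, gts, radius):
    """Toy loop with the two flaws described above (hypothetical, not repo code)."""
    claimed = set()
    tp, fp = [], []
    for p in preds:                      # p = (x, y, confidence)
        passes = False
        for j, g in enumerate(gts):      # g = (x, y)
            if math.dist(p[:2], g) <= radius:
                passes = True
                if j not in claimed:
                    claimed.add(j)
                    tp.append(p)         # flaw 1: no break, so p can claim several GTs
        if not passes:
            fp.append(p)                 # flaw 2: in-radius duplicates never land here
    return tp, fp

gts = [(0.0, 0.0), (1.0, 0.0)]
preds = [
    (0.45, 0.0, 0.9),  # within the radius of both GTs
    (0.0, 0.1, 0.8),   # only near the first GT, which is already claimed
]
tp, fp = naive_match(preds, gts, radius=0.6)
print(len(tp), len(fp))  # 2 0
```

On this input the toy loop reports 2 TP and 0 FP, i.e. precision and recall of 1.0. A confidence-ordered greedy matcher that lets each prediction claim at most one unclaimed GT would report 1 TP, 1 FP, and one missed GT, so both numbers would be 0.5.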
Related metric issues to fix in the same pass:
- AP is computed with raw, uninterpolated precision (`stage_two/evaluate.py:217-250`), so it is not comparable to VOC/COCO AP.
- The 0.022 "radius" is applied as `0.022 * heatmap_w` against x scaled by w=1024 and y scaled by h=512 (`evaluate.py:326,420-421`), so it is actually an ellipse 2× more permissive vertically. This is consistent with Stage 1 but should be documented.
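The ellipse claim can be checked with a few lines of arithmetic. The sketch below assumes coordinates are normalized to [0, 1] before scaling and that `heatmap_w` equals w=1024; both are assumptions read off the description, and `within_radius` is a hypothetical helper rather than a function from the repo.

```python
import math

W, H = 1024, 512                    # coordinate scales cited in the issue
HEATMAP_W = 1024                    # assumption: heatmap_w equals w
threshold_px = 0.022 * HEATMAP_W    # single pixel threshold used for both axes

# Tolerance expressed back in normalized coordinates, per axis.
tol_x = threshold_px / W            # 0.022 of the image width
tol_y = threshold_px / H            # 0.044 of the image height

def within_radius(pred, gt):
    """Distance test in pixel space, with x scaled by W and y scaled by H."""
    dx = (pred[0] - gt[0]) * W
    dy = (pred[1] - gt[1]) * H
    return math.hypot(dx, dy) <= threshold_px

gt = (0.5, 0.5)
print(within_radius((0.54, 0.5), gt))  # 0.04 offset along x
print(within_radius((0.5, 0.54), gt))  # 0.04 offset along y
```

Under these assumptions the same normalized offset of 0.04 is rejected horizontally and accepted vertically, which is the 2× vertical permissiveness noted above.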
Fix: greedy one-to-one matching (predictions sorted by confidence, each claims at most one unclaimed GT, duplicates counted as FP), optional interpolated AP. Then re-run the 1k-manual-gold-set evaluation with corrected matching and quantify the delta vs the published precision/recall; if material, add an erratum note to the README (and consider an arXiv v2 note).
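A minimal sketch of the proposed fix, under stated assumptions: predictions are `(x, y, confidence)` tuples, GTs are `(x, y)` tuples, and distances are plain Euclidean in whatever space the caller has already scaled to. The names `greedy_match` and `average_precision` are hypothetical. Picking the nearest unclaimed GT is one possible choice (the correct Stage-1 evaluator is described as claiming the first unclaimed one), and the `interpolate=False` path is only a stand-in for uninterpolated AP, not the repo's implementation.

```python
import math

def greedy_match(preds, gts, radius):
    """Return [(confidence, is_tp)] in descending confidence order."""
    claimed = set()
    results = []
    for x, y, conf in sorted(preds, key=lambda p: p[2], reverse=True):
        best_j, best_d = None, math.inf
        for j, (gx, gy) in enumerate(gts):
            d = math.hypot(x - gx, y - gy)
            if j not in claimed and d <= radius and d < best_d:
                best_j, best_d = j, d
        if best_j is not None:
            claimed.add(best_j)          # each GT is claimed at most once
        # Unmatched predictions, including duplicates, are counted as FP.
        results.append((conf, best_j is not None))
    return results

def average_precision(results, n_gt, interpolate=True):
    """Area under the PR curve; optionally uses the monotone precision envelope."""
    if n_gt == 0 or not results:
        return 0.0
    tp = fp = 0
    precisions, recalls = [], []
    for _, is_tp in results:
        tp += is_tp
        fp += not is_tp
        precisions.append(tp / (tp + fp))
        recalls.append(tp / n_gt)
    if interpolate:
        # All-point interpolation: precision at recall r becomes the
        # maximum precision observed at any recall >= r.
        for i in range(len(precisions) - 2, -1, -1):
            precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

gts = [(0.0, 0.0), (1.0, 0.0)]
preds = [(0.45, 0.0, 0.9), (0.0, 0.1, 0.8), (1.0, 0.05, 0.7)]
results = greedy_match(preds, gts, radius=0.6)
print(results)
print(average_precision(results, n_gt=len(gts)))
```

Running both `interpolate=True` and `interpolate=False` on the 1k manual gold set would separate the effect of the matching fix from the effect of the AP definition when quantifying the delta.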
From the post-release code review (July 2026).