[P1] Fix prediction-GT matching in both model evaluators (TP double-count / dropped duplicate FPs); re-quantify gold-set P/R #9

Both model evaluators implement prediction→GT matching without one-to-one discipline, and the errors bias the published numbers upward:

- `stage_two/evaluate.py:419-433`: the inner GT loop has **no `break`**, so a single prediction within the matching radius of two ground-truth points appends **two TP entries** (inflating recall and the TP mass under the PR curve). Worse, a prediction whose only in-radius GTs are already claimed sets `passes=True` but appends **nothing**. Standard protocols (VOC/COCO) count such duplicate detections as false positives; here they vanish, which inflates precision.
- `stage_one/crop_model/ps_and_manual_model/evaluate.py:365-385`: same duplicate-drop pattern (duplicate matches are neither TP nor FP).
- By contrast, `stage_one/dataset_evaluation/evaluate.py:144-163` does this **correctly**: it breaks after claiming the first unclaimed GT and buckets duplicates separately. The repo therefore contains both the right and the wrong pattern, and the Stage-2 evaluator (the one producing the paper's model PR curves) is the buggy variant.
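
To make the two failure modes concrete, here is a minimal sketch of a loop with the shape described above. It is a hypothetical reconstruction from this description, not the code in `stage_two/evaluate.py`; the function name `buggy_match`, the `results` list, and the toy coordinates are illustrative assumptions.

```python
import math

def buggy_match(preds, gts, radius):
    """Hypothetical reconstruction of the flawed loop; NOT the repository's code."""
    claimed = set()
    results = []  # one "TP" or "FP" string per appended record
    for px, py in preds:
        passes = False
        for j, (gx, gy) in enumerate(gts):
            if math.hypot(px - gx, py - gy) <= radius:
                passes = True
                if j not in claimed:
                    claimed.add(j)
                    results.append("TP")  # no break: the loop keeps claiming GTs
        if not passes:
            results.append("FP")  # only reached when no GT is in radius at all
    return results

gts = [(0.0, 0.0), (1.0, 0.0)]
# One prediction midway between both GT points, followed by an exact duplicate.
preds = [(0.5, 0.0), (0.5, 0.0)]
print(buggy_match(preds, gts, radius=0.6))  # ['TP', 'TP']
```

Two predictions produce two TP records and no FP record. The first prediction is counted twice, and the duplicate is never recorded, so it cannot be told apart from a prediction that was never made.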

Related metric issues to fix in the same pass:
- AP is computed with **raw, uninterpolated** precision (`stage_two/evaluate.py:217-250`), so it is not comparable to VOC/COCO AP.
- The 0.022 "radius" is applied as `0.022 * heatmap_w` against x scaled by w=1024 and y scaled by h=512 (`evaluate.py:326,420-421`). In normalized coordinates it is therefore an **ellipse**, 2× more permissive vertically (0.022 in x, 0.044 in y). This is consistent with Stage 1, but should be documented.
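
Both metric issues can be illustrated with a short sketch. Everything below is an assumption made for illustration: the helper names (`within_radius`, `pr_points`, `average_precision`), the assumption that coordinates are normalized to [0, 1] before scaling, and the choice of all-point interpolation (the precision envelope used by VOC 2010+; COCO uses a 101-point variant). It does not reproduce the repository's code.

```python
import math
from itertools import accumulate

HEATMAP_W, HEATMAP_H = 1024, 512   # scale factors quoted in this issue
RADIUS_PX = 0.022 * HEATMAP_W      # 22.528 px, applied to both axes

def within_radius(pred, gt):
    """pred, gt: (x, y) in normalized [0, 1] coordinates (assumed)."""
    dx = (pred[0] - gt[0]) * HEATMAP_W
    dy = (pred[1] - gt[1]) * HEATMAP_H
    return math.hypot(dx, dy) <= RADIUS_PX

def pr_points(tp_flags, n_gt):
    """tp_flags: 1 (TP) or 0 (FP) per prediction, in descending confidence."""
    precision, recall, tp = [], [], 0
    for rank, flag in enumerate(tp_flags, start=1):
        tp += flag
        precision.append(tp / rank)
        recall.append(tp / n_gt)
    return precision, recall

def average_precision(precision, recall, interpolate):
    if interpolate:
        # Envelope: precision at rank i becomes the max precision at any rank >= i.
        precision = list(accumulate(reversed(precision), max))[::-1]
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

# The same normalized offset of 0.04 fails horizontally but passes vertically.
print(within_radius((0.54, 0.5), (0.5, 0.5)))  # False
print(within_radius((0.5, 0.54), (0.5, 0.5)))  # True

precision, recall = pr_points([1, 0, 1, 1], n_gt=4)
print(average_precision(precision, recall, interpolate=False))
print(average_precision(precision, recall, interpolate=True))
```

On this toy ranking the raw AP is about 0.604 and the interpolated AP is 0.625. Interpolation can only raise the value, which is why raw and interpolated figures should not be compared directly.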

Fix: greedy one-to-one matching (predictions sorted by confidence, each claims at most one unclaimed GT, duplicates counted as FP), optional interpolated AP. Then **re-run the 1k-manual-gold-set evaluation with corrected matching and quantify the delta vs the published precision/recall**; if material, add an erratum note to the README (and consider an arXiv v2 note).
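
A minimal sketch of the proposed matching, under stated assumptions: predictions are `(x, y, confidence)` tuples, GTs are `(x, y)` tuples already in the same coordinate space, and each prediction claims the *nearest* unclaimed GT within the radius. Claiming the first unclaimed GT in iteration order would satisfy the one-to-one rule equally well. The name `greedy_match` is hypothetical.

```python
import math

def greedy_match(preds, gts, radius):
    """Greedy one-to-one matching (illustrative sketch).

    Returns (tp_flags, n_missed): tp_flags holds 1 (TP) or 0 (FP) for every
    prediction in descending-confidence order; n_missed counts unclaimed GTs.
    """
    order = sorted(range(len(preds)), key=lambda i: preds[i][2], reverse=True)
    claimed = set()
    tp_flags = []
    for i in order:
        px, py, _ = preds[i]
        best_j, best_d = None, math.inf
        for j, (gx, gy) in enumerate(gts):
            if j in claimed:
                continue  # a GT can be claimed by at most one prediction
            d = math.hypot(px - gx, py - gy)
            if d <= radius and d < best_d:
                best_j, best_d = j, d
        if best_j is None:
            # No GT in radius, or every in-radius GT is already claimed:
            # either way the prediction is recorded as a false positive.
            tp_flags.append(0)
        else:
            claimed.add(best_j)
            tp_flags.append(1)
    return tp_flags, len(gts) - len(claimed)

gts = [(0.0, 0.0), (1.0, 0.0)]
preds = [(0.5, 0.0, 0.9), (0.5, 0.0, 0.8), (0.5, 0.0, 0.7)]
print(greedy_match(preds, gts, radius=0.6))  # ([1, 1, 0], 0)
```

Every prediction yields exactly one record, so `len(tp_flags) == len(preds)` always holds. That invariant makes a cheap regression test for both evaluators once they are fixed.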

From the post-release code review (July 2026).

