Context
The paper shows VLM self-reported confidence is uncalibrated noise:
- Confidence when correct: 0.867
- Confidence when wrong: 0.857
- ECE: 0.524
The harness derives reliability from session history (stability, temporal consistency) instead.
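For reference, ECE (expected calibration error) is usually computed by binning predictions by confidence and taking the weighted average of the gap between accuracy and mean confidence in each bin. The sketch below assumes equal-width bins. The function name and the bin count are illustrative, and the paper's exact binning scheme is not stated in this issue.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin.

    Hypothetical helper for illustration; not taken from the harness.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Assign each prediction to a bin by its confidence, clamped to [0, n_bins - 1].
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = max(0, min(int(conf * n_bins), n_bins - 1))
        bins[idx].append((conf, bool(ok)))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's calibration gap by the share of samples it holds.
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece
```

A well-calibrated model scores close to 0. A value near 0.5, as reported above, means stated confidence is off from observed accuracy by roughly half the scale on average.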
Current state
confidence is kept as an optional field in PerceptionOutput with a default of 0.0 so existing experiments don't break. But several experiments still actively use it:
- duration_aware.py: confidence-gated transitions (_confidence_gate function). The entire strategy is built around confidence thresholds, so it needs rethinking. Gate on stability or temporal context instead?
- ensemble.py: averages confidence across 3 votes. Could use majority agreement count instead.
- changegate.py: falls back to the previous stage when confidence is low. Same replacement: use stability.
- descriptive_multishot.py: includes confidence in the reconsideration prompt template.
- run.py: prompts still ask the VLM to return "confidence": 0.0-1.0 in JSON. Harmless but misleading.
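As a rough illustration of the agreement-count idea for ensemble.py, the sketch below takes the stage label from each vote and reports how many votes agree with the winner. The function name, return shape, and stage labels are assumptions made for this example, not the existing ensemble.py interface.

```python
from collections import Counter


def majority_vote(stages):
    """Return (winning_stage, agreement_count, total_votes).

    Hypothetical helper: replaces an averaged confidence with a count of
    how many votes agree on the winning stage.
    """
    if not stages:
        raise ValueError("at least one vote is required")
    counts = Counter(stages)
    # most_common(1) returns the label with the highest count; on a tie,
    # the label seen first wins.
    stage, agreement = counts.most_common(1)[0]
    return stage, agreement, len(stages)
```

With 3 votes, an agreement of 3 is unanimous, 2 is a simple majority, and 1 means every vote disagreed, which a caller could treat as "no decision" rather than trusting the first label.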
What to do
- Remove confidence field from PerceptionOutput entirely
- Update prompt JSON schemas in all experiments to drop the confidence field
- Rewrite duration_aware.py to gate on harness-derived signals (stability, temporal analysis) instead of VLM confidence
- Simplify ensemble.py to report agreement count rather than averaged confidence
- Fix changegate.py similarly
- Update run.py and benchmark/metrics.py to remove any confidence-related metrics
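One possible shape for a stability-based gate, usable in both duration_aware.py and changegate.py, is sketched below. It accepts a stage transition only after the new stage has been observed a set number of times in a row, and otherwise keeps the previous stage. This is an assumption about what "stability" could mean here. The class name, the required_repeats parameter, and the stage labels are hypothetical, and the harness may define stability differently.

```python
class StabilityGate:
    """Accept a new stage only after it repeats `required_repeats` times in a row.

    Hypothetical sketch: gates on observation history instead of
    VLM-reported confidence.
    """

    def __init__(self, required_repeats=3):
        if required_repeats < 1:
            raise ValueError("required_repeats must be at least 1")
        self.required_repeats = required_repeats
        self.current = None      # last accepted stage
        self._candidate = None   # stage currently trying to replace it
        self._streak = 0         # consecutive observations of the candidate

    def update(self, observed):
        """Feed one observed stage and return the accepted stage."""
        if self.current is None:
            # The first observation is accepted as the starting stage.
            self.current = observed
            return self.current
        if observed == self.current:
            # The observation confirms the current stage; drop any candidate.
            self._candidate = None
            self._streak = 0
            return self.current
        if observed == self._candidate:
            self._streak += 1
        else:
            self._candidate = observed
            self._streak = 1
        if self._streak >= self.required_repeats:
            # The candidate has been stable long enough; accept the transition.
            self.current = observed
            self._candidate = None
            self._streak = 0
        return self.current
```

A single stray observation never changes the accepted stage, which mirrors the current fall-back-to-previous-stage behavior without reading the confidence field.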