Remove confidence from perception output and experiments #2

Description

@pskeshu

Context

The paper shows that VLM self-reported confidence is uncalibrated noise; it is nearly the same whether the prediction is correct or wrong:

  • Confidence when correct: 0.867
  • Confidence when wrong: 0.857
  • Expected calibration error (ECE): 0.524
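
For reference, ECE is commonly computed by binning predictions by confidence and taking the sample-weighted gap between mean confidence and accuracy in each bin. The sketch below assumes equal-width bins and a default of 10 bins; the paper's exact binning is not stated here, and the function name is hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE over equal-width confidence bins on [0, 1].

    Assumption: equal-width binning with n_bins bins. The paper may
    use a different scheme.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    conf_sums = [0.0] * n_bins  # summed confidence per bin
    hit_sums = [0.0] * n_bins   # number of correct predictions per bin

    for conf, ok in zip(confidences, correct):
        # Clamp so that conf == 1.0 lands in the last bin.
        idx = min(max(int(conf * n_bins), 0), n_bins - 1)
        conf_sums[idx] += conf
        hit_sums[idx] += 1.0 if ok else 0.0

    # (bin_count / n) * |accuracy - mean_conf| simplifies to |hits - conf_sum| / n.
    return sum(abs(h - c) for h, c in zip(hit_sums, conf_sums)) / n
```

A model that reports roughly the same high confidence regardless of correctness ends up with an ECE close to the gap between that confidence and its actual accuracy.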

The harness derives reliability from session history (stability, temporal consistency) instead.
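
As an illustration of what a history-derived signal could look like, here is a minimal sketch of a stability score: the fraction of recent stage labels that agree with the latest one. The function name, the `window` parameter, and the list-of-labels representation of session history are assumptions for illustration, not the harness's actual API.

```python
def stability(history, window=5):
    """Fraction of the last `window` stage labels equal to the most recent label.

    `history` is assumed to be a list of stage labels, oldest first.
    Returns 0.0 for an empty history.
    """
    recent = history[-window:]
    if not recent:
        return 0.0
    current = recent[-1]
    return sum(1 for label in recent if label == current) / len(recent)
```

Unlike a self-reported number, this value depends only on what the model actually predicted over time.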

Current state

confidence remains an optional field in PerceptionOutput, defaulting to 0.0 so that existing experiments don't break, but several experiments still actively use it:

  • duration_aware.py — confidence-gated transitions (_confidence_gate function). The entire strategy is built around confidence thresholds. Needs rethinking — gate on stability or temporal context instead?
  • ensemble.py — averages confidence across 3 votes. Could use majority agreement count instead.
  • changegate.py — falls back to previous stage when confidence is low. Same replacement: use stability.
  • descriptive_multishot.py — includes confidence in the reconsideration prompt template.
  • run.py — prompts still ask VLM to return "confidence": 0.0-1.0 in JSON. Harmless but misleading.
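
For the ensemble.py item, a majority-agreement count could look roughly like the sketch below. `majority_vote` is a hypothetical helper, and votes are assumed to be plain stage labels; the real ensemble code may carry richer objects.

```python
from collections import Counter


def majority_vote(votes):
    """Return (winning_label, agreement_count, total_votes).

    Ties are broken in favor of the label seen first, which is how
    Counter.most_common orders equal counts.
    """
    if not votes:
        raise ValueError("need at least one vote")
    label, agreement = Counter(votes).most_common(1)[0]
    return label, agreement, len(votes)
```

With 3 votes, the agreement count is 1, 2, or 3, which is easier to interpret than an average of three uncalibrated confidences.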

What to do

  1. Remove confidence field from PerceptionOutput entirely
  2. Update prompt JSON schemas in all experiments to drop the confidence field
  3. Rewrite duration_aware.py to gate on harness-derived signals (stability, temporal analysis) instead of VLM confidence
  4. Simplify ensemble.py to report agreement count rather than averaged confidence
  5. Rewrite changegate.py the same way: fall back to the previous stage when stability is low, not when confidence is low
  6. Update run.py and benchmark/metrics.py to remove any confidence-related metrics
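
A rough sketch of steps 1, 3, and 5 together, under stated assumptions: the `stage` and `reasoning` fields, `parse_perception`, `gated_stage`, and the `window` / `min_stability` defaults are all hypothetical and only illustrate the shape of the change. The real PerceptionOutput and gating logic may differ.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class PerceptionOutput:
    # Hypothetical fields for illustration; note there is no `confidence`.
    stage: str
    reasoning: str = ""


def parse_perception(raw_json):
    """Parse VLM JSON, ignoring unknown keys such as a stale "confidence"."""
    data = json.loads(raw_json)
    allowed = {f.name for f in fields(PerceptionOutput)}
    return PerceptionOutput(**{k: v for k, v in data.items() if k in allowed})


def gated_stage(previous_stage, proposed_stage, history, window=5, min_stability=0.6):
    """Accept a stage change only if recent history supports it.

    `history` is assumed to be a list of earlier stage labels, oldest first.
    Otherwise fall back to the previous stage.
    """
    if proposed_stage == previous_stage:
        return previous_stage
    recent = (list(history) + [proposed_stage])[-window:]
    support = sum(1 for s in recent if s == proposed_stage) / len(recent)
    return proposed_stage if support >= min_stability else previous_stage
```

Dropping unknown keys in the parser means older prompts that still ask for "confidence" would not break parsing while the prompt schemas are being updated.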
