Remove confidence from perception output and experiments

## Context

The paper shows that VLM self-reported confidence is uncalibrated noise. It barely differs between correct and wrong answers:
- Confidence when correct: 0.867
- Confidence when wrong: 0.857
- ECE (expected calibration error): 0.524
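
For reference, a minimal sketch of how an ECE figure like this is commonly computed, assuming equal-width confidence bins. The function name, the bin count, and the binning scheme are assumptions for illustration; the paper's exact procedure is not stated in this issue.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| per bin.

    Hypothetical helper, not part of the harness. `confidences` are floats
    in [0, 1]; `correct` are booleans of the same length.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Assign each prediction to an equal-width bin; 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece
```

A model that reports roughly the same high confidence whether it is right or wrong ends up with a large gap between mean confidence and accuracy, which is what a high ECE reflects.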

The harness derives reliability from session history (stability, temporal consistency) instead.
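
As a rough illustration of a history-derived signal, the sketch below scores stability as the share of recent stage labels that agree with the latest one. The function name, the `window` parameter, and the use of plain string labels are assumptions; the harness's actual stability and temporal-consistency logic may differ.

```python
def stability(stage_history, window=5):
    """Fraction of the last `window` stage labels equal to the most recent one.

    Hypothetical helper: `stage_history` is a list of stage labels from the
    current session, oldest first.
    """
    recent = stage_history[-window:]
    if not recent:
        return 0.0
    latest = recent[-1]
    return sum(1 for stage in recent if stage == latest) / len(recent)
```

A value near 1.0 means the perceived stage has held steady across recent frames, while a low value means the label is flickering and should be trusted less.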

## Current state

`confidence` is kept as an optional field in `PerceptionOutput` with a default of 0.0 so existing experiments don't break. But several experiments still actively use it:

- **`duration_aware.py`**: confidence-gated transitions (`_confidence_gate` function). The entire strategy is built around confidence thresholds and needs rethinking. Could it gate on stability or temporal context instead?
- **`ensemble.py`**: averages confidence across 3 votes. Could use the majority agreement count instead.
- **`changegate.py`**: falls back to the previous stage when confidence is low. Could fall back when stability is low instead.
- **`descriptive_multishot.py`**: includes confidence in the reconsideration prompt template.
- **`run.py`**: prompts still ask the VLM to return `"confidence": 0.0-1.0` in JSON. Harmless but misleading.
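
For the `ensemble.py` case, a minimal sketch of replacing averaged confidence with an agreement count. The function name and the list-of-labels input are assumptions, not the module's current interface.

```python
from collections import Counter


def majority_vote(stages):
    """Return (winning_stage, agreement_count) for a list of voted stage labels.

    Hypothetical helper. Ties resolve to the label that appeared first,
    because Counter.most_common keeps first-seen order for equal counts.
    """
    if not stages:
        raise ValueError("need at least one vote")
    winner, agreement = Counter(stages).most_common(1)[0]
    return winner, agreement
```

With 3 votes, an agreement count of 3 or 2 is a direct, interpretable reliability signal, and a count of 1 flags a three-way split.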

## What to do

1. Remove `confidence` field from `PerceptionOutput` entirely
2. Update prompt JSON schemas in all experiments to drop the confidence field
3. Rewrite `duration_aware.py` to gate on harness-derived signals (stability, temporal analysis) instead of VLM confidence
4. Simplify `ensemble.py` to report agreement count rather than averaged confidence
5. Update `changegate.py` so the fallback to the previous stage triggers on low stability rather than low VLM confidence
6. Update `run.py` and `benchmark/metrics.py` to remove any confidence-related metrics
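
One possible shape for the gating in steps 3 and 5, sketched under the assumption that a transition is accepted only after the candidate stage has been observed for several consecutive frames. `gate_transition`, `min_consecutive`, and the argument layout are hypothetical and only illustrate the idea.

```python
def gate_transition(previous_stage, candidate_stage, recent_stages, min_consecutive=3):
    """Accept `candidate_stage` only if it filled the last `min_consecutive` frames.

    Hypothetical replacement for a confidence threshold: `recent_stages` is the
    session's stage history, oldest first, including the current frame.
    Otherwise the previous stage is kept.
    """
    if candidate_stage == previous_stage:
        return previous_stage
    tail = recent_stages[-min_consecutive:]
    if len(tail) == min_consecutive and all(s == candidate_stage for s in tail):
        return candidate_stage
    return previous_stage
```

The threshold trades responsiveness for robustness: a larger `min_consecutive` suppresses one-frame flicker but delays genuine transitions.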
