hybrid_annot2: fix late-pretzel vs hatching confusion (66.3% -> 69.5%) by trisha-ant · Pull Request #11 · gently-project/gently-perception

trisha-ant · 2026-06-09T14:22:56Z

Follow-up to #10, which merged at hybrid_annot (66.3% exact on embryos 5-8).

What this fixes

The largest remaining single failure was the 'ahead' cluster: the last 33 frames of embryo_5 (x3 seeds = 99 predictions) were classified as hatching while the ground truth stays pretzel. Two compounding causes, visible in the transcripts:

A very late pretzel fills the eggshell, presses against it, and moves vigorously between frames. The model read that as emergence ("the body appears to extend beyond the eggshell boundary").
Once one hatching call landed, history anchoring locked it in ("T144 and T145 were hatching, so the embryo has been hatching for at least 2 frames") - 30+ frames of self-reinforcing error.

embryo_5 is the one embryo whose recording ends before hatch, so nothing ever corrected the streak.

The fix

Prompt-only, in both system prompts:

Hatching takes 2-3 minutes, and this series is imaged every ~4 minutes - so hatching spans at most a frame or two and may be missed outright (the frame count is cadence-dependent: 20-second imaging would capture ~6 hatching frames). Late pretzel instead wriggles WITHIN the shell for a long time - which mimics emergence but is still pretzel as long as the bright mass stays shell-sized. (Kesavan's embryo_5 notes describe exactly this: "the worm moves a lot ... within the egg shell".)
A multi-frame hatching streak in the history is self-contradictory; treat it as evidence the earlier calls were wrong rather than something to continue.

Results (embryos 5-8, claude-opus-4-6, 3 seeds)

exact 69.5% +/- 0.2 (from 66.3 +/- 0.2), adjacent 87.8%
embryo_5 end frames: 0/33 hatching misreads, was 33/33 in every seed
hatched detection on embryos 6-8 stays 100% - the guard does not delay real hatches
pretzel 62% -> 70%, everything else unchanged within seed noise
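
For readers new to the two headline metrics, here is a minimal sketch of how exact and adjacent accuracy could be computed. The stage order and the reading of "adjacent" as "within one stage of the ground truth" are assumptions for illustration. The harness's actual stage list and scoring code are not shown in this PR.

```python
# Assumed developmental order; the harness's real stage list may differ.
STAGES = ["bean", "comma", "1.5fold", "2fold", "pretzel", "hatching", "hatched"]
INDEX = {name: i for i, name in enumerate(STAGES)}


def exact_and_adjacent(truth, predicted):
    """Return (exact, adjacent) accuracy as fractions in [0, 1].

    exact:    prediction equals the ground-truth stage.
    adjacent: prediction is at most one stage away (assumed definition).
    """
    if len(truth) != len(predicted) or not truth:
        raise ValueError("truth and predicted must be non-empty and equal length")
    exact = sum(t == p for t, p in zip(truth, predicted))
    adjacent = sum(abs(INDEX[t] - INDEX[p]) <= 1 for t, p in zip(truth, predicted))
    n = len(truth)
    return exact / n, adjacent / n
```

Under this definition the embryo_5 failure (pretzel called hatching) counts as an adjacent hit but an exact miss, which would explain why the fix moves the exact score more than the adjacent one.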

Also included: a small render fix - make_context now falls back to EGL when the default GL backend fails, which is required on headless machines (verified against Mesa llvmpipe; smoke test passes at ~49ms per 512x512 render).

Kills the 'ahead' failure cluster: 33 end-of-series embryo_5 frames (x3 seeds = 99 predictions) where a shell-filling, vigorously moving late pretzel was called hatching and history anchoring locked the streak in. Fix is two-part, in both prompts: hatching is nearly instantaneous (rupture -> exit -> empty shell within a frame or two) while late pretzel wriggles WITHIN the shell, and a multi-frame hatching streak in history is self-contradictory - evidence the earlier calls were wrong, not something to continue. embryos 5-8 (2cfd8f4e), claude-opus-4-6, 3 seeds: exact 69.5% +/- 0.2 (hybrid_annot: 66.3 +/- 0.2), adjacent 87.8%. embryo_5 end-frames: 0/33 hatching misreads, was 33/33. hatched stays 100% - the guard does not delay real hatch detection. Seed1 of the run was recovered after an API 529 killed the original process mid-seed.

…sha/harness-rebuild

make_context only tried the platform default backend (X11 on Linux), which cannot work on headless machines. Try EGL second and surface both errors if neither backend works. Verified on a headless box with Mesa llvmpipe: smoke test passes, ~49ms per 512x512 render at 384 steps.
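
The fallback described above (try the default backend, then EGL, and report both errors if neither works) can be sketched generically. This is not the PR's make_context: the function name and the factory callables are hypothetical, and real factories would wrap whatever context-creation call the renderer uses.

```python
def create_with_fallback(factories):
    """Try each (name, factory) pair in order.

    Returns (name, context) for the first factory that succeeds.
    If every factory raises, raises RuntimeError listing all failures,
    so a headless user sees why each backend was rejected.
    """
    errors = []
    for name, factory in factories:
        try:
            return name, factory()
        except Exception as exc:  # backend errors vary by platform
            errors.append(f"{name}: {exc!r}")
    raise RuntimeError("no GL backend available; " + "; ".join(errors))
```

A caller would pass something like `[("default", make_default), ("egl", make_egl)]`, where both callables are placeholders. Collecting every error instead of re-raising only the last one is what makes a double failure diagnosable.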

…based Hatching is not nearly instantaneous - it takes 2-3 minutes. How many frames show it depends entirely on imaging cadence: this series is imaged every ~4 minutes, so hatching spans at most a frame or two and may be missed outright, but a 20-second cadence would capture ~6 hatching frames. The operative rule (a multi-frame hatching streak at THIS frame rate is self-contradictory) is unchanged; only its justification is corrected. The recorded 69.5% run used the earlier wording - prompt_sha in its events.jsonl reflects that.

Hatching takes 2-3 minutes of real time; how many frames capture it depends on the acquisition interval (4-minute imaging: at most 1 frame, possibly missed; 20-second imaging: ~6 frames). Hardcoding 'a frame or two' bakes this series' cadence into the prompt. The solver now measures the median frame interval from the acquisition timestamps in the volume filenames and fills in the numbers at prompt build time: the hatching span, the hatching-streak contradiction length, and the 2fold->pretzel timing window (previously hardcoded '10-15 frames'). Falls back to qualitative phrasing when filenames carry no timestamps. hybrid_annotviews composes the same dynamic prompts.
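
The cadence arithmetic can be sketched as follows. The filename timestamp pattern is a hypothetical convention (the real volume filenames are not shown here), and rounding the lower bound down and the upper bound up is an assumption about how the span is reported. The 2-3 minute hatching duration is taken from the text above.

```python
import math
import re
from datetime import datetime
from statistics import median

# Hypothetical convention: a YYYYMMDDTHHMMSS stamp somewhere in the filename.
TIMESTAMP = re.compile(r"(\d{8}T\d{6})")
HATCH_SECONDS = (120, 180)  # "2-3 minutes", from the PR description


def median_interval_seconds(filenames):
    """Median gap between consecutive acquisition times, or None if unknown."""
    times = []
    for name in filenames:
        match = TIMESTAMP.search(name)
        if match:
            times.append(datetime.strptime(match.group(1), "%Y%m%dT%H%M%S"))
    if len(times) < 2:
        return None  # caller falls back to qualitative phrasing
    times.sort()
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    return median(gaps)


def hatching_frame_range(interval_s):
    """(min, max) number of frames that can fall inside one hatching event."""
    shortest, longest = HATCH_SECONDS
    return math.floor(shortest / interval_s), math.ceil(longest / interval_s)
```

With a 240-second interval this gives (0, 1), matching "at most 1 frame, possibly missed". With a 20-second interval it gives (6, 9), whose lower bound matches the "~6 frames" quoted above.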

embryos 5-8, claude-opus-4-6, 3 seeds: exact 66.5 +/- 4.0 (seeds 68.4/69.2/62.0), adjacent 85.2 - vs hybrid_annot2 69.5 +/- 0.2. 2fold dropped 26 -> 17.5, pretzel destabilized (65.3 +/- 9.7), hatching fix held (hatched 100%, embryo_5 cascade still dead). Third negative result for extra static views, now with true raymarched renders ruling out the projection-collapse explanation - extra static images dilute attention on this task rather than adding usable evidence. hybrid_annot2 remains the best solver at 69.5%.

view3d: model-callable camera for the annotator's raymarcher - yaw/pitch relative to his default pose plus the intensity-threshold knob; bounded, deterministic, golden-tested, content-address cached. hybrid_agentic3d: cadence-aware annot2 ruleset + view3d on a 4-step budget, prompt licensing answering without the tool when projections suffice. embryos 5-8 (2cfd8f4e), claude-opus-4-6, 3 seeds: exact 64.9 +/- 5.3 (seeds 69.8/65.5/59.3), adjacent 82.0 (hybrid_annot2: 69.5 +/- 0.2, 87.8). The model used the tool selectively as prompted (~33% of frames), but renders misled at the boundaries they were meant to resolve - seed2 re-opened the pretzel->hatched failure (70 frames) that annot2's text rules had fixed. Fourth negative result for 3D views; hybrid_annot2 remains the best solver at 69.5%.
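
Because view3d is deterministic, its renders can be cached by content address. The sketch below shows one way to do that. The key layout, the class, and the parameter names are illustrative assumptions, not the PR's cache.

```python
import hashlib
import json


def cache_key(volume_id, params):
    """Stable key: same volume and same params always hash identically."""
    payload = json.dumps(
        {"volume": volume_id, "params": params},
        sort_keys=True,            # dict ordering must not change the key
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class RenderCache:
    """In-memory cache in front of a deterministic render function."""

    def __init__(self, render_fn):
        self._render_fn = render_fn
        self._store = {}
        self.misses = 0

    def get(self, volume_id, params):
        key = cache_key(volume_id, params)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._render_fn(volume_id, params)
        return self._store[key]
```

This only works because the render depends on nothing but the volume and the call parameters. Any hidden state, such as a random seed or wall-clock time, would make cached results wrong.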

…e result) view3d gains zoom_pct (up to 4x, hi-res 1024 internal render) centered on a model-picked point (center_x/y_pct) - full parity with the human annotator's viewer: rotate, zoom, click around, threshold. hybrid_explorer3d: cadence-aware annot2 ruleset + the extended tool on an 8-step survey->zoom->decide budget, plus an eggshell guard (renders show signal, not the shell wall - hatching calls stay owned by the projections and the elapsed-time rules). embryos 5-8 (2cfd8f4e), claude-opus-4-6, 3 seeds: exact 67.1 +/- 1.1, adjacent 85.9 (hybrid_annot2: 69.5 +/- 0.2, 87.8). Best-behaved 3D variant (variance 5.3 -> 1.1, hatching relapse mostly gone) but still -2.4 vs text-only; dominant confusions unchanged (pretzel->2fold ~60/seed). Fifth negative for 3D input: the bottleneck is boundary calibration, not visual information.

…lat) Five contrastive transition descriptions (the first visible change at each boundary, derived from embryos 2-3 ground-truth transition frames - no eval leakage) layered onto hybrid_explorer3d's full viewer autonomy. embryos 5-8 (2cfd8f4e), claude-opus-4-6, 3 seeds: exact 67.0 +/- 1.6, adjacent 86.4 - ties hybrid_explorer3d (67.1), still -2.5 vs text-only hybrid_annot2 (69.5). Target stages unmoved (comma ~5%, bean ~29%, 1.5fold ~14%, 2fold ~18%). The contrastive text neither helped nor hurt under autonomy; text-only arm untested.

…nmoved) Temporal pairing: the model sees the previous frame's projection beside the current one, with the analysis reframed as did-a-boundary-just-get-crossed, plus the contrastive boundary vocabulary. embryos 5-8 (2cfd8f4e), claude-opus-4-6, 3 seeds: exact 68.9 +/- 1.3 (seed0 70.2), adjacent 87.0 - ties hybrid_annot2 (69.5 +/- 0.2). Decisive mechanistic null: failure clusters identical to annot2 (lag median 8 vs 9, behind 655 vs 641, windows missed 35 vs 33). The transition lag survives direct frame-to-frame comparison, so it is an under-updating decision behavior, not a perception gap. Input-side interventions are now thoroughly falsified; remaining leverage is in the decision rules.

Dropdown entries are prefixed with the run's week ([Week of 6/8]) and sorted newest-first so weeks cluster. Each report shows a description panel under the header with the solver docstring's first paragraph - what the experiment tried (results paragraphs deliberately excluded; the report's numbers speak for the outcome).

Minimal ablation isolating hybrid_pairwise's previous-frame image: system prompts byte-identical to hybrid_annot2, only the labeled previous-frame image added to the user turn. embryos 5-8 (2cfd8f4e), claude-opus-4-6, 3 seeds: exact 63.7 +/- 1.6, adjacent 85.0. Paired frame-level bootstrap vs annot2: -5.7 points, 95% CI [-7.2, -4.3] (the full pairwise bundle: -0.5, CI [-1.6, +0.5] - a tie). Confusions shift uniformly laggier. An unguided previous frame acts as a visual persistence anchor; pairwise's comparison-first instruction was not masking a benefit but repairing the damage. Temporal visual context requires explicit comparison framing.
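
A paired frame-level bootstrap of the kind quoted above can be sketched as below. It assumes each frame contributes one correct/incorrect outcome per solver and that frames are resampled with replacement as pairs. How the actual analysis handles seeds is not stated in this PR, and the function name is hypothetical.

```python
import random


def paired_bootstrap_ci(correct_a, correct_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile CI for accuracy(b) - accuracy(a), in percentage points.

    correct_a[i] and correct_b[i] are 0/1 outcomes for the SAME frame, so
    resampling the per-frame differences keeps the pairing intact.
    Returns (point_estimate, ci_low, ci_high).
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("inputs must be non-empty and equal length")
    diffs = [int(b) - int(a) for a, b in zip(correct_a, correct_b)]
    n = len(diffs)
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        sample = rng.choices(diffs, k=n)  # resample frames with replacement
        means.append(100.0 * sum(sample) / n)
    means.sort()
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    point = 100.0 * sum(diffs) / n
    return point, low, high
```

Pairing matters here because both solvers are scored on the same frames: hard frames hurt both arms, and differencing per frame removes that shared variance. An interval that excludes zero, like the [-7.2, -4.3] above, indicates a real difference, while one that spans zero is consistent with a tie.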

The experiment panel now shows the solver docstring's complete pre-RESULT content as formatted paragraphs (motivation, what changed, how it compares) instead of only the first paragraph, plus a setup line (solver name, tools, one-shot vs agentic step budget). Rendered via DOM textContent, not innerHTML.

Frame modal now shows a '3D navigation' filmstrip for agentic runs: each view3d call as a numbered card with a human-readable caption (yaw/pitch/threshold/zoom@center) and the render the model saw at that step, followed by a labeled 'model response' block with the full classification reasoning. Also fixes a media collision: the harness keys tool-call media files by model-turn index, so parallel calls in one turn overwrote each other's image. view3d is deterministic, so the report re-renders every step from its recorded params (hitting the run's own dispatch cache) into per-step nav assets - verified distinct images for previously colliding steps.
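
The media collision can be illustrated with a toy example. The file naming is hypothetical, and this is not the fix used here: the report sidesteps the collision by re-rendering each step from its recorded params. The sketch only shows why keying by turn index loses images and how a per-call key would avoid that.

```python
def store_by_turn(calls):
    """Key media by model-turn index only (the colliding scheme)."""
    media = {}
    for turn, _call_idx, image in calls:
        media[f"turn{turn}.png"] = image  # a later parallel call overwrites
    return media


def store_by_turn_and_call(calls):
    """Key media by (turn, call position), so parallel calls stay distinct."""
    media = {}
    for turn, call_idx, image in calls:
        media[f"turn{turn}_call{call_idx}.png"] = image
    return media


# Two parallel tool calls in turn 3, then one call in turn 4.
CALLS = [(3, 0, "render_a"), (3, 1, "render_b"), (4, 0, "render_c")]
```

With `CALLS`, the first scheme keeps two files and silently drops `render_a`, while the second keeps all three.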

docs/FINDINGS.md: the full experiment ladder (11 experiments with numbers), the findings (every win came from decision rules, every image-side change was flat or negative; transition lag is an under-updating decision behavior, not a perception gap; the labels are partly time-derived; infra lessons), and six suggested next steps. The report generator renders it to runs/findings.html with the report styling (minimal markdown renderer: headings, bold, code, tables, lists), and every report page gets a report|findings tab bar linking to it via relative path.

trisha-ant added 15 commits June 5, 2026 21:00

Merge remote-tracking branch 'origin/trisha/harness-rebuild' into tri…

8b7e522

…sha/harness-rebuild

trisha-ant wants to merge 15 commits into main from trisha/harness-rebuild
