
hybrid_annot2: fix late-pretzel vs hatching confusion (66.3% -> 69.5%) #11

Open
trisha-ant wants to merge 15 commits into main from trisha/harness-rebuild

Conversation

trisha-ant commented Jun 9, 2026

Collaborator

Follow-up to #10, which merged at hybrid_annot (66.3% exact on embryos 5-8).

What this fixes

The largest remaining single failure was the 'ahead' cluster: the last 33 frames of embryo_5 (x3 seeds = 99 predictions) were classified as hatching while the ground truth stays pretzel. Two compounding causes, visible in the transcripts:

  1. A very late pretzel fills the eggshell, presses against it, and moves vigorously between frames. The model read that as emergence ("the body appears to extend beyond the eggshell boundary").
  2. Once one hatching call landed, history anchoring locked it in ("T144 and T145 were hatching, so the embryo has been hatching for at least 2 frames") - 30+ frames of self-reinforcing error.

embryo_5 is the one embryo whose recording ends before hatch, so nothing ever corrected the streak.

The fix

Prompt-only, in both system prompts:

  • Hatching takes 2-3 minutes, and this series is imaged every ~4 minutes - so hatching spans at most a frame or two and may be missed outright (the frame count is cadence-dependent: 20-second imaging would capture ~6 hatching frames). Late pretzel instead wriggles WITHIN the shell for a long time - which mimics emergence but is still pretzel as long as the bright mass stays shell-sized. (Kesavan's embryo_5 notes describe exactly this: "the worm moves a lot ... within the egg shell".)
  • A multi-frame hatching streak in the history is self-contradictory; treat it as evidence the earlier calls were wrong rather than something to continue.
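
The streak rule reduces to simple arithmetic. The sketch below is not part of this PR (the fix is prompt-only); the function names and the `floor(D / I) + 1` bound are illustrative assumptions, using the 2-3 minute hatching duration and ~4 minute cadence stated above.

```python
import math
from typing import Sequence


def max_hatching_frames(hatch_minutes_max: float, interval_minutes: float) -> int:
    """Upper bound on consecutive frames that can show an event of this duration.

    An event lasting D, sampled every I, can land on at most floor(D / I) + 1
    sample points.
    """
    return math.floor(hatch_minutes_max / interval_minutes) + 1


def hatching_streak_is_contradictory(
    history: Sequence[str],
    interval_minutes: float,
    hatch_minutes_max: float = 3.0,  # upper end of the 2-3 minute range
) -> bool:
    """True when the trailing run of 'hatching' labels is longer than physically possible."""
    streak = 0
    for label in reversed(history):
        if label != "hatching":
            break
        streak += 1
    return streak > max_hatching_frames(hatch_minutes_max, interval_minutes)


# At ~4-minute imaging a second consecutive hatching call already exceeds the bound.
print(hatching_streak_is_contradictory(["pretzel", "hatching", "hatching"], 4.0))
```
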

Results (embryos 5-8, claude-opus-4-6, 3 seeds)

  • exact 69.5% +/- 0.2 (from 66.3 +/- 0.2), adjacent 87.8%
  • embryo_5 end frames: 0/33 hatching misreads, was 33/33 in every seed
  • hatched detection on embryos 6-8 stays 100% - the guard does not delay real hatches
  • pretzel 62% -> 70%, everything else unchanged within seed noise
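
For readers unfamiliar with the two headline numbers, here is a minimal sketch of how exact and adjacent accuracy could be computed. It assumes "adjacent" means the prediction is within one stage of the ground truth, and the stage list and its ordering are assumptions for illustration; the harness's actual definitions may differ.

```python
from typing import Sequence, Tuple

# Assumed developmental order, using only stage names that appear in this PR.
STAGES = ["bean", "comma", "1.5fold", "2fold", "pretzel", "hatching", "hatched"]


def score(preds: Sequence[str], truths: Sequence[str]) -> Tuple[float, float]:
    """Return (exact, adjacent) accuracy as fractions in [0, 1]."""
    idx = {stage: i for i, stage in enumerate(STAGES)}
    n = len(truths)
    exact = sum(p == t for p, t in zip(preds, truths)) / n
    # Adjacent accuracy also credits predictions one stage off in either direction.
    adjacent = sum(abs(idx[p] - idx[t]) <= 1 for p, t in zip(preds, truths)) / n
    return exact, adjacent


exact, adjacent = score(
    ["pretzel", "hatching", "bean", "2fold"],
    ["pretzel", "pretzel", "2fold", "2fold"],
)
print(exact, adjacent)
```
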

Also included: a small render fix - make_context now falls back to EGL when the default GL backend fails, which is required on headless machines (verified against Mesa llvmpipe; smoke test passes at ~49ms per 512x512 render).

trisha-ant added 15 commits June 5, 2026 21:00
Kills the 'ahead' failure cluster: 33 end-of-series embryo_5 frames (x3
seeds = 99 predictions) where a shell-filling, vigorously moving late
pretzel was called hatching and history anchoring locked the streak in.
Fix is two-part, in both prompts: hatching is nearly instantaneous
(rupture -> exit -> empty shell within a frame or two) while late pretzel
wriggles WITHIN the shell, and a multi-frame hatching streak in history is
self-contradictory - evidence the earlier calls were wrong, not something
to continue.

embryos 5-8 (2cfd8f4e), claude-opus-4-6, 3 seeds:
exact 69.5% +/- 0.2 (hybrid_annot: 66.3 +/- 0.2), adjacent 87.8%.
embryo_5 end-frames: 0/33 hatching misreads, was 33/33. hatched stays
100% - the guard does not delay real hatch detection. Seed1 of the run
was recovered after an API 529 killed the original process mid-seed.

make_context only tried the platform default backend (X11 on Linux),
which cannot work on headless machines. Try EGL second and surface both
errors if neither backend works. Verified on a headless box with Mesa
llvmpipe: smoke test passes, ~49ms per 512x512 render at 384 steps.
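
The try-default-then-EGL pattern described here can be sketched independently of any GL library. The version below takes the backend constructors as plain callables, so nothing in it reflects the real `make_context` signature; the function name and the backend labels are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def make_context_with_fallback(backends: Sequence[Tuple[str, Callable[[], T]]]) -> T:
    """Try each (name, factory) in order; return the first context that builds.

    If every backend fails, raise one error that reports all failures, so a
    headless-machine problem is not hidden behind the first backend's message.
    """
    errors: List[str] = []
    for name, factory in backends:
        try:
            return factory()
        except Exception as exc:  # backend libraries raise assorted error types
            errors.append(f"{name}: {exc!r}")
    raise RuntimeError("no GL backend available; tried " + "; ".join(errors))


def _no_display() -> str:
    # Stand-in for a default backend that needs an X11 display.
    raise OSError("cannot open display")


ctx = make_context_with_fallback([("default", _no_display), ("egl", lambda: "egl-context")])
print(ctx)
```
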
…based

Hatching is not nearly instantaneous - it takes 2-3 minutes. How many
frames show it depends entirely on imaging cadence: this series is imaged
every ~4 minutes, so hatching spans at most a frame or two and may be
missed outright, but a 20-second cadence would capture ~6 hatching frames.
The operative rule (a multi-frame hatching streak at THIS frame rate is
self-contradictory) is unchanged; only its justification is corrected.
The recorded 69.5% run used the earlier wording - prompt_sha in its
events.jsonl reflects that.

Hatching takes 2-3 minutes of real time; how many frames capture it
depends on the acquisition interval (4-minute imaging: at most 1 frame,
possibly missed; 20-second imaging: ~6 frames). Hardcoding 'a frame or
two' bakes this series' cadence into the prompt.

The solver now measures the median frame interval from the acquisition
timestamps in the volume filenames and fills in the numbers at prompt
build time: the hatching span, the hatching-streak contradiction length,
and the 2fold->pretzel timing window (previously hardcoded '10-15
frames'). Falls back to qualitative phrasing when filenames carry no
timestamps. hybrid_annotviews composes the same dynamic prompts.
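
A minimal sketch of the cadence measurement and the qualitative fallback described above. The filename timestamp format (`YYYYMMDDTHHMMSS`), the function names, and the generated wording are all assumptions for illustration; the solver's real parsing and prompt text are not shown in this PR.

```python
import re
from datetime import datetime
from statistics import median
from typing import Iterable, Optional, Tuple

# Hypothetical timestamp format embedded in volume filenames.
_STAMP = re.compile(r"(\d{8}T\d{6})")


def median_interval_seconds(filenames: Iterable[str]) -> Optional[float]:
    """Median gap between consecutive acquisition timestamps, or None if unavailable."""
    stamps = []
    for name in filenames:
        match = _STAMP.search(name)
        if match:
            stamps.append(datetime.strptime(match.group(1), "%Y%m%dT%H%M%S"))
    stamps.sort()
    if len(stamps) < 2:
        return None
    return median((b - a).total_seconds() for a, b in zip(stamps, stamps[1:]))


def hatching_span_phrase(
    interval_s: Optional[float],
    hatch_s: Tuple[float, float] = (120.0, 180.0),  # the 2-3 minute hatching duration
) -> str:
    """Prompt fragment for the hatching span, qualitative when cadence is unknown."""
    if interval_s is None:
        return "hatching is brief relative to the stages around it"
    lo, hi = int(hatch_s[0] // interval_s), int(hatch_s[1] // interval_s)
    if hi == 0:
        return (
            f"at {interval_s / 60:.0f}-minute imaging, hatching spans "
            "at most 1 frame and may be missed"
        )
    return f"hatching spans roughly {lo}-{hi} frames at this cadence"


names = ["vol_20260605T210000.tif", "vol_20260605T210400.tif", "vol_20260605T210800.tif"]
print(hatching_span_phrase(median_interval_seconds(names)))
```
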
embryos 5-8, claude-opus-4-6, 3 seeds: exact 66.5 +/- 4.0 (seeds
68.4/69.2/62.0), adjacent 85.2 - vs hybrid_annot2 69.5 +/- 0.2.
2fold dropped 26 -> 17.5, pretzel destabilized (65.3 +/- 9.7), hatching
fix held (hatched 100%, embryo_5 cascade still dead). Third negative
result for extra static views, now with true raymarched renders ruling
out the projection-collapse explanation - extra static images dilute
attention on this task rather than adding usable evidence.
hybrid_annot2 remains the best solver at 69.5%.

view3d: model-callable camera for the annotator's raymarcher - yaw/pitch
relative to his default pose plus the intensity-threshold knob; bounded,
deterministic, golden-tested, content-address cached.
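
One way "bounded, deterministic, content-address cached" can fit together is to clamp the parameters first and then hash the canonical form. This is a sketch only: the parameter bounds, the rounding, and the key layout are assumptions, not the view3d implementation.

```python
import hashlib
import json


def _clamp(value: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, float(value)))


def view3d_cache_key(volume_sha: str, yaw_deg: float, pitch_deg: float, threshold: float) -> str:
    """Content address for a render: same bounded parameters always give the same key."""
    params = {
        "volume": volume_sha,
        # Hypothetical bounds; clamping before hashing means out-of-range
        # requests share a cache entry with the nearest legal view.
        "yaw": round(_clamp(yaw_deg, -180.0, 180.0), 1),
        "pitch": round(_clamp(pitch_deg, -90.0, 90.0), 1),
        "threshold": round(_clamp(threshold, 0.0, 1.0), 3),
    }
    blob = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


print(view3d_cache_key("2cfd8f4e", yaw_deg=30, pitch_deg=-15, threshold=0.4))
```
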

hybrid_agentic3d: cadence-aware annot2 ruleset + view3d on a 4-step
budget, prompt licensing answering without the tool when projections
suffice.

embryos 5-8 (2cfd8f4e), claude-opus-4-6, 3 seeds:
exact 64.9 +/- 5.3 (seeds 69.8/65.5/59.3), adjacent 82.0
(hybrid_annot2: 69.5 +/- 0.2, 87.8). The model used the tool selectively
as prompted (~33% of frames), but renders misled at the boundaries they
were meant to resolve - seed2 re-opened the pretzel->hatched failure
(70 frames) that annot2's text rules had fixed. Fourth negative result
for 3D views; hybrid_annot2 remains the best solver at 69.5%.
…e result)

view3d gains zoom_pct (up to 4x, hi-res 1024 internal render) centered
on a model-picked point (center_x/y_pct) - full parity with the human
annotator's viewer: rotate, zoom, click around, threshold.

hybrid_explorer3d: cadence-aware annot2 ruleset + the extended tool on
an 8-step survey->zoom->decide budget, plus an eggshell guard (renders
show signal, not the shell wall - hatching calls stay owned by the
projections and the elapsed-time rules).

embryos 5-8 (2cfd8f4e), claude-opus-4-6, 3 seeds:
exact 67.1 +/- 1.1, adjacent 85.9 (hybrid_annot2: 69.5 +/- 0.2, 87.8).
Best-behaved 3D variant (seed spread 5.3 -> 1.1, hatching relapse mostly
gone) but still -2.4 vs text-only; dominant confusions unchanged
(pretzel->2fold ~60/seed). Fifth negative for 3D input: the bottleneck
is boundary calibration, not visual information.
…lat)

Five contrastive transition descriptions (the first visible change at
each boundary, derived from embryos 2-3 ground-truth transition frames -
no eval leakage) layered onto hybrid_explorer3d's full viewer autonomy.

embryos 5-8 (2cfd8f4e), claude-opus-4-6, 3 seeds:
exact 67.0 +/- 1.6, adjacent 86.4 - ties hybrid_explorer3d (67.1),
still -2.5 vs text-only hybrid_annot2 (69.5). Target stages unmoved
(comma ~5%, bean ~29%, 1.5fold ~14%, 2fold ~18%). The contrastive text
neither helped nor hurt under autonomy; text-only arm untested.
…nmoved)

Temporal pairing: the model sees the previous frame's projection beside
the current one, with the analysis reframed as did-a-boundary-just-get-
crossed, plus the contrastive boundary vocabulary.

embryos 5-8 (2cfd8f4e), claude-opus-4-6, 3 seeds:
exact 68.9 +/- 1.3 (seed0 70.2), adjacent 87.0 - ties hybrid_annot2
(69.5 +/- 0.2). Decisive mechanistic null: failure clusters identical
to annot2 (lag median 8 vs 9, behind 655 vs 641, windows missed 35 vs
33). The transition lag survives direct frame-to-frame comparison, so
it is an under-updating decision behavior, not a perception gap. Input-
side interventions are now thoroughly falsified; remaining leverage is
in the decision rules.

Dropdown entries are prefixed with the run's week ([Week of 6/8]) and
sorted newest-first so weeks cluster. Each report shows a description
panel under the header with the solver docstring's first paragraph -
what the experiment tried (results paragraphs deliberately excluded;
the report's numbers speak for the outcome).

Minimal ablation isolating hybrid_pairwise's previous-frame image:
system prompts byte-identical to hybrid_annot2, only the labeled
previous-frame image added to the user turn.

embryos 5-8 (2cfd8f4e), claude-opus-4-6, 3 seeds:
exact 63.7 +/- 1.6, adjacent 85.0. Paired frame-level bootstrap vs
annot2: -5.7 points, 95% CI [-7.2, -4.3] (the full pairwise bundle:
-0.5, CI [-1.6, +0.5] - a tie). Confusions shift uniformly laggier.
An unguided previous frame acts as a visual persistence anchor;
pairwise's comparison-first instruction was not masking a benefit but
repairing the damage. Temporal visual context requires explicit
comparison framing.
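
For reference, a paired frame-level bootstrap of the kind quoted above can be sketched as follows. It assumes each solver's output is reduced to a per-frame 0/1 correctness vector over the same frames and uses a percentile interval; how the real analysis pools seeds or picks its resample count is not stated here, so those details are assumptions.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    correct_a: Sequence[int],
    correct_b: Sequence[int],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (point, lo, hi) for accuracy(A) - accuracy(B) in percentage points.

    Frames are resampled jointly, so each draw keeps A's and B's result on the
    same frame together; that pairing is what separates this from comparing
    two independent accuracies.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both solvers must be scored on the same frames")
    n = len(correct_a)
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    rng = random.Random(seed)
    means = sorted(100.0 * sum(rng.choices(diffs, k=n)) / n for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return 100.0 * sum(diffs) / n, lo, hi


# Toy data: solver A is wrong on 3 frames where B is right, and they agree elsewhere.
a = [1] * 47 + [0] * 3
b = [1] * 50
print(paired_bootstrap_ci(a, b))
```

An interval that excludes zero, like the -5.7 [-7.2, -4.3] result, indicates a real difference at the frame level; one that straddles zero, like the pairwise bundle's, is a tie.
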
The experiment panel now shows the solver docstring's complete
pre-RESULT content as formatted paragraphs (motivation, what changed,
how it compares) instead of only the first paragraph, plus a setup
line (solver name, tools, one-shot vs agentic step budget). Rendered
via DOM textContent, not innerHTML.

Frame modal now shows a '3D navigation' filmstrip for agentic runs:
each view3d call as a numbered card with a human-readable caption
(yaw/pitch/threshold/zoom@center) and the render the model saw at that
step, followed by a labeled 'model response' block with the full
classification reasoning.

Also fixes a media collision: the harness keys tool-call media files by
model-turn index, so parallel calls in one turn overwrote each other's
image. view3d is deterministic, so the report re-renders every step
from its recorded params (hitting the run's own dispatch cache) into
per-step nav assets - verified distinct images for previously colliding
steps.
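
The collision is easy to reproduce in miniature. In this sketch the file-naming schemes and function names are hypothetical; it only shows why keying by model-turn index loses images when a turn contains parallel calls, and why per-step keys do not.

```python
from typing import Callable, Dict, List, Tuple

# (model-turn index, call index within that turn); turn 1 issues two parallel calls.
CALLS: List[Tuple[int, int]] = [(0, 0), (1, 0), (1, 1), (2, 0)]


def key_by_turn(step: int, turn: int, call_idx: int) -> str:
    # Keying scheme that collides: every call in a turn maps to one file.
    return f"turn{turn:03d}.png"


def key_by_step(step: int, turn: int, call_idx: int) -> str:
    # Per-step key: one asset per tool call, regardless of turn grouping.
    return f"nav_step{step:03d}.png"


def write_all(key_fn: Callable[[int, int, int], str]) -> Dict[str, str]:
    """Simulate saving one render per call; later writes overwrite equal keys."""
    store: Dict[str, str] = {}
    for step, (turn, call_idx) in enumerate(CALLS):
        store[key_fn(step, turn, call_idx)] = f"render for step {step}"
    return store


print(len(write_all(key_by_turn)), "files by turn;", len(write_all(key_by_step)), "files by step")
```
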
docs/FINDINGS.md: the full experiment ladder (11 experiments with
numbers), the findings (every win came from decision rules, every
image-side change was flat or negative; transition lag is an
under-updating decision behavior, not a perception gap; the labels are
partly time-derived; infra lessons), and six suggested next steps.

The report generator renders it to runs/findings.html with the report
styling (minimal markdown renderer: headings, bold, code, tables,
lists), and every report page gets a report|findings tab bar linking to
it via relative path.
