hybrid_annot2: fix late-pretzel vs hatching confusion (66.3% -> 69.5%) #11
Open
trisha-ant wants to merge 15 commits into
Kills the 'ahead' failure cluster: 33 end-of-series embryo_5 frames (x3 seeds = 99 predictions) where a shell-filling, vigorously moving late pretzel was called hatching and history anchoring locked the streak in. Fix is two-part, in both prompts: hatching is nearly instantaneous (rupture -> exit -> empty shell within a frame or two) while late pretzel wriggles WITHIN the shell, and a multi-frame hatching streak in history is self-contradictory - evidence the earlier calls were wrong, not something to continue. embryos 5-8 (2cfd8f4e), claude-opus-4-6, 3 seeds: exact 69.5% +/- 0.2 (hybrid_annot: 66.3 +/- 0.2), adjacent 87.8%. embryo_5 end-frames: 0/33 hatching misreads, was 33/33. hatched stays 100% - the guard does not delay real hatch detection. Seed1 of the run was recovered after an API 529 killed the original process mid-seed.
…sha/harness-rebuild
make_context only tried the platform default backend (X11 on Linux), which cannot work on headless machines. Try EGL second and surface both errors if neither backend works. Verified on a headless box with Mesa llvmpipe: smoke test passes, ~49ms per 512x512 render at 384 steps.
…based Hatching is not nearly instantaneous - it takes 2-3 minutes. How many frames show it depends entirely on imaging cadence: this series is imaged every ~4 minutes, so hatching spans at most a frame or two and may be missed outright, but a 20-second cadence would capture ~6 hatching frames. The operative rule (a multi-frame hatching streak at THIS frame rate is self-contradictory) is unchanged; only its justification is corrected. The recorded 69.5% run used the earlier wording - prompt_sha in its events.jsonl reflects that.
Hatching takes 2-3 minutes of real time; how many frames capture it depends on the acquisition interval (4-minute imaging: at most 1 frame, possibly missed; 20-second imaging: ~6 frames). Hardcoding 'a frame or two' bakes this series' cadence into the prompt. The solver now measures the median frame interval from the acquisition timestamps in the volume filenames and fills in the numbers at prompt build time: the hatching span, the hatching-streak contradiction length, and the 2fold->pretzel timing window (previously hardcoded '10-15 frames'). Falls back to qualitative phrasing when filenames carry no timestamps. hybrid_annotviews composes the same dynamic prompts.
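A rough sketch of the cadence measurement follows. The filename timestamp format (`YYYYMMDDTHHMMSS`), the function names, and the choice of 120 s (the low end of the stated 2-3 minutes) as the hatching duration are all assumptions made for illustration; the solver's real parsing and wording may differ.

```python
import math
import re
from datetime import datetime
from statistics import median

# Assumed timestamp shape inside a volume filename, e.g. vol_20240608T100400.tif
_TS = re.compile(r"(\d{8}T\d{6})")


def median_interval_s(filenames):
    """Median seconds between consecutive acquisitions, or None if unknown."""
    stamps = []
    for name in filenames:
        m = _TS.search(name)
        if m:
            stamps.append(datetime.strptime(m.group(1), "%Y%m%dT%H%M%S"))
    if len(stamps) < 2:
        return None
    stamps.sort()
    gaps = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
    return median(gaps)


def hatching_frames(interval_s, hatch_s=120.0):
    """Upper bound on frames that can capture a hatch lasting hatch_s seconds."""
    return max(1, math.ceil(hatch_s / interval_s))


def hatching_span_phrase(filenames):
    """Prompt fragment: numeric when cadence is known, qualitative otherwise."""
    interval = median_interval_s(filenames)
    if interval is None or interval <= 0:
        return "hatching is brief relative to most imaging intervals"
    n = hatching_frames(interval)
    return f"at ~{interval:.0f}s per frame, hatching spans at most {n} frame(s)"
```

With these assumptions a 4-minute cadence yields at most 1 frame and a 20-second cadence yields 6, in line with the figures quoted in the commit message.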
embryos 5-8, claude-opus-4-6, 3 seeds: exact 66.5 +/- 4.0 (seeds 68.4/69.2/62.0), adjacent 85.2 - vs hybrid_annot2 69.5 +/- 0.2. 2fold dropped 26 -> 17.5, pretzel destabilized (65.3 +/- 9.7), hatching fix held (hatched 100%, embryo_5 cascade still dead). Third negative result for extra static views, now with true raymarched renders ruling out the projection-collapse explanation - extra static images dilute attention on this task rather than adding usable evidence. hybrid_annot2 remains the best solver at 69.5%.
view3d: model-callable camera for the annotator's raymarcher - yaw/pitch relative to its default pose plus the intensity-threshold knob; bounded, deterministic, golden-tested, content-address cached. hybrid_agentic3d: cadence-aware annot2 ruleset + view3d on a 4-step budget, prompt licensing answering without the tool when projections suffice. embryos 5-8 (2cfd8f4e), claude-opus-4-6, 3 seeds: exact 64.9 +/- 5.3 (seeds 69.8/65.5/59.3), adjacent 82.0 (hybrid_annot2: 69.5 +/- 0.2, 87.8). The model used the tool selectively as prompted (~33% of frames), but renders misled at the boundaries they were meant to resolve - seed2 re-opened the pretzel->hatched failure (70 frames) that annot2's text rules had fixed. Fourth negative result for 3D views; hybrid_annot2 remains the best solver at 69.5%.
…e result) view3d gains zoom_pct (up to 4x, hi-res 1024 internal render) centered on a model-picked point (center_x/y_pct) - full parity with the human annotator's viewer: rotate, zoom, click around, threshold. hybrid_explorer3d: cadence-aware annot2 ruleset + the extended tool on an 8-step survey->zoom->decide budget, plus an eggshell guard (renders show signal, not the shell wall - hatching calls stay owned by the projections and the elapsed-time rules). embryos 5-8 (2cfd8f4e), claude-opus-4-6, 3 seeds: exact 67.1 +/- 1.1, adjacent 85.9 (hybrid_annot2: 69.5 +/- 0.2, 87.8). Best-behaved 3D variant (variance 5.3 -> 1.1, hatching relapse mostly gone) but still -2.4 vs text-only; dominant confusions unchanged (pretzel->2fold ~60/seed). Fifth negative for 3D input: the bottleneck is boundary calibration, not visual information.
…lat) Five contrastive transition descriptions (the first visible change at each boundary, derived from embryos 2-3 ground-truth transition frames - no eval leakage) layered onto hybrid_explorer3d's full viewer autonomy. embryos 5-8 (2cfd8f4e), claude-opus-4-6, 3 seeds: exact 67.0 +/- 1.6, adjacent 86.4 - ties hybrid_explorer3d (67.1), still -2.5 vs text-only hybrid_annot2 (69.5). Target stages unmoved (comma ~5%, bean ~29%, 1.5fold ~14%, 2fold ~18%). The contrastive text neither helped nor hurt under autonomy; text-only arm untested.
…nmoved) Temporal pairing: the model sees the previous frame's projection beside the current one, with the analysis reframed as did-a-boundary-just-get-crossed, plus the contrastive boundary vocabulary. embryos 5-8 (2cfd8f4e), claude-opus-4-6, 3 seeds: exact 68.9 +/- 1.3 (seed0 70.2), adjacent 87.0 - ties hybrid_annot2 (69.5 +/- 0.2). Decisive mechanistic null: failure clusters identical to annot2 (lag median 8 vs 9, behind 655 vs 641, windows missed 35 vs 33). The transition lag survives direct frame-to-frame comparison, so it is an under-updating decision behavior, not a perception gap. Input-side interventions are now thoroughly falsified; remaining leverage is in the decision rules.
Dropdown entries are prefixed with the run's week ([Week of 6/8]) and sorted newest-first so weeks cluster. Each report shows a description panel under the header with the solver docstring's first paragraph - what the experiment tried (results paragraphs deliberately excluded; the report's numbers speak for the outcome).
Minimal ablation isolating hybrid_pairwise's previous-frame image: system prompts byte-identical to hybrid_annot2, only the labeled previous-frame image added to the user turn. embryos 5-8 (2cfd8f4e), claude-opus-4-6, 3 seeds: exact 63.7 +/- 1.6, adjacent 85.0. Paired frame-level bootstrap vs annot2: -5.7 points, 95% CI [-7.2, -4.3] (the full pairwise bundle: -0.5, CI [-1.6, +0.5] - a tie). Confusions shift uniformly laggier. An unguided previous frame acts as a visual persistence anchor; pairwise's comparison-first instruction was not masking a benefit but repairing the damage. Temporal visual context requires explicit comparison framing.
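For readers unfamiliar with the comparison, a paired frame-level bootstrap can be sketched as follows. This is an illustrative version only: the function name, the 0/1 per-frame correctness inputs, the resample count, and the percentile interval are assumptions, not the analysis script behind the numbers above.

```python
import random


def paired_bootstrap_ci(correct_a, correct_b, n_boot=2000, seed=0):
    """Point estimate and 95% percentile CI for accuracy(b) - accuracy(a).

    correct_a / correct_b are 0/1 flags for the SAME frames under two solvers.
    Resampling frames (not solvers) keeps each frame's pair together, which is
    what makes the bootstrap "paired". Results are in percentage points.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both solvers must be scored on the same frames")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = [b - a for a, b in zip(correct_a, correct_b)]
    point = 100.0 * sum(diffs) / n
    samples = []
    for _ in range(n_boot):
        # Draw n frames with replacement and recompute the accuracy gap.
        total = sum(diffs[rng.randrange(n)] for _ in range(n))
        samples.append(100.0 * total / n)
    samples.sort()
    lo = samples[int(0.025 * n_boot)]
    hi = samples[int(0.975 * n_boot) - 1]
    return point, lo, hi
```

An interval that excludes zero, like the [-7.2, -4.3] reported for the ablation, indicates a difference unlikely to be resampling noise, while one that straddles zero reads as a tie.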
The experiment panel now shows the solver docstring's complete pre-RESULT content as formatted paragraphs (motivation, what changed, how it compares) instead of only the first paragraph, plus a setup line (solver name, tools, one-shot vs agentic step budget). Rendered via DOM textContent, not innerHTML.
Frame modal now shows a '3D navigation' filmstrip for agentic runs: each view3d call as a numbered card with a human-readable caption (yaw/pitch/threshold/zoom@center) and the render the model saw at that step, followed by a labeled 'model response' block with the full classification reasoning. Also fixes a media collision: the harness keys tool-call media files by model-turn index, so parallel calls in one turn overwrote each other's image. view3d is deterministic, so the report re-renders every step from its recorded params (hitting the run's own dispatch cache) into per-step nav assets - verified distinct images for previously colliding steps.
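The collision and one way to avoid it can be illustrated with a small naming sketch. The function name, the parameter dictionary, and the file-name layout are hypothetical; the point is only that a key built from the turn index alone cannot distinguish parallel calls, while adding the call index and a digest of the recorded params can.

```python
import hashlib
import json


def nav_asset_name(turn, call_index, params):
    """Per-step asset name that stays unique for parallel calls in one turn.

    The digest is taken over a canonical JSON form (sorted keys, no spaces),
    so the same params always map to the same name regardless of key order.
    """
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
    return f"turn{turn:03d}_call{call_index:02d}_{digest}.png"


if __name__ == "__main__":
    # Two parallel calls in the same turn: keyed by turn alone they would
    # share one file; here they get distinct names.
    print(nav_asset_name(3, 0, {"yaw": 30, "pitch": 0, "threshold": 0.2}))
    print(nav_asset_name(3, 1, {"yaw": 90, "pitch": 0, "threshold": 0.2}))
```

Because view3d is described as deterministic, a name derived from the params is enough to re-render or look up the exact image for a step.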
docs/FINDINGS.md: the full experiment ladder (11 experiments with numbers), the findings (every win came from decision rules, every image-side change was flat or negative; transition lag is an under-updating decision behavior, not a perception gap; the labels are partly time-derived; infra lessons), and six suggested next steps. The report generator renders it to runs/findings.html with the report styling (minimal markdown renderer: headings, bold, code, tables, lists), and every report page gets a report|findings tab bar linking to it via relative path.
Follow-up to #10, which merged at hybrid_annot (66.3% exact on embryos 5-8).
What this fixes
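The largest remaining single failure was the 'ahead' cluster: the last 33 frames of embryo_5 (x3 seeds = 99 predictions) were classified as hatching while the ground truth stays pretzel. Two compounding causes, visible in the transcripts: a shell-filling, vigorously moving late pretzel was read as hatching, and history anchoring then locked the streak in.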
embryo_5 is the one embryo whose recording ends before hatch, so nothing ever corrected the streak.
The fix
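Prompt-only, in both system prompts: (1) hatching is brief (2-3 minutes of real time, so at most a frame or two at this series' ~4-minute cadence), while a late pretzel wriggles within the shell; (2) a multi-frame hatching streak in the history is self-contradictory at this frame rate, which is evidence that the earlier calls were wrong, not something to continue.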
Results (embryos 5-8, claude-opus-4-6, 3 seeds)
Also included: a small render fix - make_context now falls back to EGL when the default GL backend fails, which is required on headless machines (verified against Mesa llvmpipe; smoke test passes at ~49ms per 512x512 render).