Bump to claude-opus-4-7 + de-anchor scientific prompt by trisha-ant · Pull Request #7 · trisha-ant/gently-perception

trisha-ant · 2026-04-28T14:58:38Z

Stacked on #5 (→ #1).

Why

#5 established that z-slice segment-counting has no signal on claude-opus-4-6. The root cause — the VLM can't resolve 2-vs-3 blob structure in a z-slice — is exactly the kind of low-level perception task that claude-opus-4-7 is documented to be better at: it ships with high-resolution vision support (2576px long edge vs 1568px on 4.6) and improved pointing/counting/localization. This PR bumps the model and re-tests zslice_multi to see if the better vision rescues segment-counting.

It does not — but the migration itself surfaced a useful finding about 4.7's literal instruction-following, and the de-anchored prompt that came out of it is the foundation for #6.

What changed

perception/_base.py:

DEFAULT_MODEL = "claude-opus-4-6" → "claude-opus-4-7"
Drop temperature=temperature from both messages.create calls (4.7 returns 400 on temperature/top_p/top_k). Kept the kwarg in signatures so ensemble.py still imports; now a no-op.

perception/scientific.py — de-anchor for 4.7's literal instruction-following:

Rule 2: "Default to the same stage as previous" → "Use the previous observation as context, but classify based on the morphology in THIS image"
Rule 3 ("When in doubt, choose the EARLIER stage"): deleted
User-turn "most likely '{last_stage}' unless you see a clear morphological change" reinforcement block: deleted

Results (n=233 hard-stage frames)

Stage	scientific (4.6)	scientific (4.7, untuned)	scientific (4.7, de-anchored)	zslice_multi (4.7)
1.5fold (41)	65.9%	31.7%	80.5%	17.1%
2fold (59)	61.0%	50.8%	81.4%	54.2%
pretzel (133)	88.0%	43.6%	54.1%	42.9%
Exact	77.3%	43.3%	65.7%	41.2%
Adjacent	99.6%	93.6%	100%	96.1%

Why the raw bump collapsed scientific (77.3%→43.3%): 4.7 follows the prompt's three "default to previous / prefer earlier" instructions much more literally than 4.6. Confusion was systematically one stage backward (pretzel→2fold 75×, 1.5fold→comma 28×, 2fold→comma 15×), and because the runner feeds predicted history forward, one early under-call cascaded. De-anchoring recovered +22pp.

zslice_multi on 4.7: override fired 111×, 28 helpful / 83 harmful. max≥3 fires at 37% on GT=2fold vs 43% on GT=pretzel — still no separation. Segment-counting hypothesis falsified on both 4.6 and 4.7.

Also tried (not committed): a dark-gap 2fold/pretzel tiebreak — pretzel +3pp but 1.5fold −17pp / 2fold −8pp; reverted.

Takeaway

4.7 with the de-anchored prompt is the best result yet on 1.5fold/2fold (80%/81%), but pretzel regressed to 54% and the overall is −12pp vs 4.6. The 4.7 bump is not a net win at this commit — #6 closes that gap.

Verification

All 18 variants import cleanly on 4.7 (including ensemble which passes temperature= — now silently ignored)
scientific and zslice_multi run end-to-end on 233 frames; result JSONs + chart committed
Confusion matrix and override attribution computed from saved predictions

…truction-following scientific 65.7% (4.6: 77.3%). 1.5fold 80.5% / 2fold 81.4% are best-ever, but pretzel regressed 88%->54% (4.7 reads pretzel projections as "moderate fill" and calls 2fold). Net -12pp vs 4.6 until pretzel is fixed. Prompt changes (4.7 follows anchoring instructions much more literally, causing systematic one-stage-backward bias and cascade lock-in via predicted history): - Rule 2: "default to previous stage" -> "use as context, classify this image" - Rule 3 ("when in doubt, prefer earlier"): deleted - User-turn "most likely '{last_stage}'" reinforcement block: deleted _base.py: drop temperature from messages.create (4.7 returns 400 on sampling params); kept as no-op kwarg for ensemble.py back-compat. zslice_multi on 4.7: 41.2%, override 28 helpful / 83 harmful. max>=3 fires 37% on 2fold vs 43% on pretzel -- still no separation. Segment-counting hypothesis falsified on both 4.6 and 4.7. Also tried (not committed): dark-gap 2fold/pretzel tiebreak. Helped pretzel +3pp but cost 1.5fold -17pp / 2fold -8pp; reverted.

trisha-ant mentioned this pull request Apr 28, 2026

Opus 4.7 experiment loop + 4.6-vs-4.7 head-to-head: no significant overall difference; 4.7 wins 2fold #6

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bump to claude-opus-4-7 + de-anchor scientific prompt#7

Bump to claude-opus-4-7 + de-anchor scientific prompt#7
trisha-ant wants to merge 1 commit into
trisha/gently-zslice-multifrom
trisha/gently-opus47-bump

trisha-ant commented Apr 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

trisha-ant commented Apr 28, 2026

Why

What changed

Results (n=233 hard-stage frames)

Takeaway

Verification

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant