Bump to claude-opus-4-7 + de-anchor scientific prompt#7
Open
trisha-ant wants to merge 1 commit into
Open
Conversation
…truction-following
scientific 65.7% (4.6: 77.3%). 1.5fold 80.5% / 2fold 81.4% are best-ever,
but pretzel regressed 88%->54% (4.7 reads pretzel projections as "moderate
fill" and calls 2fold). Net -12pp vs 4.6 until pretzel is fixed.
Prompt changes (4.7 follows anchoring instructions much more literally,
causing systematic one-stage-backward bias and cascade lock-in via
predicted history):
- Rule 2: "default to previous stage" -> "use as context, classify this image"
- Rule 3 ("when in doubt, prefer earlier"): deleted
- User-turn "most likely '{last_stage}'" reinforcement block: deleted
_base.py: drop temperature from messages.create (4.7 returns 400 on
sampling params); kept as no-op kwarg for ensemble.py back-compat.
zslice_multi on 4.7: 41.2%, override 28 helpful / 83 harmful. max>=3
fires 37% on 2fold vs 43% on pretzel -- still no separation.
Segment-counting hypothesis falsified on both 4.6 and 4.7.
Also tried (not committed): dark-gap 2fold/pretzel tiebreak. Helped
pretzel +3pp but cost 1.5fold -17pp / 2fold -8pp; reverted.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Stacked on #5 (→ #1).
Why
#5 established that z-slice segment-counting has no signal on
claude-opus-4-6. The root cause — the VLM can't resolve 2-vs-3 blob structure in a z-slice — is exactly the kind of low-level perception task thatclaude-opus-4-7is documented to be better at: it ships with high-resolution vision support (2576px long edge vs 1568px on 4.6) and improved pointing/counting/localization. This PR bumps the model and re-testszslice_multito see if the better vision rescues segment-counting.It does not — but the migration itself surfaced a useful finding about 4.7's literal instruction-following, and the de-anchored prompt that came out of it is the foundation for #6.
What changed
perception/_base.py:DEFAULT_MODEL = "claude-opus-4-6"→"claude-opus-4-7"temperature=temperaturefrom bothmessages.createcalls (4.7 returns 400 ontemperature/top_p/top_k). Kept the kwarg in signatures soensemble.pystill imports; now a no-op.perception/scientific.py— de-anchor for 4.7's literal instruction-following:Results (n=233 hard-stage frames)
Why the raw bump collapsed scientific (77.3%→43.3%): 4.7 follows the prompt's three "default to previous / prefer earlier" instructions much more literally than 4.6. Confusion was systematically one stage backward (pretzel→2fold 75×, 1.5fold→comma 28×, 2fold→comma 15×), and because the runner feeds predicted history forward, one early under-call cascaded. De-anchoring recovered +22pp.
zslice_multi on 4.7: override fired 111×, 28 helpful / 83 harmful. max≥3 fires at 37% on GT=2fold vs 43% on GT=pretzel — still no separation. Segment-counting hypothesis falsified on both 4.6 and 4.7.
Also tried (not committed): a dark-gap 2fold/pretzel tiebreak — pretzel +3pp but 1.5fold −17pp / 2fold −8pp; reverted.
Takeaway
4.7 with the de-anchored prompt is the best result yet on 1.5fold/2fold (80%/81%), but pretzel regressed to 54% and the overall is −12pp vs 4.6. The 4.7 bump is not a net win at this commit — #6 closes that gap.
Verification
ensemblewhich passestemperature=— now silently ignored)scientificandzslice_multirun end-to-end on 233 frames; result JSONs + chart committed