Skip to content

Bump to claude-opus-4-7 + de-anchor scientific prompt#7

Open
trisha-ant wants to merge 1 commit into
trisha/gently-zslice-multifrom
trisha/gently-opus47-bump
Open

Bump to claude-opus-4-7 + de-anchor scientific prompt#7
trisha-ant wants to merge 1 commit into
trisha/gently-zslice-multifrom
trisha/gently-opus47-bump

Conversation

@trisha-ant

Copy link
Copy Markdown
Owner

Stacked on #5 (→ #1).

Why

#5 established that z-slice segment-counting has no signal on claude-opus-4-6. The root cause — the VLM can't resolve 2-vs-3 blob structure in a z-slice — is exactly the kind of low-level perception task that claude-opus-4-7 is documented to be better at: it ships with high-resolution vision support (2576px long edge vs 1568px on 4.6) and improved pointing/counting/localization. This PR bumps the model and re-tests zslice_multi to see if the better vision rescues segment-counting.

It does not — but the migration itself surfaced a useful finding about 4.7's literal instruction-following, and the de-anchored prompt that came out of it is the foundation for #6.

What changed

perception/_base.py:

  • DEFAULT_MODEL = "claude-opus-4-6""claude-opus-4-7"
  • Drop temperature=temperature from both messages.create calls (4.7 returns 400 on temperature/top_p/top_k). Kept the kwarg in signatures so ensemble.py still imports; now a no-op.

perception/scientific.py — de-anchor for 4.7's literal instruction-following:

  • Rule 2: "Default to the same stage as previous" → "Use the previous observation as context, but classify based on the morphology in THIS image"
  • Rule 3 ("When in doubt, choose the EARLIER stage"): deleted
  • User-turn "most likely '{last_stage}' unless you see a clear morphological change" reinforcement block: deleted

Results (n=233 hard-stage frames)

Stage scientific (4.6) scientific (4.7, untuned) scientific (4.7, de-anchored) zslice_multi (4.7)
1.5fold (41) 65.9% 31.7% 80.5% 17.1%
2fold (59) 61.0% 50.8% 81.4% 54.2%
pretzel (133) 88.0% 43.6% 54.1% 42.9%
Exact 77.3% 43.3% 65.7% 41.2%
Adjacent 99.6% 93.6% 100% 96.1%

Why the raw bump collapsed scientific (77.3%→43.3%): 4.7 follows the prompt's three "default to previous / prefer earlier" instructions much more literally than 4.6. Confusion was systematically one stage backward (pretzel→2fold 75×, 1.5fold→comma 28×, 2fold→comma 15×), and because the runner feeds predicted history forward, one early under-call cascaded. De-anchoring recovered +22pp.

zslice_multi on 4.7: override fired 111×, 28 helpful / 83 harmful. max≥3 fires at 37% on GT=2fold vs 43% on GT=pretzel — still no separation. Segment-counting hypothesis falsified on both 4.6 and 4.7.

Also tried (not committed): a dark-gap 2fold/pretzel tiebreak — pretzel +3pp but 1.5fold −17pp / 2fold −8pp; reverted.

Takeaway

4.7 with the de-anchored prompt is the best result yet on 1.5fold/2fold (80%/81%), but pretzel regressed to 54% and the overall is −12pp vs 4.6. The 4.7 bump is not a net win at this commit — #6 closes that gap.

Verification

  • All 18 variants import cleanly on 4.7 (including ensemble which passes temperature= — now silently ignored)
  • scientific and zslice_multi run end-to-end on 233 frames; result JSONs + chart committed
  • Confusion matrix and override attribution computed from saved predictions

…truction-following

scientific 65.7% (4.6: 77.3%). 1.5fold 80.5% / 2fold 81.4% are best-ever,
but pretzel regressed 88%->54% (4.7 reads pretzel projections as "moderate
fill" and calls 2fold). Net -12pp vs 4.6 until pretzel is fixed.

Prompt changes (4.7 follows anchoring instructions much more literally,
causing systematic one-stage-backward bias and cascade lock-in via
predicted history):
- Rule 2: "default to previous stage" -> "use as context, classify this image"
- Rule 3 ("when in doubt, prefer earlier"): deleted
- User-turn "most likely '{last_stage}'" reinforcement block: deleted

_base.py: drop temperature from messages.create (4.7 returns 400 on
sampling params); kept as no-op kwarg for ensemble.py back-compat.

zslice_multi on 4.7: 41.2%, override 28 helpful / 83 harmful. max>=3
fires 37% on 2fold vs 43% on pretzel -- still no separation.
Segment-counting hypothesis falsified on both 4.6 and 4.7.

Also tried (not committed): dark-gap 2fold/pretzel tiebreak. Helped
pretzel +3pp but cost 1.5fold -17pp / 2fold -8pp; reverted.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant