Design loop: verify file references, require acceptance tests for methodology claims #633
Merged
…hodology claims

Third PR in the trio addressing bridge-analysis PR #438's failure modes. PR #631 hardened resolve. PR #632 hardened spec generation. This PR hardens the design loop: the earliest point in the pipeline where methodology claims and file references first get written down.

If the design hallucinates "the existing logic lives in module X" when it's actually in `notebooks/Y`, that error propagates: into the spec, into the implementation, eventually into a shipped stub. Catching it at the design stage is cheaper than catching it in resolve.

## DEFAULT_SYSTEM_PROMPT additions

- **Verify every file/function reference before citing it.** Use `grep` or `read_file`. If the canonical impl lives in a notebook, say so explicitly; don't pretend it's in a clean module.
- **Methodology claims need named acceptance tests.** Any "uses BT+EB" / "applies BH-FDR" claim in the design must come with a proposed acceptance test in the analysis (e.g., `test_leaderboard_matches_bt_eb_reference_within_tolerance` with a tolerance value). Without it, the methodology is a label not a contract, and the implementer agent can ship a stand-in that the existing test suite won't catch.

## Tests

4 new tests in `TestDesignPromptFidelityRules`. 676 total pass.

Same diagnostic synthesis as PR #631 and #632, derived from the bridge-analysis agent's postmortem + my analysis. The three together address the failure mode at every layer it could be caught: design generation (this PR), spec generation (#632), implementation (#631).
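To make the "named acceptance test" rule concrete, here is a minimal sketch of what such a test might look like. Only the test name comes from the rule above. The reference scores, the tolerance of 0.05, and the `compute_leaderboard` and `leaderboard_matches_reference` helpers are hypothetical placeholders, not code from this repository.

```python
# Hypothetical reference scores, assumed to come from a trusted BT+EB run.
REFERENCE_SCORES = {"model_a": 1.20, "model_b": 0.35, "model_c": -0.80}

# Hypothetical tolerance; the design is expected to state its own value.
TOLERANCE = 0.05


def compute_leaderboard():
    """Placeholder for the pipeline under test (assumed name)."""
    return {"model_a": 1.22, "model_b": 0.33, "model_c": -0.79}


def leaderboard_matches_reference(leaderboard, reference, tolerance):
    """Return True if both dicts have the same keys and every score
    is within `tolerance` of its reference value."""
    if leaderboard.keys() != reference.keys():
        return False
    return all(
        abs(leaderboard[name] - reference[name]) <= tolerance
        for name in reference
    )


def test_leaderboard_matches_bt_eb_reference_within_tolerance():
    # A stand-in implementation (for example, one that returns constant
    # scores) would fail here, which is the point of the rule.
    assert leaderboard_matches_reference(
        compute_leaderboard(), REFERENCE_SCORES, TOLERANCE
    )
```

Because the test pins the output to a reference within a stated tolerance, the methodology claim becomes something the test suite can reject, rather than a label in the design text.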
Third PR in the trio addressing bridge-analysis PR #438's failure modes.
## Why fix this here too
If the design hallucinates "the existing logic lives in module X" when it's actually in `notebooks/Y`, that error propagates: into the spec, into the implementation, eventually into a shipped stub. Catching it at the design stage is cheaper than catching it in resolve. The trio of PRs catches the failure mode at every layer it could be caught.
## DEFAULT_SYSTEM_PROMPT additions
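The PR asks the design agent to verify file references itself via the system prompt. As an illustration of the same idea, the sketch below is an automated check that flags backticked file paths in a design document that do not exist in the repository. The function name `find_unverified_references`, the backtick convention, and the `.py`/`.ipynb` filter are all assumptions made for this example and are not part of the PR.

```python
import re
from pathlib import Path

# Assumption: design docs cite files as backticked relative paths
# ending in .py or .ipynb, e.g. `notebooks/Y.ipynb`.
PATH_PATTERN = re.compile(r"`([\w./-]+\.(?:py|ipynb))`")


def find_unverified_references(design_text: str, repo_root: Path) -> list[str]:
    """Return the cited paths that do not exist as files under repo_root,
    sorted and de-duplicated."""
    cited = PATH_PATTERN.findall(design_text)
    return sorted({path for path in cited if not (repo_root / path).is_file()})
```

A design that claims the logic lives in a clean module while the only real file is a notebook would have the module path returned by this check, surfacing the mismatch before it reaches the spec.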
## Test plan
🤖 Generated with Claude Code