Design loop: verify file references, require acceptance tests for methodology claims by gnovak · Pull Request #633 · gnovak/remote-dev-bot

gnovak · 2026-05-31T04:44:02Z

Third PR in the trio addressing bridge-analysis PR #438's failure modes.

PR #631, "Resolve prompt: methodology fidelity, end-to-end verification, write tests", hardened resolve (methodology faithfulness, end-to-end verification, write tests for new code, softened anti-exploration language).
PR #632, "Spec generation: verify file refs, ban placeholders, require methodology tests", hardened spec generation (verify file refs, ban `...` placeholders, require named methodology tests).
This PR hardens the design loop, the earliest point where methodology claims and file references first get written down.

Why fix this here too

If the design hallucinates "the existing logic lives in module X" when it's actually in `notebooks/Y`, that error propagates into the spec, then into the implementation, and eventually into a shipped stub. Catching it at the design stage is cheaper than catching it in resolve. Together, the three PRs catch the failure mode at every layer where it could be caught.

DEFAULT_SYSTEM_PROMPT additions

Verify every file/function reference before citing it. Use grep or read_file. If the canonical impl lives in a notebook, say so explicitly — don't pretend it's in a clean module.
Methodology claims need named acceptance tests. Any "uses BT+EB" / "applies BH-FDR" claim in the design must come with a proposed acceptance test in the analysis (e.g., `test_leaderboard_matches_bt_eb_reference_within_tolerance` with a tolerance value). Without it, the methodology is a label, not a contract, and the implementer agent can ship a stand-in that the existing test suite won't catch.
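As an illustration of the second rule, here is a minimal sketch of what a named acceptance test for an "applies BH-FDR" claim could look like. It is not code from this repository: `bh_fdr_reference`, `matches_reference`, `bonferroni_stand_in`, and the test name are hypothetical, and the tolerance value is an arbitrary choice for the example. The reference is a plain Benjamini-Hochberg step-up adjustment; the point is that a plausible stand-in (here, Bonferroni) fails the comparison.

```python
import math


def bh_fdr_reference(pvalues):
    """Reference Benjamini-Hochberg adjusted p-values (step-up procedure)."""
    m = len(pvalues)
    # Indices of the p-values in ascending order of p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def matches_reference(candidate, pvalues, tolerance=1e-9):
    """True if candidate(pvalues) agrees with the BH reference within tolerance."""
    expected = bh_fdr_reference(pvalues)
    actual = candidate(pvalues)
    return len(actual) == len(expected) and all(
        math.isclose(a, e, abs_tol=tolerance) for a, e in zip(actual, expected)
    )


def bonferroni_stand_in(pvalues):
    """Hypothetical stand-in: looks like a correction, but is NOT BH-FDR."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def test_adjusted_pvalues_match_bh_fdr_reference_within_tolerance():
    pvalues = [0.01, 0.04, 0.03, 0.005]
    # The genuine method agrees with the reference; the stand-in must not.
    assert matches_reference(bh_fdr_reference, pvalues)
    assert not matches_reference(bonferroni_stand_in, pvalues)
```

In a real design, the candidate passed to `matches_reference` would be the pipeline's own implementation, and the tolerance would be stated in the design alongside the test name.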

Test plan

4 new tests in `TestDesignPromptFidelityRules`
676 unit tests pass in total
Empirical: next `/agent-design` or `/agent-delegate` should produce designs that name file paths precisely and include acceptance tests for methodology claims
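The empirical check on file paths could be partly automated. The sketch below is an assumption about how that might be done, not part of this PR: `PATH_PATTERN` and `find_unverified_paths` are hypothetical names, and the convention that file references appear in backticks with one of a few known suffixes is invented for the example.

```python
import re
from pathlib import Path

# Hypothetical convention: file references are backticked and end in a known suffix.
PATH_PATTERN = re.compile(r"`([\w./-]+\.(?:py|ipynb|md|yaml|yml|json))`")


def find_unverified_paths(design_text, repo_root):
    """Return cited paths that are not files under repo_root (in order, no duplicates)."""
    root = Path(repo_root)
    missing = []
    for cited in PATH_PATTERN.findall(design_text):
        if cited not in missing and not (root / cited).is_file():
            missing.append(cited)
    return missing
```

Running `find_unverified_paths(design_text, ".")` from the repository root returns an empty list when every cited path resolves to a real file. It only checks that files exist; confirming that a named function actually lives in the cited file would still need grep or a read of the file.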

🤖 Generated with Claude Code

…hodology claims

Third PR in the trio addressing bridge-analysis PR #438's failure modes. PR #631 hardened resolve. PR #632 hardened spec generation. This PR hardens the design loop, the earliest point in the pipeline where methodology claims and file references first get written down.

If the design hallucinates "the existing logic lives in module X" when it's actually in notebooks/Y, that error propagates: into the spec, into the implementation, eventually into a shipped stub. Catching it at the design stage is cheaper than catching it in resolve.

## DEFAULT_SYSTEM_PROMPT additions

- **Verify every file/function reference before citing it.** Use grep or read_file. If the canonical impl lives in a notebook, say so explicitly; don't pretend it's in a clean module.
- **Methodology claims need named acceptance tests.** Any "uses BT+EB" / "applies BH-FDR" claim in the design must come with a proposed acceptance test in the analysis (e.g., test_leaderboard_matches_bt_eb_reference_within_tolerance with a tolerance value). Without it, the methodology is a label not a contract, and the implementer agent can ship a stand-in that the existing test suite won't catch.

## Tests

4 new tests in TestDesignPromptFidelityRules. 676 total pass.

Same diagnostic synthesis as PR #631 and #632, derived from the bridge-analysis agent's postmortem + my analysis. The three together address the failure mode at every layer it could be caught: design generation (this PR), spec generation (#632), implementation (#631).

gnovak merged commit de72856 into dev Jun 3, 2026

gnovak deleted the design-loop-verify-references branch June 13, 2026 03:43

gnovak merged 1 commit into dev from design-loop-verify-references
