prompts: structural triggers for over-block + aggregate-budget recall #5
Two bug classes that pass per-line inspection because each individual guard behaves correctly, yet are real defects:
- TRIGGER A (over-block): a changed gating predicate that rejects a legitimately-approvable action it should not govern. The trap is that "guard rejects the value" looks like correct behavior. Requires enumerating every call-site of the gated action and deciding, per distinct action, whether the predicate legitimately governs it.
- TRIGGER B (aggregate budget): a loop/repeated-await in a request handler whose worst-case total (per-iteration latency × max iterations) exceeds the caller's timeout, especially after state was persisted. Requires recording the caller-timeout arithmetic with source lines.
Both are framed as required-output-when-triggered (blank when triggered => blocks), mirroring the existing sentinel-value verdict gate, not as advisory prose. Cross-referenced from process steps 3 and 6.
Probed on two prior green-on-pass PRs (gpt-5.5, N=1 each, small path): both flipped to changes_requested with the bug found and framed correctly, plus 3 extra findings verified real against the code. Precision on clean PRs is NOT yet measured; the negative control is still TODO.
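The "blank when triggered => blocks" rule can be sketched as a small check over the review output. This is a minimal illustration only: the type names, field names, and functions below are hypothetical and are not Needlefish's actual schema.

```typescript
// Hypothetical shapes for illustration; not the real Needlefish output schema.
type Trigger = "A_over_block" | "B_aggregate_budget";

interface ResidualRisk {
  note: string;
  blocks: boolean;
}

interface ReviewOutput {
  // Enumeration lines recorded per trigger; missing or empty means "left blank".
  enumerations: Partial<Record<Trigger, string[]>>;
  residualRisks: ResidualRisk[];
}

// For every trigger that matched the diff, require a non-empty enumeration.
// A blank one is recorded as a residual risk with blocks=true.
function enforceTriggeredOutput(matched: Trigger[], out: ReviewOutput): ReviewOutput {
  const blank = matched.filter((t) => (out.enumerations[t] ?? []).length === 0);
  const added: ResidualRisk[] = blank.map((t) => ({
    note: `trigger ${t} matched but its enumeration is blank`,
    blocks: true,
  }));
  return { ...out, residualRisks: [...out.residualRisks, ...added] };
}

// A review may conclude pass only if no residual risk blocks.
function canConcludePass(out: ReviewOutput): boolean {
  return !out.residualRisks.some((r) => r.blocks);
}
```

The point of the design is that an omitted enumeration is treated as missing evidence rather than as a finding, which matches how the sentinel-value gate is described.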
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3f2690f7db
| # Required output when triggered (structural — not optional) | ||
| Two bug classes look "fine" under per-line inspection because each individual guard behaves correctly. They are real defects anyway. When EITHER trigger matches the diff, you MUST produce the enumeration below; a triggered section left blank or absent means you cannot conclude pass — put it under residual_risks with blocks=true. |
Apply trigger checks to large-review prompt
When a review is classified as large (src/core/review.ts:30-31), Needlefish never loads prompts/review.md; it maps hotspots and runs prompts/deep.md instead (src/core/review.ts:119). Adding these required over-block/aggregate-budget checks only here means any PR over 30k patch chars or 10 changed files still skips the new enumeration and can miss the exact bug classes this change is meant to make blocking. Mirror the required-output trigger logic into deep.md or otherwise ensure the large path receives it.
Needlefish PR Review
Verdict: ⛔ changes_requested
Base: 929297e → Head: 3f2690f
CHANGES_REQUESTED: the structural trigger requirements are only added to the small-PR prompt, while the large-PR deep-review path still lacks them.
Findings
P2: Mirror the structural triggers into the deep-review prompt
prompts/review.md:16-31
Why this breaks: This PR adds mandatory TRIGGER A/B review requirements only to prompts/review.md. Large reviews bypass that prompt: src/core/review.ts sends bundles over 30000 patch chars or over 10 files through reviewLarge(), which loads prompts/deep.md. Because prompts/deep.md still lacks the over-block and aggregate-budget requirements, large PRs can still return PASS without checking the same bug classes this change now requires for small PRs.
Suggested fix: Add the same triggered-output requirements and step cross-references to prompts/deep.md, adapted to the hotspot/surface wording.
Validation: Inspect src/core/review.ts to confirm reviewLarge() loads prompts/deep.md for patch length >30000 or changed file count >10, then verify prompts/deep.md contains TRIGGER A and TRIGGER B.
Checked
- Read the changed diff for prompts/review.md.
- Verified reviewSmall loads prompts/review.md at src/core/review.ts:52.
- Verified reviewLarge loads prompts/deep.md at src/core/review.ts:119.
- Verified isLarge selects reviewLarge when patch length is greater than 30000 or changed file count is greater than 10 at src/core/review.ts:30.
- Verified prompts/deep.md lacks TRIGGER A / TRIGGER B / over-block / aggregate-budget language.
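The routing that the checks above verify can be sketched as follows. The thresholds and prompt paths come from this review; the bundle type, the constant names, and the function signatures are assumptions made for illustration and may not match what src/core/review.ts actually defines.

```typescript
// Hypothetical bundle shape; the real type in src/core/review.ts may differ.
interface ReviewBundle {
  patch: string;
  changedFiles: string[];
}

// Thresholds as stated in the review: over 30000 patch chars or over 10 files.
const MAX_SMALL_PATCH_CHARS = 30_000;
const MAX_SMALL_CHANGED_FILES = 10;

// Strictly greater-than on both limits, per "greater than 30000 / greater than 10".
function isLarge(bundle: ReviewBundle): boolean {
  return (
    bundle.patch.length > MAX_SMALL_PATCH_CHARS ||
    bundle.changedFiles.length > MAX_SMALL_CHANGED_FILES
  );
}

// Small reviews load prompts/review.md; large reviews load prompts/deep.md.
function promptPathFor(bundle: ReviewBundle): string {
  return isLarge(bundle) ? "prompts/deep.md" : "prompts/review.md";
}
```

Under this sketch, a requirement added only to prompts/review.md is never seen by any bundle for which `isLarge` returns true, which is the gap the finding describes.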
Residual Risk
- Did not execute Needlefish end-to-end because this review task is read-only and the defect is directly verifiable from prompt routing and prompt contents.
| - Verdict gate on new sentinel values: you may return PASS on a diff that adds or changes a nil/empty/zero/default value ONLY if, for each named downstream predicate that reads it, your output records the guard expression + file:line + "passes" or "rejects". If you did not open a guard for such a value, you cannot conclude pass — list it under residual_risks with blocks=true (never as a finding: a finding requires a verified rejection on a live path). | ||
| - If evidence is insufficient to verify something material, put it in residual_risks. Set blocks=true ONLY when the gap actually prevents a verdict. | ||
|
|
||
| # Required output when triggered (structural — not optional) |
P2 (validation): Mirror the structural triggers into the deep-review prompt
This PR adds mandatory TRIGGER A/B review requirements only to prompts/review.md. Large reviews bypass that prompt: src/core/review.ts sends bundles over 30000 patch chars or over 10 files through reviewLarge(), which loads prompts/deep.md. Because prompts/deep.md still lacks the over-block and aggregate-budget requirements, large PRs can still return PASS without checking the same bug classes this change now requires for small PRs.
Suggested fix: Add the same triggered-output requirements and step cross-references to prompts/deep.md, adapted to the hotspot/surface wording.
Addresses review on #5 (codex bot + needlefish self-review, same finding): the triggers were only added to review.md (small path). Large PRs route through reviewLarge() -> prompts/deep.md, so they bypassed the new required-output triggers entirely. Mirrors TRIGGER A (over-block) and TRIGGER B (aggregate budget) into deep.md, adapted to its hotspot/edge/surface vocabulary, with the cross-surface note that a gated call-site or caller-timeout may live in a different surface than the changed predicate/loop — trace the edge and cite it, don't skip. Cross-referenced from deep.md process steps 2 and 3.
Needlefish PR Review
Verdict: ✅ pass
Base: 929297e → Head: 20f046f
PASS: prompt and ignore-list changes add the new structural checks to both small and large review paths, with no concrete merge-blocking defect found.
Findings
No actionable findings. Prefer this over padding weak ones.
Checked
- Read AGENTS policy from /home/termtek/.codex/RTK.md; no repository AGENTS.md exists.
- Reviewed full diff for .gitignore, prompts/review.md, and prompts/deep.md.
- Verified src/core/review.ts loads prompts/review.md for small reviews and prompts/deep.md for large hotspot reviews.
- Searched for existing over-block and aggregate-budget trigger behavior; no duplicate implementation found.
- Ran git diff --check for the PR range; no whitespace errors reported.
Negative-control result (precision check)
Ran the fixed
Key points
Honest scope
Net: recall up on #105/#106, precision intact on 3 negatives. Precision |
Stability re-runs (N=3, de-risking the earlier N=1)
Re-ran all 5 PRs × 3 draws each (15 runs) with the merged
Recall: stable
Both positive classes are found in all 3 runs, each framed as the correct
Precision: stable, zero spurious
Across 9 negative-control runs (3 PRs × 3), 0 spurious findings. Every
One honest weak spot: #104 over-block detection rate
The over-block gate finding on #104 (the generalizing finding, same class as
Net
N=3 is still small, but it's a meaningfully stronger signal than N=1 and the |
Problem
Two real bug classes pass per-line inspection because each individual guard
behaves correctly — yet the whole is a defect. On two prior PRs that needlefish
passed green, an independent review found real P2s of exactly these shapes.
Root cause (probed, not assumed)
Ran the unchanged review.md on both PRs and captured the model's trace. Both
were framing failures, not instruction failures: the prompt already said to
trace guards and check timeouts, and the model did the literal thing but asked
the wrong question:
guard, saw it reject, and recorded that as correct behavior ("blocker
predicate passes"). It never asked whether the rejected action was
legitimately approvable on its own. A guard that rejects looks like success.
skipped-accounting (all correct) but never summed worst-case loop latency
against the caller's timeout. "Runtime" was read as per-call.
More prose would not fix this — the instructions already covered it. The fix
had to be structural: force the enumeration.
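What "force the enumeration" means for the over-block case can be sketched as a per-call-site record plus a filter. This is an illustrative sketch only: the record fields, the function name, and the example actions and file locations are hypothetical, not taken from the PRs that were probed.

```typescript
// One row per distinct action that reaches the changed gating predicate.
// All field names are hypothetical, chosen for illustration.
interface GatedCallSite {
  action: string; // distinct action reaching the gate
  location: string; // file:line of the call-site
  predicateRejects: boolean; // does the changed predicate reject this action?
  legitimatelyGoverned: boolean; // reviewer's decision: should the predicate govern it?
}

// Over-block candidates: the guard rejects an action it should not govern.
// "The guard rejects" alone is not evidence of correct behavior.
function overBlocked(sites: GatedCallSite[]): GatedCallSite[] {
  return sites.filter((s) => s.predicateRejects && !s.legitimatelyGoverned);
}

// Made-up example enumeration with two call-sites of the same gate.
const exampleSites: GatedCallSite[] = [
  { action: "approve-draft", location: "a.ts:10", predicateRejects: true, legitimatelyGoverned: true },
  { action: "approve-final", location: "b.ts:42", predicateRejects: true, legitimatelyGoverned: false },
];
const exampleFindings = overBlocked(exampleSites); // only the b.ts:42 row remains
```

The second column of judgment, `legitimatelyGoverned`, is the question the traced runs never asked; recording it per action is what makes the rejection on b.ts:42 visible as a defect rather than a success.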
Change
A new "Required output when triggered" section in prompts/review.md with two
triggers, framed as required-output-when-triggered (blank when triggered ⇒
residual_risks blocks=true), mirroring the existing sentinel-value verdict
gate, not advisory prose. Cross-referenced from process steps 3 and 6.
approve/submit/route/transition, enumerate every call-site of the gated
action and, per distinct action, decide whether the predicate legitimately
governs it. Rejecting a legitimately-approvable action = P2.
a request handler, record the caller-timeout arithmetic (the caller timeout,
compared against per-iteration worst-case × max iterations) with source lines;
flag if that product exceeds the timeout, especially after state was persisted.
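The aggregate-budget arithmetic can be sketched as below. The interface, the function names, and the numbers in the example are illustrative assumptions; they are not values from the PRs that were probed.

```typescript
// Inputs the reviewer is asked to record, each with a source line in practice.
// Field names are hypothetical.
interface LoopBudget {
  callerTimeoutMs: number; // timeout enforced by the caller of the handler
  perIterationWorstCaseMs: number; // worst-case latency of one loop iteration
  maxIterations: number; // upper bound on loop iterations
}

// Worst-case total time spent in the loop: per-iteration worst case × max iterations.
function worstCaseTotalMs(b: LoopBudget): number {
  return b.perIterationWorstCaseMs * b.maxIterations;
}

// Flag when the aggregate exceeds what the caller will wait for.
function exceedsCallerTimeout(b: LoopBudget): boolean {
  return worstCaseTotalMs(b) > b.callerTimeoutMs;
}

// Illustrative numbers: each call fits within the timeout, the loop does not.
const exampleBudget: LoopBudget = {
  callerTimeoutMs: 30_000,
  perIterationWorstCaseMs: 5_000,
  maxIterations: 10,
};
const exampleTotal = worstCaseTotalMs(exampleBudget); // 5000 × 10 = 50000 ms
const exampleFlag = exceedsCallerTimeout(exampleBudget); // 50000 > 30000, so true
```

In this example every individual await stays well under the caller's 30 s timeout, so a per-call reading of "runtime" sees nothing wrong; only the summed worst case shows the handler can outlive its caller.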
No repo-specific vocabulary — both triggers are stated in general behavioral
terms (per AGENTS.local.md).
Probe results (gpt-5.5, N=1 each, small path; directional)
3 additional findings surfaced across the two runs; all 3 verified real
against the code (a regex /scan/ over-match; a snapshot-column vs canonical
event-state skip). No false positives among extras to suppress.
Not measured (honest)
is the landmine: the verdict is unforgiving (any P2 ⇒ changes_requested),
so a spurious trigger firing on a clean PR flips it to a false block. Top
follow-up before considering this mergeable as default behavior.
deep.md) not touched: the same gap likely exists there; mirror pending.
Follow-ups
deep.md (large path)
Reviewed by a 2-advisor council (gpt-5.5, glm-5.2); both recommended
structural-not-prose. Eval rigor downgraded to lean-gate for v0.1 by the owner.