Skip to content

prompts: structural triggers for over-block + aggregate-budget recall#5

Merged
frankekn merged 2 commits into
mainfrom
feat/recall-overblock-budget
Jun 28, 2026
Merged

prompts: structural triggers for over-block + aggregate-budget recall#5
frankekn merged 2 commits into
mainfrom
feat/recall-overblock-budget

Conversation

@frankekn

Copy link
Copy Markdown
Owner

Problem

Two real bug classes pass per-line inspection because each individual guard
behaves correctly — yet the whole is a defect. On two prior PRs that needlefish
passed green, an independent review found real P2s of exactly these shapes.

Root cause (probed, not assumed)

Ran the unchanged review.md on both PRs and captured the model's trace. Both
were framing failures, not instruction failures — the prompt already said to
trace guards and check timeouts, and the model did the literal thing but asked
the wrong question:

  • Over-block (#105): the model substituted the changed predicate into a
    guard, saw it reject, and recorded that as correct behavior ("blocker
    predicate passes"). It never asked whether the rejected action was
    legitimately approvable on its own. A guard that rejects looks like success.
  • Aggregate budget (#106): the model verified per-iteration idempotency and
    skipped-accounting (all correct) but never summed worst-case loop latency
    against the caller's timeout. "Runtime" was read as per-call.

More prose would not fix this — the instructions already covered it. The fix
had to be structural: force the enumeration.

Change

A new "Required output when triggered" section in prompts/review.md with two
triggers, framed as required-output-when-triggered (blank when triggered ⇒
residual_risks blocks=true), mirroring the existing sentinel-value verdict
gate — not advisory prose. Cross-referenced from process steps 3 and 6.

  • TRIGGER A (over-block): when the diff changes a predicate gating an
    approve/submit/route/transition, enumerate every call-site of the gated
    action and, per distinct action, decide whether the predicate legitimately
    governs it. Rejecting a legitimately-approvable action = P2.
  • TRIGGER B (aggregate budget): when the diff adds a loop/repeated-await in
    a request handler, record the caller-timeout arithmetic (caller timeout ×
    per-iteration worst-case × max iterations) with source lines; flag if the
    product exceeds the timeout, especially after state was persisted.

No repo-specific vocabulary — both triggers are stated in general behavioral
terms (per AGENTS.local.md).

Probe results (gpt-5.5, N=1 each, small path — directional)

PR Before After
over-block case PASS found, framed correctly
budget case PASS found, framed correctly

3 additional findings surfaced across the two runs; all 3 verified real
against the code (a regex /scan/ over-match; a snapshot-column vs canonical
event-state skip). No false positives among extras to suppress.

Not measured (honest)

  • Precision on clean PRs is unmeasured — no negative control run yet. This
    is the landmine: the verdict is unforgiving (any P2 ⇒ changes_requested),
    so a spurious trigger firing on a clean PR flips it to a false block. Top
    follow-up before considering this mergeable as default behavior.
  • N=1 per case — one draw each; stability across runs not checked.
  • Large path (deep.md) not touched — same gap likely exists there; mirror
    pending.

Follow-ups

  • Clean-PR negative control (precision)
  • Mirror triggers into deep.md (large path)
  • Stability re-runs (N≥3)

Reviewed by a 2-advisor council (gpt-5.5, glm-5.2); both recommended
structural-not-prose. Eval rigor downgraded to lean-gate for v0.1 by the owner.

Two bug classes that pass per-line inspection because each individual
guard behaves correctly, yet are real defects:

- TRIGGER A (over-block): a changed gating predicate that rejects a
  legitimately-approvable action it should not govern. The trap is that
  "guard rejects the value" looks like correct behavior. Requires
  enumerating every call-site of the gated action and deciding, per
  distinct action, whether the predicate legitimately governs it.

- TRIGGER B (aggregate budget): a loop/repeated-await in a request
  handler whose worst-case total (per-iter latency x max iterations)
  exceeds the caller's timeout, especially after state was persisted.
  Requires recording the caller-timeout arithmetic with source lines.

Both are framed as required-output-when-triggered (blank when triggered
=> blocks), mirroring the existing sentinel-value verdict gate, not as
advisory prose. Cross-referenced from process steps 3 and 6.

Probed on two prior green-on-pass PRs (gpt-5.5, N=1 each, small path):
both flipped to changes_requested with the bug found and framed
correctly, plus 3 extra findings verified real against the code.
Precision on clean PRs is NOT yet measured — negative control TODO.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3f2690f7db

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread prompts/review.md
Comment on lines +16 to +17
# Required output when triggered (structural — not optional)
Two bug classes look "fine" under per-line inspection because each individual guard behaves correctly. They are real defects anyway. When EITHER trigger matches the diff, you MUST produce the enumeration below; a triggered section left blank or absent means you cannot conclude pass — put it under residual_risks with blocks=true.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Apply trigger checks to large-review prompt

When a review is classified as large (src/core/review.ts:30-31), Needlefish never loads prompts/review.md; it maps hotspots and runs prompts/deep.md instead (src/core/review.ts:119). Adding these required over-block/aggregate-budget checks only here means any PR over 30k patch chars or 10 changed files still skips the new enumeration and can miss the exact bug classes this change is meant to make blocking. Mirror the required-output trigger logic into deep.md or otherwise ensure the large path receives it.

Useful? React with 👍 / 👎.

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needlefish PR Review

Verdict: ⛔ changes_requested
Base: 929297eHead: 3f2690f

CHANGES_REQUESTED: the structural trigger requirements are only added to the small-PR prompt, while the large-PR deep-review path still lacks them.

Findings

P2: Mirror the structural triggers into the deep-review prompt

prompts/review.md:16-31

Why this breaks: This PR adds mandatory TRIGGER A/B review requirements only to prompts/review.md. Large reviews bypass that prompt: src/core/review.ts sends bundles over 30000 patch chars or over 10 files through reviewLarge(), which loads prompts/deep.md. Because prompts/deep.md still lacks the over-block and aggregate-budget requirements, large PRs can still return PASS without checking the same bug classes this change now requires for small PRs.

Suggested fix: Add the same triggered-output requirements and step cross-references to prompts/deep.md, adapted to the hotspot/surface wording.

Validation: Inspect src/core/review.ts to confirm reviewLarge() loads prompts/deep.md for patch length >30000 or changed file count >10, then verify prompts/deep.md contains TRIGGER A and TRIGGER B.

Checked

  • Read the changed diff for prompts/review.md.
  • Verified reviewSmall loads prompts/review.md at src/core/review.ts:52.
  • Verified reviewLarge loads prompts/deep.md at src/core/review.ts:119.
  • Verified isLarge selects reviewLarge when patch length is greater than 30000 or changed file count is greater than 10 at src/core/review.ts:30.
  • Verified prompts/deep.md lacks TRIGGER A / TRIGGER B / over-block / aggregate-budget language.

Residual Risk

  • Did not execute Needlefish end-to-end because this review task is read-only and the defect is directly verifiable from prompt routing and prompt contents.

Comment thread prompts/review.md
- Verdict gate on new sentinel values: you may return PASS on a diff that adds or changes a nil/empty/zero/default value ONLY if, for each named downstream predicate that reads it, your output records the guard expression + file:line + "passes" or "rejects". If you did not open a guard for such a value, you cannot conclude pass — list it under residual_risks with blocks=true (never as a finding: a finding requires a verified rejection on a live path).
- If evidence is insufficient to verify something material, put it in residual_risks. Set blocks=true ONLY when the gap actually prevents a verdict.

# Required output when triggered (structural — not optional)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 (validation): Mirror the structural triggers into the deep-review prompt

This PR adds mandatory TRIGGER A/B review requirements only to prompts/review.md. Large reviews bypass that prompt: src/core/review.ts sends bundles over 30000 patch chars or over 10 files through reviewLarge(), which loads prompts/deep.md. Because prompts/deep.md still lacks the over-block and aggregate-budget requirements, large PRs can still return PASS without checking the same bug classes this change now requires for small PRs.

Suggested fix: Add the same triggered-output requirements and step cross-references to prompts/deep.md, adapted to the hotspot/surface wording.

Addresses review on #5 (codex bot + needlefish self-review, same finding):
the triggers were only added to review.md (small path). Large PRs route
through reviewLarge() -> prompts/deep.md, so they bypassed the new
required-output triggers entirely.

Mirrors TRIGGER A (over-block) and TRIGGER B (aggregate budget) into
deep.md, adapted to its hotspot/edge/surface vocabulary, with the
cross-surface note that a gated call-site or caller-timeout may live in
a different surface than the changed predicate/loop — trace the edge and
cite it, don't skip. Cross-referenced from deep.md process steps 2 and 3.

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needlefish PR Review

Verdict: ✅ pass
Base: 929297eHead: 20f046f

PASS: prompt and ignore-list changes add the new structural checks to both small and large review paths, with no concrete merge-blocking defect found.

Findings

No actionable findings. Prefer this over padding weak ones.

Checked

  • Read AGENTS policy from /home/termtek/.codex/RTK.md; no repository AGENTS.md exists.
  • Reviewed full diff for .gitignore, prompts/review.md, and prompts/deep.md.
  • Verified src/core/review.ts loads prompts/review.md for small reviews and prompts/deep.md for large hotspot reviews.
  • Searched for existing over-block and aggregate-budget trigger behavior; no duplicate implementation found.
  • Ran git diff --check for the PR range; no whitespace errors reported.

@frankekn

Copy link
Copy Markdown
Owner Author

Negative-control result (precision check)

Ran the fixed review.md (small path) on 3 merged & shipped-clean PRs that
match the trigger shapes (gating predicate / loop / control flow). Ground truth:
merged = best available "no real merge-blocking P2" signal.

PR shape verdict findings spurious?
#100 (tolerate missing repairable flag) flag/predicate PASS 0 ✅ trigger fired, correctly cleared
#102 (honest error states) helper signature FAIL 1× P1 real — diff removes request: Request param from proxyWorkerPost
#104 (inline approve + cover image) gating + URL lifecycle FAIL both real

Key points

  • 0 spurious findings. Every finding maps to a real change in the diff. No
    clean PR flipped to a false changes_requested.
  • #100 is the ideal precision signal: TRIGGER A engaged (enumerated 5
    gating call-sites) and correctly concluded legitimately governed — the
    trigger did work without inventing a defect.
  • #104 finding 1 is the over-block class on a different file/PR than #105
    (run page, wpNotReady gating inline approve). The trigger generalized the
    class rather than memorizing the #105 instance — directly answers the
    over-fit concern from AGENTS.local.md.
  • #102/#104 findings are real recall, not regressions — the old prompt plausibly
    catches the signature change too; the object-URL and over-block ones are the
    new triggers' contribution.

Honest scope

  • N=1 per PR — stability across draws not measured (follow-up).
  • "Merged = clean" is approximate ground truth; a real bug could have slipped
    through merge. But 0 spurious across 3 trigger-shaped diffs is a reasonable
    precision signal for v0.1.

Net: recall up on #105/#106, precision intact on 3 negatives. Precision
follow-up checkbox below can be checked.

@frankekn frankekn merged commit 3a91462 into main Jun 28, 2026
2 checks passed
@frankekn

Copy link
Copy Markdown
Owner Author

Stability re-runs (N=3, de-risking the earlier N=1)

Re-ran all 5 PRs × 3 draws each (15 runs) with the merged review.md.
Per-PR outcome stability:

PR class runs stable?
#105 over-block (positive) 3/3 found ✅ all real class
#106 budget (positive) 3/3 found ✅ all real class
#100 flag/predicate (negative) 3/3 PASS, 0 findings ✅ perfectly stable
#102 signature (negative) 3/3 same real P1 ✅ stable
#104 gating + URL (negative) 3/3 real findings, count varies 1–3 ⚠️ see below

Recall — stable

Both positive classes are found in all 3 runs, each framed as the correct
bug class (over-block rejection / caller-timeout overrun). No false pass in any
draw.

Precision — stable, zero spurious

Across 9 negative-control runs (3 PRs × 3), 0 spurious findings. Every
finding maps to a real diff change. #100 is the cleanest signal: trigger
engages and correctly clears, 3/3.

One honest weak spot — #104 over-block detection rate

The over-block gate finding on #104 (the generalizing finding — same class as
#105 on a different file) appeared in 2/3 runs (missed in r2). The
object-URL finding was 3/3. So the over-block trigger detects that instance
~67% of the time on a single draw. Mitigation in production: needlefish
re-runs on every push, so repeated exposure recovers the miss over a PR's
lifetime; but a single-shot review can miss it ~1/3 of the time.

Net

  • Recall: stable on both target classes.
  • Precision: stable, no false positives across 9 negative runs.
  • The merged change behaves as intended; the residual is a ~67% single-shot
    detection rate on one over-block instance, which repeated-push re-review
    partly absorbs.

N=3 is still small, but it's a meaningfully stronger signal than N=1 and the
direction is consistent.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant