prompts: structural triggers for over-block + aggregate-budget recall by frankekn · Pull Request #5 · frankekn/needlefish

frankekn · 2026-06-28T05:55:11Z

Problem

Two real bug classes pass per-line inspection because each individual guard
behaves correctly — yet the whole is a defect. On two prior PRs that needlefish
passed green, an independent review found real P2s of exactly these shapes.

Root cause (probed, not assumed)

Ran the unchanged review.md on both PRs and captured the model's trace. Both
were framing failures, not instruction failures — the prompt already said to
trace guards and check timeouts, and the model did the literal thing but asked
the wrong question:

Over-block (#105): the model substituted the changed predicate into a
guard, saw it reject, and recorded that as correct behavior ("blocker
predicate passes"). It never asked whether the rejected action was
legitimately approvable on its own. A guard that rejects looks like success.
Aggregate budget (#106): the model verified per-iteration idempotency and
skipped-accounting (all correct) but never summed worst-case loop latency
against the caller's timeout. "Runtime" was read as per-call.

More prose would not fix this — the instructions already covered it. The fix
had to be structural: force the enumeration.

Change

A new "Required output when triggered" section in prompts/review.md with two
triggers, framed as required-output-when-triggered (blank when triggered ⇒
residual_risks blocks=true), mirroring the existing sentinel-value verdict
gate — not advisory prose. Cross-referenced from process steps 3 and 6.

TRIGGER A (over-block): when the diff changes a predicate gating an
approve/submit/route/transition, enumerate every call-site of the gated
action and, per distinct action, decide whether the predicate legitimately
governs it. Rejecting a legitimately-approvable action = P2.
TRIGGER B (aggregate budget): when the diff adds a loop/repeated-await in
a request handler, record the caller-timeout arithmetic (caller timeout ×
per-iteration worst-case × max iterations) with source lines; flag if the
product exceeds the timeout, especially after state was persisted.

No repo-specific vocabulary — both triggers are stated in general behavioral
terms (per AGENTS.local.md).

Probe results (gpt-5.5, N=1 each, small path — directional)

PR	Before	After
over-block case	PASS	found, framed correctly
budget case	PASS	found, framed correctly

3 additional findings surfaced across the two runs; all 3 verified real
against the code (a regex /scan/ over-match; a snapshot-column vs canonical
event-state skip). No false positives among extras to suppress.

Not measured (honest)

Precision on clean PRs is unmeasured — no negative control run yet. This
is the landmine: the verdict is unforgiving (any P2 ⇒ changes_requested),
so a spurious trigger firing on a clean PR flips it to a false block. Top
follow-up before considering this mergeable as default behavior.
N=1 per case — one draw each; stability across runs not checked.
Large path (deep.md) not touched — same gap likely exists there; mirror
pending.

Follow-ups

Clean-PR negative control (precision)
Mirror triggers into deep.md (large path)
Stability re-runs (N≥3)

Reviewed by a 2-advisor council (gpt-5.5, glm-5.2); both recommended
structural-not-prose. Eval rigor downgraded to lean-gate for v0.1 by the owner.

Two bug classes that pass per-line inspection because each individual guard behaves correctly, yet are real defects: - TRIGGER A (over-block): a changed gating predicate that rejects a legitimately-approvable action it should not govern. The trap is that "guard rejects the value" looks like correct behavior. Requires enumerating every call-site of the gated action and deciding, per distinct action, whether the predicate legitimately governs it. - TRIGGER B (aggregate budget): a loop/repeated-await in a request handler whose worst-case total (per-iter latency x max iterations) exceeds the caller's timeout, especially after state was persisted. Requires recording the caller-timeout arithmetic with source lines. Both are framed as required-output-when-triggered (blank when triggered => blocks), mirroring the existing sentinel-value verdict gate, not as advisory prose. Cross-referenced from process steps 3 and 6. Probed on two prior green-on-pass PRs (gpt-5.5, N=1 each, small path): both flipped to changes_requested with the bug found and framed correctly, plus 3 extra findings verified real against the code. Precision on clean PRs is NOT yet measured — negative control TODO.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3f2690f7db

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-28T05:58:09Z

+# Required output when triggered (structural — not optional)
+Two bug classes look "fine" under per-line inspection because each individual guard behaves correctly. They are real defects anyway. When EITHER trigger matches the diff, you MUST produce the enumeration below; a triggered section left blank or absent means you cannot conclude pass — put it under residual_risks with blocks=true.


Apply trigger checks to large-review prompt

When a review is classified as large (src/core/review.ts:30-31), Needlefish never loads prompts/review.md; it maps hotspots and runs prompts/deep.md instead (src/core/review.ts:119). Adding these required over-block/aggregate-budget checks only here means any PR over 30k patch chars or 10 changed files still skips the new enumeration and can miss the exact bug classes this change is meant to make blocking. Mirror the required-output trigger logic into deep.md or otherwise ensure the large path receives it.

Useful? React with 👍 / 👎.

github-actions

Needlefish PR Review

Verdict: ⛔ changes_requested
Base: 929297e → Head: 3f2690f

CHANGES_REQUESTED: the structural trigger requirements are only added to the small-PR prompt, while the large-PR deep-review path still lacks them.

Findings

P2: Mirror the structural triggers into the deep-review prompt

prompts/review.md:16-31

Why this breaks: This PR adds mandatory TRIGGER A/B review requirements only to prompts/review.md. Large reviews bypass that prompt: src/core/review.ts sends bundles over 30000 patch chars or over 10 files through reviewLarge(), which loads prompts/deep.md. Because prompts/deep.md still lacks the over-block and aggregate-budget requirements, large PRs can still return PASS without checking the same bug classes this change now requires for small PRs.

Suggested fix: Add the same triggered-output requirements and step cross-references to prompts/deep.md, adapted to the hotspot/surface wording.

Validation: Inspect src/core/review.ts to confirm reviewLarge() loads prompts/deep.md for patch length >30000 or changed file count >10, then verify prompts/deep.md contains TRIGGER A and TRIGGER B.

Checked

Read the changed diff for prompts/review.md.
Verified reviewSmall loads prompts/review.md at src/core/review.ts:52.
Verified reviewLarge loads prompts/deep.md at src/core/review.ts:119.
Verified isLarge selects reviewLarge when patch length is greater than 30000 or changed file count is greater than 10 at src/core/review.ts:30.
Verified prompts/deep.md lacks TRIGGER A / TRIGGER B / over-block / aggregate-budget language.

Residual Risk

Did not execute Needlefish end-to-end because this review task is read-only and the defect is directly verifiable from prompt routing and prompt contents.

github-actions · 2026-06-28T05:58:36Z

 - Verdict gate on new sentinel values: you may return PASS on a diff that adds or changes a nil/empty/zero/default value ONLY if, for each named downstream predicate that reads it, your output records the guard expression + file:line + "passes" or "rejects". If you did not open a guard for such a value, you cannot conclude pass — list it under residual_risks with blocks=true (never as a finding: a finding requires a verified rejection on a live path).
 - If evidence is insufficient to verify something material, put it in residual_risks. Set blocks=true ONLY when the gap actually prevents a verdict.

+# Required output when triggered (structural — not optional)


P2 (validation): Mirror the structural triggers into the deep-review prompt

This PR adds mandatory TRIGGER A/B review requirements only to prompts/review.md. Large reviews bypass that prompt: src/core/review.ts sends bundles over 30000 patch chars or over 10 files through reviewLarge(), which loads prompts/deep.md. Because prompts/deep.md still lacks the over-block and aggregate-budget requirements, large PRs can still return PASS without checking the same bug classes this change now requires for small PRs.

Suggested fix: Add the same triggered-output requirements and step cross-references to prompts/deep.md, adapted to the hotspot/surface wording.

Addresses review on #5 (codex bot + needlefish self-review, same finding): the triggers were only added to review.md (small path). Large PRs route through reviewLarge() -> prompts/deep.md, so they bypassed the new required-output triggers entirely. Mirrors TRIGGER A (over-block) and TRIGGER B (aggregate budget) into deep.md, adapted to its hotspot/edge/surface vocabulary, with the cross-surface note that a gated call-site or caller-timeout may live in a different surface than the changed predicate/loop — trace the edge and cite it, don't skip. Cross-referenced from deep.md process steps 2 and 3.

github-actions

Needlefish PR Review

Verdict: ✅ pass
Base: 929297e → Head: 20f046f

PASS: prompt and ignore-list changes add the new structural checks to both small and large review paths, with no concrete merge-blocking defect found.

Findings

No actionable findings. Prefer this over padding weak ones.

Checked

Read AGENTS policy from /home/termtek/.codex/RTK.md; no repository AGENTS.md exists.
Reviewed full diff for .gitignore, prompts/review.md, and prompts/deep.md.
Verified src/core/review.ts loads prompts/review.md for small reviews and prompts/deep.md for large hotspot reviews.
Searched for existing over-block and aggregate-budget trigger behavior; no duplicate implementation found.
Ran git diff --check for the PR range; no whitespace errors reported.

frankekn · 2026-06-28T06:51:52Z

Negative-control result (precision check)

Ran the fixed review.md (small path) on 3 merged & shipped-clean PRs that
match the trigger shapes (gating predicate / loop / control flow). Ground truth:
merged = best available "no real merge-blocking P2" signal.

PR	shape	verdict	findings	spurious?
#100 (tolerate missing repairable flag)	flag/predicate	PASS	0	✅ trigger fired, correctly cleared
#102 (honest error states)	helper signature	FAIL	1× P1	real — diff removes `request: Request` param from `proxyWorkerPost`
#104 (inline approve + cover image)	gating + URL lifecycle	FAIL	2×	both real

Key points

0 spurious findings. Every finding maps to a real change in the diff. No
clean PR flipped to a false changes_requested.
#100 is the ideal precision signal: TRIGGER A engaged (enumerated 5
gating call-sites) and correctly concluded legitimately governed — the
trigger did work without inventing a defect.
#104 finding 1 is the over-block class on a different file/PR than #105
(run page, wpNotReady gating inline approve). The trigger generalized the
class rather than memorizing the #105 instance — directly answers the
over-fit concern from AGENTS.local.md.
#102/#104 findings are real recall, not regressions — the old prompt plausibly
catches the signature change too; the object-URL and over-block ones are the
new triggers' contribution.

Honest scope

N=1 per PR — stability across draws not measured (follow-up).
"Merged = clean" is approximate ground truth; a real bug could have slipped
through merge. But 0 spurious across 3 trigger-shaped diffs is a reasonable
precision signal for v0.1.

Net: recall up on #105/#106, precision intact on 3 negatives. Precision
follow-up checkbox below can be checked.

frankekn · 2026-06-28T08:08:44Z

Stability re-runs (N=3, de-risking the earlier N=1)

Re-ran all 5 PRs × 3 draws each (15 runs) with the merged review.md.
Per-PR outcome stability:

PR	class	runs	stable?
#105	over-block (positive)	3/3 found	✅ all real class
#106	budget (positive)	3/3 found	✅ all real class
#100	flag/predicate (negative)	3/3 PASS, 0 findings	✅ perfectly stable
#102	signature (negative)	3/3 same real P1	✅ stable
#104	gating + URL (negative)	3/3 real findings, count varies 1–3	⚠️ see below

Recall — stable

Both positive classes are found in all 3 runs, each framed as the correct
bug class (over-block rejection / caller-timeout overrun). No false pass in any
draw.

Precision — stable, zero spurious

Across 9 negative-control runs (3 PRs × 3), 0 spurious findings. Every
finding maps to a real diff change. #100 is the cleanest signal: trigger
engages and correctly clears, 3/3.

One honest weak spot — #104 over-block detection rate

The over-block gate finding on #104 (the generalizing finding — same class as
#105 on a different file) appeared in 2/3 runs (missed in r2). The
object-URL finding was 3/3. So the over-block trigger detects that instance
~67% of the time on a single draw. Mitigation in production: needlefish
re-runs on every push, so repeated exposure recovers the miss over a PR's
lifetime; but a single-shot review can miss it ~1/3 of the time.

Net

Recall: stable on both target classes.
Precision: stable, no false positives across 9 negative runs.
The merged change behaves as intended; the residual is a ~67% single-shot
detection rate on one over-block instance, which repeated-push re-review
partly absorbs.

N=3 is still small, but it's a meaningfully stronger signal than N=1 and the
direction is consistent.

chatgpt-codex-connector Bot reviewed Jun 28, 2026

View reviewed changes

github-actions Bot requested changes Jun 28, 2026

View reviewed changes

github-actions Bot reviewed Jun 28, 2026

View reviewed changes

frankekn merged commit 3a91462 into main Jun 28, 2026
2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

prompts: structural triggers for over-block + aggregate-budget recall#5

prompts: structural triggers for over-block + aggregate-budget recall#5
frankekn merged 2 commits into
mainfrom
feat/recall-overblock-budget

frankekn commented Jun 28, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 28, 2026

Uh oh!

github-actions Bot left a comment

Uh oh!

github-actions Bot Jun 28, 2026

Uh oh!

github-actions Bot left a comment

Uh oh!

frankekn commented Jun 28, 2026

Uh oh!

Uh oh!

frankekn commented Jun 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		# Required output when triggered (structural — not optional)
		Two bug classes look "fine" under per-line inspection because each individual guard behaves correctly. They are real defects anyway. When EITHER trigger matches the diff, you MUST produce the enumeration below; a triggered section left blank or absent means you cannot conclude pass — put it under residual_risks with blocks=true.

Conversation

frankekn commented Jun 28, 2026

Problem

Root cause (probed, not assumed)

Change

Probe results (gpt-5.5, N=1 each, small path — directional)

Not measured (honest)

Follow-ups

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 28, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Needlefish PR Review

Findings

P2: Mirror the structural triggers into the deep-review prompt

Checked

Residual Risk

Uh oh!

github-actions Bot Jun 28, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Needlefish PR Review

Findings

Checked

Uh oh!

frankekn commented Jun 28, 2026

Negative-control result (precision check)

Key points

Honest scope

Uh oh!

Uh oh!

frankekn commented Jun 28, 2026

Stability re-runs (N=3, de-risking the earlier N=1)

Recall — stable

Precision — stable, zero spurious

One honest weak spot — #104 over-block detection rate

Net

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant