feat: harden review pipeline#6
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8cce593c1f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (!patch.trim()) { | ||
| throw new Error(`No diff between ${mergeBase} and ${headSha}. Nothing to review.`); | ||
| } |
There was a problem hiding this comment.
Post a check result for empty diffs
When a GitHub review is run against a no-op PR or a manual dispatch after the branch already matches the base, this new throw happens before the try block that reports failures via postCheck. The workflow job exits without creating/updating the Needlefish check run, which can leave branch protection waiting on a missing check instead of getting the intended failure/neutral result; handle the empty-diff case inside the reporting path.
Useful? React with 👍 / 👎.
| if (lineEnd < lineStart) { | ||
| throw new Error("malformed finding: lineEnd before lineStart"); | ||
| } | ||
| const confidence = Math.max(0, Math.min(1, Number(record.confidence ?? 0))); |
There was a problem hiding this comment.
Reject non-finite confidence values
If the model returns a non-numeric confidence such as "high" for a P0/P1/P2 finding, Number(...) becomes NaN, the clamp still produces NaN, and confidence < 0.65 is false. That lets malformed blocking findings bypass the new confidence gate and request changes with NaN confidence (later serialized as null), so this should explicitly require Number.isFinite before accepting the value.
Useful? React with 👍 / 👎.
| if (mappedHotspots.length === 0) { | ||
| throw new Error("map pass produced no hotspots"); | ||
| } |
There was a problem hiding this comment.
Fall back when all mapped hotspots are filtered
When the map pass returns hotspots but changedHotspots() filters out every file (for example because the model used ./src/foo.ts paths or put only consumer files in files), this throw runs before the tail-coverage backstop below can review bundle.changedFiles. That turns a recoverable mapping-format issue in large/deep reviews into a failed review/check instead of still reviewing the changed files; let the backstop create a tail hotspot when mappedHotspots is empty.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Needlefish PR Review
Verdict: ⛔ changes_requested
Base: a7796d3 → Head: 8cce593
PR refactors Needlefish's CLI/review pipeline into shared repo/process/normalization modules, tightens prompt evidence contracts, and adds test/package/docs coverage. — PR refactors Needlefish's CLI/review pipeline into shared repo/process/normalization modules, tightens prompt evidence contracts, and adds test/package/docs coverage.
Findings
P1: Preserve deep-pass evidence before critic pruning
prompts/critic.md:15
Why this breaks: The changed critic contract at prompts/critic.md:15 requires a checked[] entry beginning with EVIDENCE finding: for every kept P0/P1/P2. The small-review prompt now requires that prefix at prompts/review.md:37, and deep reviews now have their own checked evidence contract at prompts/deep.md:16. But reviewLarge still discards res.checked at src/core/review.ts:152 and builds candidateMerged.checked from summary-only strings at src/core/review.ts:166 before invoking the critic. A legitimate deep P2 with checked evidence is therefore presented to the critic without per-finding evidence, so a compliant critic is instructed to delete it and the large-review verdict can pass incorrectly.
Suggested fix: When merging each deep result, preserve res.checked entries in the merged checked array without breaking the EVIDENCE finding: prefix, and align the deep prompt to emit that per-finding evidence prefix for blocking findings.
Validation: Add a reviewLarge test with stubbed map/deep/critic where deep returns a P2 plus EVIDENCE finding:; assert the critic input includes that evidence and the final blocking finding survives.
P2: Reject nonnumeric blocking confidence
src/shared/normalize.ts:147-148
Why this breaks: The new confidence gate computes confidence with Number(...), but nonnumeric input becomes NaN; because NaN < 0.65 is false, a malformed P0/P1/P2 finding with confidence: "bad" is returned instead of rejected. That malformed finding then reaches the consuming verdict predicate at src/core/verdict.ts:9, where severity alone produces changes_requested, so invalid model output can block a PR.
Suggested fix: Parse confidence into a temporary number, reject non-finite values with Number.isFinite, then clamp and apply the low-confidence severity gate.
Validation: Add a schema test for a P2 finding with nonnumeric confidence and run npm test.
Checked
- EVIDENCE finding:Reject nonnumeric blocking confidence changed=src/shared/normalize.ts:147 effect=Number("bad") becomes NaN, the low-confidence check at src/shared/normalize.ts:148 does not throw, and src/core/verdict.ts:9 turns P0/P1/P2 severity into changes_requested.
- EVIDENCE finding:Preserve deep-pass evidence before critic pruning changed=prompts/critic.md:15 effect=large-review candidateMerged.checked at src/core/review.ts:166 contains only per-hotspot summaries from src/core/review.ts:152, so a compliant critic sees no per-finding evidence for deep P0/P1/P2 findings.
- [Schema and normalization gates] Found one validation bug in the new normalization gate: malformed blocking confidence can bypass the low-confidence rejection and still block a PR.
- [Review orchestration and prompt contracts] One blocking contract regression: large-review deep evidence is dropped before the critic, so legitimate P0/P1/P2 findings can be pruned.
- [Execution adapters and bundle construction] No actionable defects found in this execution-adapter and bundle-construction surface.
- [Codex execution and JSON extraction] No correctness, contract, runtime, security, validation, duplicate, or regression findings in this surface.
- [CLI argument contract and dispatch] No correctness, contract, routing, or validation defects found in this CLI argument surface.
- [Verdict, classification tests, and package test script] No correctness findings in this surface.
Residual Risk
- Static review only; GitHub API endpoints were not exercised because the requested process was read-only inspection of this surface.
- Did not run the full test or typecheck suite; verification was limited to diff/source inspection and targeted runtime checks.
| if (lineEnd < lineStart) { | ||
| throw new Error("malformed finding: lineEnd before lineStart"); | ||
| } | ||
| const confidence = Math.max(0, Math.min(1, Number(record.confidence ?? 0))); |
There was a problem hiding this comment.
P2 (validation): Reject nonnumeric blocking confidence
The new confidence gate computes confidence with Number(...), but nonnumeric input becomes NaN; because NaN < 0.65 is false, a malformed P0/P1/P2 finding with confidence: "bad" is returned instead of rejected. That malformed finding then reaches the consuming verdict predicate at src/core/verdict.ts:9, where severity alone produces changes_requested, so invalid model output can block a PR.
Suggested fix: Parse confidence into a temporary number, reject non-finite values with Number.isFinite, then clamp and apply the low-confidence severity gate.
| - Re-check residual_risks: keep blocks=true only when it genuinely prevents a verdict (e.g. a deep pass that failed, leaving files unreviewed). | ||
|
|
||
| # Output | ||
| Before returning, ensure every kept P0/P1/P2 has a corresponding `checked[]` entry beginning with `EVIDENCE finding:`. If the evidence cannot be stated concretely, delete the finding. |
There was a problem hiding this comment.
P1 (bug): Preserve deep-pass evidence before critic pruning
The changed critic contract at prompts/critic.md:15 requires a checked[] entry beginning with EVIDENCE finding: for every kept P0/P1/P2. The small-review prompt now requires that prefix at prompts/review.md:37, and deep reviews now have their own checked evidence contract at prompts/deep.md:16. But reviewLarge still discards res.checked at src/core/review.ts:152 and builds candidateMerged.checked from summary-only strings at src/core/review.ts:166 before invoking the critic. A legitimate deep P2 with checked evidence is therefore presented to the critic without per-finding evidence, so a compliant critic is instructed to delete it and the large-review verdict can pass incorrectly.
Suggested fix: When merging each deep result, preserve res.checked entries in the merged checked array without breaking the EVIDENCE finding: prefix, and align the deep prompt to emit that per-finding evidence prefix for blocking findings.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5443ca4d1a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| throw new Error(`malformed finding: invalid confidence ${String(record.confidence)}`); | ||
| } | ||
| const confidence = Math.max(0, Math.min(1, rawConfidence)); | ||
| if (severity !== "P3" && confidence < 0.65) { |
There was a problem hiding this comment.
Use the documented confidence cutoff
When a model returns a blocking P0/P1/P2 finding with confidence in the 0.65–0.69 range, this accepts it even though the review prompt explicitly says findings below 0.70 must be downgraded to P3 or residual risk. Since deriveVerdict treats any accepted P0/P1/P2 as changes_requested, borderline low-confidence output can incorrectly block a PR; raise this cutoff to 0.70 to match the policy.
Useful? React with 👍 / 👎.
| if (mappedHotspots.length === 0) { | ||
| throw new Error("map pass produced no hotspots"); |
There was a problem hiding this comment.
Fall back when mapped hotspots miss changed files
If the map pass returns hotspots whose files do not exactly match the changed paths (for example ./src/app.ts, or only a consumer file), changedHotspots filters them all out and this throws before the tail-coverage backstop can add bundle.changedFiles. That turns a recoverable map miss into a failed review/check for --deep or large PRs; build the tail hotspot before deciding there is nothing to review.
Useful? React with 👍 / 👎.
| findings, | ||
| checked: rawChecked.map(String), |
There was a problem hiding this comment.
Enforce evidence for blocking findings
The new prompt contract requires every P0/P1/P2 finding to have a matching checked[] entry starting with EVIDENCE finding:<title>, but the normalizer accepts blocking findings without checking that proof exists. If the critic returns a P2 finding with a generic or missing checked entry, deriveVerdict still turns it into changes_requested, so malformed low-evidence output can block a PR despite the hard rule; reject or downgrade blocking findings that lack the required evidence entry.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Needlefish PR Review
Verdict: ⛔ changes_requested
Base: a7796d3 → Head: 5443ca4
Refactors Needlefish around parsed CLI commands, centralized repo/process helpers, async Codex review orchestration, stricter model-output normalization, and supporting tests/docs. — Refactors Needlefish around parsed CLI commands, centralized repo/process helpers, async Codex review orchestration, stricter model-output normalization, and supporting tests/docs.
Findings
P2: Keep tail coverage after filtering map hotspots
src/core/review.ts:118-119
Why this breaks: The new mappedHotspots guard rejects an otherwise recoverable map result before the existing tail-coverage path runs. If the map pass returns hotspots whose files are all non-changed consumers, changedHotspots filters them to [], this throw aborts the review, and the changed files never reach the tail hotspot at src/core/review.ts:127.
Suggested fix: Let tail coverage run when mappedHotspots is empty; only fail after adding uncovered changed files if there is still no hotspot to review.
Validation: Add a reviewLarge test where the fake map returns only files:["src/unchanged.ts"] while changedFiles contains src/app.ts; assert the deep prompt receives tail-coverage for src/app.ts and review completes.
Checked
- [Review orchestration, prompts, and Codex execution] One P2 finding: large-review can fail before tail coverage handles filtered map output.
- [Local/GitHub adapters and repo/process shelling] No correctness findings found in this review surface.
- [Model output schema and normalization] No findings; schema/normalization changes traced through the listed consumers without a concrete regression.
- [CLI command contract and option validation] No correctness findings for this CLI parser surface.
- [Node test script wiring] No merge-blocking issues found in the package test-script wiring surface.
Residual Risk
- Static inspection only; no build or test commands were run.
| if (mapResult.hotspots.length === 0) { | ||
| const mapResult = normalizeMap(extractJson(await runCodex(mapPrompt, { repoPath: bundle.repoPath }))); | ||
| const mappedHotspots = changedHotspots(mapResult.hotspots, bundle); | ||
| if (mappedHotspots.length === 0) { |
There was a problem hiding this comment.
P2 (bug): Keep tail coverage after filtering map hotspots
The new mappedHotspots guard rejects an otherwise recoverable map result before the existing tail-coverage path runs. If the map pass returns hotspots whose files are all non-changed consumers, changedHotspots filters them to [], this throw aborts the review, and the changed files never reach the tail hotspot at src/core/review.ts:127.
Suggested fix: Let tail coverage run when mappedHotspots is empty; only fail after adding uncovered changed files if there is still no hotspot to review.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 632f728ba0
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (!patch.trim()) { | ||
| throw new Error(`No diff between ${mergeBase} and ${headSha}. Nothing to review.`); | ||
| } |
There was a problem hiding this comment.
Post a check when GitHub diff is empty
When GitHub mode is invoked for a PR whose head has no effective diff against the merge base, this new throw runs before the try block that posts the failure check. The top-level CLI catch only writes stderr, so the documented Needlefish check run is never created or updated; if that check is required, the PR can remain pending instead of getting a clear failure/neutral result. Move this path under the checked error handling or post a check before returning.
Useful? React with 👍 / 👎.
|
|
||
| export async function review(bundle: Bundle): Promise<ReviewResult> { | ||
| return isLarge(bundle) ? reviewLarge(bundle) : reviewSmall(bundle); | ||
| return bundle.deep || isLarge(bundle) ? reviewLarge(bundle) : reviewSmall(bundle); |
There was a problem hiding this comment.
Preserve PR metadata in deep reviews
When a local review is run as needlefish --deep --pr 123, this newly routes even small diffs through reviewLarge, but that path builds the map/deep prompts without bundle.prMeta. The --pr contract says the PR body/comments/checks are pulled into the review context, so combining it with --deep now silently discards that metadata; pass prMeta into the large-review prompts or keep small deep reviews on the full-bundle path.
Useful? React with 👍 / 👎.
| - Re-check residual_risks: keep blocks=true only when it genuinely prevents a verdict (e.g. a deep pass that failed, leaving files unreviewed). | ||
|
|
||
| # Output | ||
| Before returning, ensure every kept P0/P1/P2 has a corresponding `checked[]` entry beginning with `EVIDENCE finding:`. If the evidence cannot be stated concretely, delete the finding. |
There was a problem hiding this comment.
Add deep evidence before critic pruning
For large diffs or --deep, candidate findings come from deep.md, whose new checked contract asks for FILES, EDGE, and trigger entries but not EVIDENCE finding: entries. This critic requirement can therefore cause valid P0/P1/P2 findings from the deep pass to be deleted solely because the upstream prompt never requested the required marker; add the same finding-evidence contract to the deep prompt or let the critic create the evidence entry after re-verifying.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Needlefish PR Review
Verdict: ⛔ changes_requested
Base: a7796d3 → Head: 632f728
This PR refactors Needlefish around shared CLI parsing, repo/process helpers, model-output normalization, and the large-review map/deep/critic pipeline. — Kept two concrete P2 findings: the new confidence contract is enforced too loosely, and --deep now drops --pr metadata for small local reviews.
Findings
P2: Raise the blocking-confidence cutoff to 0.70
src/shared/normalize.ts:152-153
Why this breaks: The review prompt contract says findings below 0.70 confidence must not be emitted as P0/P1/P2, but this changed predicate only rejects blocking findings below 0.65. Substitute a model finding with severity='P2' and confidence=0.66: normalizeFinding accepts it, then the consuming guard at src/core/verdict.ts:9 treats P2 as blocking and returns changes_requested, so a below-contract weak finding can fail the GitHub check.
Suggested fix: Use the same 0.70 cutoff as the prompt contract, or demote/reject P0/P1/P2 findings whose confidence is below 0.70 before verdict derivation.
Validation: Add a normalizeFinding test for P2 confidence 0.69 and run pnpm test.
P2: Preserve PR metadata in deep reviews
src/core/review.ts:178
Why this breaks: Setting bundle.deep now routes even small local reviews into reviewLarge. Local mode still populates bundle.prMeta from --pr at src/adapters/local.ts:77, but reviewLarge builds the map/deep prompts from a subset of the bundle at src/core/review.ts:106 and never includes prMeta; the deep prompt only receives AGENTS/HOTSPOT/FOCUS/BASE/HEAD at src/core/review.ts:142. The small path this bypasses serializes the full bundle at src/core/review.ts:95, so needlefish --pr 123 --deep silently drops PR body/comments/checks even though the CLI contract says --pr pulls them.
Suggested fix: Include prMeta in the map/deep prompt context, or keep --pr reviews on a path that passes the full bundle to the model.
Validation: Add a review test with deep=true and non-null prMeta whose fake Codex asserts that the map/deep prompt contains the PR body/comments.
Checked
- EVIDENCE finding:Raise the blocking-confidence cutoff to 0.70 changed=src/shared/normalize.ts:152 effect=P2 confidence 0.66 is accepted and deriveVerdict at src/core/verdict.ts:9 returns changes_requested
- EVIDENCE finding:Preserve PR metadata in deep reviews changed=src/core/review.ts:178 effect=--pr metadata loaded at src/adapters/local.ts:77 is routed into reviewLarge, whose map/deep prompt construction omits prMeta
Residual Risk
- Tests were not run because this pruning pass used read-only diff/source inspection.
| throw new Error(`malformed finding: invalid confidence ${String(record.confidence)}`); | ||
| } | ||
| const confidence = Math.max(0, Math.min(1, rawConfidence)); | ||
| if (severity !== "P3" && confidence < 0.65) { |
There was a problem hiding this comment.
P2 (validation): Raise the blocking-confidence cutoff to 0.70
The review prompt contract says findings below 0.70 confidence must not be emitted as P0/P1/P2, but this changed predicate only rejects blocking findings below 0.65. Substitute a model finding with severity='P2' and confidence=0.66: normalizeFinding accepts it, then the consuming guard at src/core/verdict.ts:9 treats P2 as blocking and returns changes_requested, so a below-contract weak finding can fail the GitHub check.
Suggested fix: Use the same 0.70 cutoff as the prompt contract, or demote/reject P0/P1/P2 findings whose confidence is below 0.70 before verdict derivation.
|
|
||
| export async function review(bundle: Bundle): Promise<ReviewResult> { | ||
| return isLarge(bundle) ? reviewLarge(bundle) : reviewSmall(bundle); | ||
| return bundle.deep || isLarge(bundle) ? reviewLarge(bundle) : reviewSmall(bundle); |
There was a problem hiding this comment.
P2 (contract): Preserve PR metadata in deep reviews
Setting bundle.deep now routes even small local reviews into reviewLarge. Local mode still populates bundle.prMeta from --pr at src/adapters/local.ts:77, but reviewLarge builds the map/deep prompts from a subset of the bundle at src/core/review.ts:106 and never includes prMeta; the deep prompt only receives AGENTS/HOTSPOT/FOCUS/BASE/HEAD at src/core/review.ts:142. The small path this bypasses serializes the full bundle at src/core/review.ts:95, so needlefish --pr 123 --deep silently drops PR body/comments/checks even though the CLI contract says --pr pulls them.
Suggested fix: Include prMeta in the map/deep prompt context, or keep --pr reviews on a path that passes the full bundle to the model.
|
@codex review |
|
Codex Review: Didn't find any major issues. 🚀 Reviewed commit: ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
There was a problem hiding this comment.
Needlefish PR Review
Verdict: ✅ pass
Base: a7796d3 → Head: 15103d2
This PR refactors Needlefish's CLI/adapters into shared helpers, adds large-review map/deep orchestration with stricter prompt contracts, and moves model/PR JSON parsing into typed normalization. — This PR refactors Needlefish's CLI/adapters into shared helpers, adds large-review map/deep orchestration with stricter prompt contracts, and moves model/PR JSON parsing into typed normalization.
Findings
No actionable findings. Prefer this over padding weak ones.
Checked
- [Review orchestration, prompt contracts, and Codex execution] No concrete correctness, security, contract, runtime, or validation defects found in this review surface.
- [Adapter bundle construction and GitHub/local side effects] No reportable correctness, contract, runtime, security, validation, or duplicate-behavior defects found in this surface.
- [Model and PR output normalization with verdict schema] No actionable findings for this normalization/verdict surface.
- [CLI argument parser and command dispatch] No correctness findings in the CLI parser and dispatch surface.
Residual Risk
- Typecheck and tests could not be used as evidence in this environment because dependencies are not installed; review conclusion is based on diff and consumer tracing.
- Dynamic CLI execution/tests were not run; this review used the requested read-only diff and static consumer tracing.
Summary
Verification