feat: map-reduce for large PRs#4
Conversation
…-> merge -> critic); stop stuffing full diff
There was a problem hiding this comment.
Needlefish PR Review
Verdict: ⛔ changes_requested
Base: b31bd65 → Head: 886c149
CHANGES_REQUESTED: large-PR review can skip mapped surfaces, drop blocking residual risks, and give the critic insufficient diff context.
Findings
P1: Review every mapped hotspot before returning a verdict
src/core/review.ts:93
Why this breaks: The map prompt allows up to 8 hotspots and requires every changed file to appear in one, but this line silently slices the mapped list to 6. On a large PR mapped into 7 or 8 surfaces, the omitted surfaces get no deep review, no findings, and no checked entry, so Needlefish can return PASS while some changed files were never reviewed.
Suggested fix: Process all returned hotspots, or make MAX_HOTSPOTS match the prompt and enforce that the reviewed hotspot union covers every changed file before deriving a verdict.
Validation: Add a unit/integration test with a mocked map result containing 7 hotspots and assert reviewLarge invokes the deep pass for all 7 or fails closed.
P1: Preserve residual risks from deep review passes
src/core/review.ts:113
Why this breaks: Each deep pass returns residual_risks, and deriveVerdict maps any blocks=true risk to needs_human. The large-review merge replaces all deep residual risks with an empty array, so a deep pass that could not verify a material guard no longer blocks the verdict and the final review can incorrectly pass.
Suggested fix: Accumulate res.residual_risks from each deep pass, include them in candidateMerged, let the critic prune them, and derive the final verdict from the pruned residual risks.
Validation: Mock one deep result with residual_risks:[{text:'unverified guard',blocks:true}] and assert the final ReviewResult verdict is needs_human unless the critic explicitly removes it.
P2: Give the critic the diff it is required to validate
src/core/review.ts:117
Why this breaks: critic.md says it receives the original diff and must delete findings not introduced by this diff or not tied to changed lines. In large mode this passes only --stat output, and candidateMerged also omits base/head SHAs, so the critic cannot verify changed-line anchoring for valid deep findings. That can cause real findings to be pruned as unverifiable, or weak findings to survive without the intended diff check.
Suggested fix: Pass the full diff for files that have candidate findings, or include base/head SHAs and update the critic prompt to run git diff for those files before pruning.
Validation: Run a large-review test where a deep pass emits a valid finding on a changed line and assert the critic prompt contains enough unified diff context, not only diffstat.
Checked
- Inspected the provided diff for src/core/review.ts, prompts/map.md, prompts/deep.md, prompts/critic.md usage, and schema changes.
- Verified map prompt allows up to 8 hotspots while reviewLarge reviews only the first 6.
- Verified reviewLarge drops deep residual_risks before critic/verdict by constructing candidateMerged with residual_risks: [].
- Verified reviewLarge passes bundle.patchStat at the critic PATCH slot in large mode instead of the original diff.
| throw new Error("map pass produced no hotspots"); | ||
| } | ||
|
|
||
| const hotspots = sortByRisk(mapResult.hotspots).slice(0, MAX_HOTSPOTS); |
There was a problem hiding this comment.
P1 (bug): Review every mapped hotspot before returning a verdict
The map prompt allows up to 8 hotspots and requires every changed file to appear in one, but this line silently slices the mapped list to 6. On a large PR mapped into 7 or 8 surfaces, the omitted surfaces get no deep review, no findings, and no checked entry, so Needlefish can return PASS while some changed files were never reviewed.
Suggested fix: Process all returned hotspots, or make MAX_HOTSPOTS match the prompt and enforce that the reviewed hotspot union covers every changed file before deriving a verdict.
| summary: mapResult.summary, | ||
| findings: merged, | ||
| checked, | ||
| residual_risks: [], |
There was a problem hiding this comment.
P1 (bug): Preserve residual risks from deep review passes
Each deep pass returns residual_risks, and deriveVerdict maps any blocks=true risk to needs_human. The large-review merge replaces all deep residual risks with an empty array, so a deep pass that could not verify a material guard no longer blocks the verdict and the final review can incorrectly pass.
Suggested fix: Accumulate res.residual_risks from each deep pass, include them in candidateMerged, let the critic prune them, and derive the final verdict from the pruned residual risks.
| }; | ||
| const criticPrompt = loadPrompt("critic.md") | ||
| .replace("{{FINDINGS}}", () => JSON.stringify(candidateMerged, null, 2)) | ||
| .replace("{{PATCH}}", () => bundle.patchStat || "(see git diff --stat; repo at head)"); |
There was a problem hiding this comment.
P2 (bug): Give the critic the diff it is required to validate
critic.md says it receives the original diff and must delete findings not introduced by this diff or not tied to changed lines. In large mode this passes only --stat output, and candidateMerged also omits base/head SHAs, so the critic cannot verify changed-line anchoring for valid deep findings. That can cause real findings to be pruned as unverifiable, or weak findings to survive without the intended diff check.
Suggested fix: Pass the full diff for files that have candidate findings, or include base/head SHAs and update the critic prompt to run git diff for those files before pruning.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 886c149439
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| throw new Error("map pass produced no hotspots"); | ||
| } | ||
|
|
||
| const hotspots = sortByRisk(mapResult.hotspots).slice(0, MAX_HOTSPOTS); |
There was a problem hiding this comment.
Review all mapped hotspots before passing
map.md permits up to 8 hotspots and requires every changed file to be placed in one, but this slice silently drops any surfaces after the first six. In a large PR with seven or eight independent surfaces, the omitted files never enter a deep pass or a blocking residual risk, so defects in those changed files cannot affect the final verdict. Review all mapped hotspots or explicitly return needs_human when any are skipped.
Useful? React with 👍 / 👎.
| summary: mapResult.summary, | ||
| findings: merged, | ||
| checked, | ||
| residual_risks: [], |
There was a problem hiding this comment.
Preserve blocking residual risks from deep passes
deep.md allows each hotspot review to return residual_risks, and deriveVerdict turns any blocks: true risk into needs_human; however this merge always replaces all hotspot risks with an empty array before the critic sees them. On large PRs where a hotspot cannot verify a material precondition, Needlefish can incorrectly post a passing review instead of escalating to a human. Carry forward and prune the collected residual risks.
Useful? React with 👍 / 👎.
| baseSha: bundle.baseSha, | ||
| headSha: bundle.headSha, | ||
| patchStat: bundle.patchStat, | ||
| changedFiles: bundle.changedFiles, | ||
| agentsMd: bundle.agentsMd, |
There was a problem hiding this comment.
Forward focus context into large-review prompts
For large diffs, this reduced map bundle is the only review context passed before the deep prompts, and it omits bundle.focus; the deep prompt also only receives the hotspot plus SHAs. As a result needlefish --focus security becomes a no-op whenever isLarge() selects this path, whereas small reviews still include the focus in the full bundle. Forward focus into the map/deep prompts.
Useful? React with 👍 / 👎.
…-hotspot try/catch + codex retry, forward deep residuals, tighter dedup, critic gets SHAs to re-verify guards
There was a problem hiding this comment.
Needlefish PR Review
Verdict: ⛔ changes_requested
Base: b31bd65 → Head: 8362481
changes_requested: the large-review path drops existing review options, making --focus/--deep ineffective on large PRs.
Findings
P2: Forward review options into large mode
src/core/review.ts:83-89
Why this breaks: The local CLI still records deep and focus in the Bundle, and small reviews pass the full Bundle into the review prompt. The new large-review map bundle omits both fields, and the deep prompt only receives the hotspot plus SHAs, so needlefish --focus security and needlefish --deep become no-ops whenever isLarge() selects this path.
Suggested fix: Include the requested review options in the large-mode map/deep prompt context and update the map/deep prompts to honor them.
Validation: Run a large local review with --focus security --deep using a Codex stub and assert the captured map/deep prompts include those option values.
Checked
- Reviewed the supplied diff for the large-review pipeline and prompt inputs.
- Confirmed reviewSmall serializes the full Bundle, while reviewLarge constructs a reduced mapBundle without focus/deep and builds deep prompts only from AGENTS, HOTSPOT, BASE, and HEAD.
- No downstream guard-rejection claim was present, so no guard substitution check was required.
- Kept the finding because it is introduced by the changed large-review path, affects existing CLI contract behavior, is line-anchored, and has a minimal fix.
Residual Risk
- Did not execute the test suite; this review used read-only static inspection of the supplied diff.
|
|
||
| // Large PR: map (blast-radius survey, no diff text) -> deep per hotspot -> merge -> critic. | ||
| async function reviewLarge(bundle: Bundle): Promise<ReviewResult> { | ||
| const mapBundle = { |
There was a problem hiding this comment.
P2 (contract): Forward review options into large mode
The local CLI still records deep and focus in the Bundle, and small reviews pass the full Bundle into the review prompt. The new large-review map bundle omits both fields, and the deep prompt only receives the hotspot plus SHAs, so needlefish --focus security and needlefish --deep become no-ops whenever isLarge() selects this path.
Suggested fix: Include the requested review options in the large-mode map/deep prompt context and update the map/deep prompts to honor them.
There was a problem hiding this comment.
Needlefish PR Review
Verdict: ✅ pass
Base: b31bd65 → Head: a64b50c
approved: no high-confidence blocking findings remained after pruning.
Findings
No actionable findings. Prefer this over padding weak ones.
Checked
- Reviewed the candidate finding against src/core/review.ts large-review flow in the supplied diff.
- Confirmed map.md explicitly instructs the map pass to keep hotspots to <= 6 total, so outputs beyond MAX_HOTSPOTS are already prompt-invalid rather than an expected review surface.
- Confirmed the alleged skipped case depends on an omitted hotspot sharing files with a selected hotspot while only carrying distinct edges; deep.md still instructs each selected file surface to inspect changed hunks and trace downstream consumers via edges or rg, so the claimed never-reviewed effect is not concrete enough to block merge.
- Kept the non-blocking residual risk about TypeScript validation being unavailable in this checkout.
Residual Risk
- TypeScript validation could not be used in this checkout because tsc cannot resolve Node built-in types/@types/node; source review still reached a verdict.
Summary
Changes
Base: main