#235: detector precision — stop mis-flagging correct external reviews by seungpyoson · Pull Request #237 · seungpyoson/relay

seungpyoson · 2026-06-21T15:10:09Z

What

Root 2 / Way 1 (detector precision). The review-quality gate was classifying genuinely-correct external reviews as failures via three over-broad detectors — the Arm B repro (scripts/ab/relay-quality-ab.mjs) nulled 5/7 known-correct reviews. This narrows each detector so correct reviews stop being flagged, while genuinely-bad reviews stay flagged so the buildReviewSlotDisposition demotion gate keeps preventing false approvals.

Each detector change only narrows its flag and fails toward flagging when in doubt.

Detectors tightened (`scripts/lib/review-prompt.mjs`)

permission_blocked — suppress only when a line praises the reviewed code's permission-error handling AND carries no review-process read-denial signal. A reviewer saying I cannot read the file still flags.
not_reviewed — a could-not-inspect gap attributed to a foreign file is out-of-scope (not a denial of the selected source) only when the review independently proves it inspected the selected source. Generic/selected-path denials, and gaps without proven inspection, stay flagged.
shallow_output — a short review is substantive only when one clause co-locates a concrete defect cue with a code locus and no negation. A bare Verdict: APPROVE stays shallow.

Safety

New 20-case adversarial corpus drives the real buildReviewAuditManifest entry point (the composed path, not leaf functions), asserting both directions: correct → not flagged; genuinely-bad → still failed_review_slot=true.
All 9 plugin/relay copies kept byte-identical via sync-review-prompt (--check exit 0).

Verification

node scripts/ci/sync-review-prompt.mjs --check → exit 0
npm run lint:sync → exit 0
894/894 targeted regression (review-prompt + every failed_review_slot consumer: review-panel, external-model-failure-core, provider-route-policy, contracts, plugin-copies-in-sync)
Full npm test (smoke/tunnel) delegated to CI.

Scope boundary

missing_verdict / prose-verdict is out of scope — owned by Root 3 (verdict grammar).
Way 2 (advisory disposition state machine) deferred to Add a 'counted-but-advisory' review disposition (surface borderline reviews for operator judgment instead of demoting to failed_slot) #236.

Closes #235
Refs #232

…correct reviews (#235) Root 2 / Way 1 (detector precision). The review-quality gate classified genuinely-correct external reviews as failures via three over-broad detectors (Arm B repro nulled 5/7 known-correct reviews). Narrow each so correct reviews stop being flagged while genuinely-bad reviews STAY flagged (the buildReviewSlotDisposition demotion gate must keep preventing false approvals). Each change only narrows its flag and fails toward flagging when in doubt. - permission_blocked: suppress only when a line praises the reviewed code's permission-error handling AND carries no review-process read-denial signal. A reviewer 'I cannot read the file' still flags. - not_reviewed: a could-not-inspect gap attributed to a foreign file is out-of-scope only when the review independently proves it inspected the selected source. Generic/selected-path denials, and gaps without proven inspection, stay flagged. - shallow_output: a short review is substantive only when one clause co-locates a concrete defect cue with a code locus and no negation. A bare 'Verdict: APPROVE' stays shallow. Adds a 20-case adversarial safety corpus driving the real buildReviewAuditManifest entry point, asserting both directions (correct -> not flagged; genuinely-bad -> still flagged). All 9 plugin copies kept byte-identical via sync-review-prompt. Way 2 (advisory disposition state machine) deferred to #236. Closes #235 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request refines the review quality assessment logic across multiple plugins and relays by introducing more precise detectors for permission handling, out-of-scope inspection gaps, and substantive findings in short reviews. It also adds a comprehensive unit test suite to prevent regressions. The review feedback correctly identifies two critical bugs: a regex limitation in NON_SELECTED_FILE_TOKEN_RE that fails to match files with multiple dots (e.g., webpack.config.js), and a logical conflict where valid defect cues containing negations (like 'should not' or 'never') are erroneously flagged as shallow due to overlapping negation checks. Both comments provide highly valuable and actionable code suggestions to address these issues.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

…overage baseline PR #237 CI coverage gate failed: plugins/*/scripts/lib/review-prompt.mjs line coverage 92.33% < baseline 95.73%. The new root2 detector1/2/3 corpus tests exercised only the shared-source buildReviewAuditManifest, leaving the new detector lines uncovered in the 5 plugin copies (the gate measures the copies, not the source). Parametrize the 20 corpus tests over all 6 REVIEW_PROMPT_MODULES using the file's existing loadReviewPromptModule convention, so the new detector code runs in every plugin copy. Copies now at 97.47% line (>= 95.73% baseline); 410 tests pass; sync-review-prompt --check clean (copies untouched). Refs #235 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…onarCloud S5852 (#235) The three review-quality concrete-finding locus regexes ran with unbounded */+ quantifiers on attacker-controllable external-review text. SonarCloud flagged the ident-call and ident.member regexes as javascript:S5852 (super-linear regex backtracking). Empirically all three are already linear (the lookbehind anchors prune start positions), but bound every quantifier so linearity is STRUCTURALLY provable and the analyzer no longer fires: path-prefix {0,255}, filename {1,128}, line-number {1,9}, identifier {0,128}, inter-token whitespace {0,16}. Bounding yields a strict SUBSET of the original language, so the Root-2 shallow-output detector still only narrows (fails toward flagging) by construction. Verified independently: 0 divergences on the realistic+boundary corpus, 0 subset violations on adversarial input, hasConcreteFinding identical on full reviews, linear timing to 256k chars (<0.6ms). Re-synced all eight plugin/relay copies (byte-identical). Added a ReDoS-timing guard (200k-char adversarial input under a 2000ms budget) and a long-but-realistic call-locus equivalence anchor, both parametrized over all six review-prompt modules. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… file token + negation-cue conflict) Two HIGH-priority findings from the gemini-code-assist review of #237, both reproduced through the real buildReviewAuditManifest entry point before fixing. Detector 2 (not_reviewed) — NON_SELECTED_FILE_TOKEN_RE tokenized multi-dot filenames partially (webpack.config.js -> webpack.config), so a basename denial of a directory-prefixed multi-dot SELECTED source was mis-read as a foreign-file gap and suppressed, letting a genuine selected-source denial bypass the gate (unsafe direction). It now carries interior dotted segments whole via a bounded (?:\.seg){0,8} group; every quantifier is upper-bounded so the token scan stays linear-time (no new S5852). The fix flags more (safe direction) and is equivalent to the old regex on single-dot paths (0 token divergences). Detector 3 (shallow_output) — defect cues that legitimately contain negation words (never closed, does not handle, should not) were killed by CONCRETE_FINDING_NEGATION, mis-flagging valid concrete reviews as shallow. hasConcreteFinding now strips the matched defect cue(s) from the clause before the negation check (global strip, replaced with a space to keep word boundaries); genuine negations/absence (no off-by-one, correctly, missing nothing) sit outside the cue and survive, so praise/absence LGTMs still stay flagged. Re-synced across all 8 plugin/relay copies. Added both-direction regression tests (flag + still-suppress / not-shallow + still-shallow), parametrized over all six review-prompt modules. Verified: 414 review-prompt + 442 affected-unit + 696 smoke tests pass; lint:sync clean; copy coverage 97.60% (>= 95.73% baseline). Refs #235 #237 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

…locked detectors Three external reviewers (Kimi, Claude, GPT) found unsafe false-negatives this PR introduced; an adversarial sweep then showed an enumeration patch just moved the error around. This replaces it with structural predicates and accepts an explicit, tracked residual (#238) for the surface-undecidable cases. shallow_output (hasConcreteFinding): drop the fragile cue-strip. A clause is a concrete finding only with cue + locus + NOT a confirmation (CONCRETE_FINDING_ PRAISE: '<cue> ... as expected/promised') + NOT a dismissal (CONCRETE_FINDING_ DISMISSAL: a negation BOUND to a defect noun within 2 words, or correctness- praise / LGTM — not a bare 'no'/'never' that merely appears in the finding). An adversative-tail override keeps a real defect after 'but/yet/whereas' from being suppressed. Fixes the 'should not <benign>' / 'as expected' leaks AND the 'no bounds check' / 'clean teardown' / 'none of the keys' false-positives. permission_blocked (reviewProcessBlockedSignal): normalize unicode apostrophes; add a first-person-anchored block regex (i/we + no-inspection cue + inspection verb/source-object, one sentence, bounded for linear time); REMOVE the bare artifact nouns ('source file'/'selected source'/...) and non-first-person 'read the X' phrases that fired on third-person code praise. Perception verbs see/saw/ view are excluded ('i did not see any issues' is praise, not a block). Fixes the contraction/paraphrase/unicode reviewer FNs AND the 8 bare-noun false-positives ('I inspected the selected source ...'), and clears pre-existing main FPs. Verified: oracle 0 decidable fails; 416/416 review-prompt + 179 affected-unit; lint:sync clean; copy coverage 97.65%; two adversarial sweeps (shallow leaks 24->9, FPs 15->4). Residual (surface-undecidable hedges/3rd-person blocks) and the PRE-EXISTING structural no-literal raising-detector gap are tracked for the Way-2 redesign (#236/#238), not patched by enumeration. Refs #235 #237 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@200k

…cap (behavior-identical) The a2a7be1 redesign tripped SonarCloud new_maintainability_rating=C: four regexes over the S5843 complexity-20 cap (DISMISSAL 215, block 96, DEFECT_CUE 51, PRAISE 43) plus minor S6397/S6535/S6594. Decompose each mega-regex into small sub-patterns behind a helper, moving fixed phrases to includesAny string lists: - hasConcreteFinding: CONCRETE_FINDING_DISMISSAL -> clauseIsDismissal() over six small regexes (negated-defect / should-not / nothing / absence / correctly / looks) + an LGTM includesAny list; PRAISE simplified to bare 'as <confirmation>' (the cue is already required); contrast match -> firstContrastIndex() via indexOf (drops .match, fixes S6594). - CONCRETE_FINDING_DEFECT_CUE -> hasDefectCue() over three small sub-patterns. - reviewProcessBlockedSignal first-person regex -> REVIEWER_BLOCK_CUE + REVIEWER_BLOCK_TARGET (positional split via reviewerFirstPersonBlock()). - S6397: [\w]{0,4}->\w{0,4}, examine[d]?->examine(?:...)? ; S6535: drop \[ escape. Behavior-identical: oracle 0 decidable-fails, 416/416 review-prompt + 103 affected-unit, lint:sync clean, ReDoS still linear (17ms @200k). No test changes. Refs #235 #237 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…plexity cap Follow-up to a2d77aa: the 6 regexes still over SonarCloud's S5843 cap (block-cue 40, negated-defect 32, block-target 23, correctly 24, absence 22, defect-verb 21) are split into paired sub-patterns whose union is identical. Block cue/target each split in two; the normalizer now also expands contractions to full forms (couldn't -> could not, requiring the apostrophe so 'important'/'content' are untouched) so the cue patterns need no apostrophe-variant branches. Dropped the dead n['o]?t negation arm (the leading word boundary made it unmatchable inside contractions). Behavior-identical: oracle 0 decidable-fails, 416/416 review-prompt + affected-unit, lint:sync clean, ReDoS still linear; corruption check confirms nt-ending words near a permission literal stay clean while curly couldn't/wasn't still flag. No test changes. Refs #235 #237 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

PR #237's not_reviewed foreign-path branch cleared a gap when a review named an out-of-packet file AND (claimed to have) inspected the selected source -- a suppression `main` flags. An adversarial sweep plus a 7-case main-vs-branch matrix proved it unfixable by enumeration: every text-based inspection signal is spoofable, and the only non-spoofable gate (named file provably absent from sourceFiles) regresses a no-engagement APPROVE (flag -> clean), breaking fail-toward-flagging. A disclaimer-enumeration hardening attempt (F3) was likewise leaky. Revert to main's contract: a could-not-inspect gap is suppressed ONLY when the reviewer EXPLICITLY marks it out of scope. Removes the foreign-path branch, selectedSourceInspected, namesNonSelectedFileGapLine, NON_SELECTED_FILE_TOKEN_RE, and the F3 disclaimer list across the source plus 8 synced copies. Net effect on not_reviewed is strictly toward flagging (verified == main on the matrix; adversarial pass: 0 confirmed bugs; residual holes pre-existing on main, tracked in #238). Keeps F2: DISMISSAL_SHOULD_NOT_PASSIVE restores the passive "should not be affected" dismissal alternative that the a5c2868 regex split silently dropped, with a split-identity property test pinning every pre-split alternative so a future split cannot narrow it again. Verified: review-prompt 416/416, full suite 2597 pass / 0 fail, coverage 85.00% met, sync --check rc=0, lint pass. Refs #232 #235 #236 #238 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…tor oracles PR #237 round-2 review (GLM/GPT/Claude), every finding reproduced through buildReviewAuditManifest with a main-vs-branch delta: the a2a7be1 isPermissionHandlingPraiseLine suppressor let a GENUINELY permission-blocked APPROVE pass CLEAN whenever a line also praised the code's EACCES handling. Its enumerated un-suppression guard (reviewProcessBlockedSignal) missed a cross-line "read" block, a third-person block, two-word "can not", a U+FF07 glyph, and "my open" -- 5 live false-negatives (the gravest class: un-reviewed code clearing review). Revert the suppressor subsystem entirely (isPermissionHandlingPraiseLine + codeCorrectlyHandlesPermissionError + reviewerFirstPersonBlock + reviewProcessBlockedSignal + REVIEWER_BLOCK_*; all a2a7be1-added, absent on main, consumed only by the suppressor). Post-revert any line carrying a permission literal flags permission_blocked (== main). A review that merely praises EACCES handling is now conservatively over-flagged: an accepted SAFE false positive; the precision goal is Way-2 #238. Across source + 8 synced copies. Close the MAJOR oracle-gap finding (test-only, pins EXISTING behavior): add a hasDefectCue four-way split-identity oracle (every DEFECT_CUE_* alternative escapes shallow alone) and complete SHOULD_NOT_CANON to all 9 DISMISSAL_SHOULD_NOT_PASSIVE participles (was 5/10), so a future "behavior- identical" split cannot silently drop a cue/participle. NOT shipped (deferred to Way-2 #238 with reproductions): the dismissal-lexicon completions for negated-impact reassurance / negated defect-noun / cause-synonyms. An independent adversarial sweep proved they are surface-undecidable (the F1 class): a 3-way (main / PR-baseline / candidate) verification showed they over-suppress realistic REAL findings ("no overflow guard", "should not introduce X yet it does") -- 6 over-suppression regressions vs the PR baseline. Like F1, enumeration trades dismissal-only FN fixes for real-finding FPs; routed to the advisory-disposition redesign instead. Verified: review-prompt 418/418, full suite green, coverage 85% met, sync --check rc=0, lint pass. Refs #232 #235 #236 #238 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Brings #237 (Root 2 Way 1, #235 detector precision) current with main (9e1298e, post-#233). Single conflict in tests/unit/review-prompt.test.mjs resolved as a union: kept both the #235 detector-precision corpus (HEAD) and the #238 disposition guard (main, from already-merged #244) — complementary, not redundant. Propagated the merged review-prompt.mjs to the AGY copy (added by #218) via sync-review-prompt and regenerated relay build outputs. Verified: npm run lint exit 0 (all sync checks), full suite 2789/2783/0/6.

Restore the review-prompt module to the 1365a1e merge-ready baseline. The source-symbol grounding for concise concrete reviews (32cd266 / 0168a25) was unanimously blocked by the external fix-delta panel: spoofable by a fixed common-identifier template (~88% of packets) and it over-suppresses genuine prescriptive reviews. The concise-review false-approval hole is deferred to a deterministic-floor redesign (contain + escalate), tracked separately. Retains the independent agy review-prompt parity-test line from 83aabfa (orthogonal to grounding). review-prompt.test: 480/0. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

A <500-char review can no longer satisfy a slot via a concrete-finding-shaped clause (panel-proven ~88% spoofable + over-suppresses real reviews). Short non-tiny reviews now ride the existing looks_shallow -> failed_review_slot -> failed_slot path. Objective conciseTinyReview (tiny-source) exemption kept. Stage 2 (model judge) will restore safe short-review recognition.

sonarqubecloud · 2026-06-25T09:05:32Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

gemini-code-assist Bot reviewed Jun 21, 2026

View reviewed changes

Comment thread scripts/lib/review-prompt.mjs Outdated

Comment thread scripts/lib/review-prompt.mjs Outdated

Test User and others added 3 commits June 22, 2026 00:34

seungpyoson mentioned this pull request Jun 21, 2026

Root 2 follow-up: review-quality detector hardening (dead path/member loci + flaky background-cancel teardown) #238

Open

seungpyoson marked this pull request as ready for review June 21, 2026 17:29

greptile-apps Bot reviewed Jun 21, 2026

View reviewed changes

Test User and others added 12 commits June 22, 2026 05:04

fix: ground concise review loci in selected source

32cd266

test: cover agy review prompt copy

83aabfa

fix: require source grounding for concise concrete reviews

0168a25

revert: drop unrequested defect-cue lexicon expansion

5b676a1

seungpyoson merged commit 042c71d into main Jun 25, 2026
8 checks passed

seungpyoson deleted the fix/232-root2-discard branch June 25, 2026 09:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

#235: detector precision — stop mis-flagging correct external reviews#237

#235: detector precision — stop mis-flagging correct external reviews#237
seungpyoson merged 16 commits into
mainfrom
fix/232-root2-discard

seungpyoson commented Jun 21, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

greptile-apps Bot left a comment

Uh oh!

Uh oh!

sonarqubecloud Bot commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

seungpyoson commented Jun 21, 2026

What

Detectors tightened (scripts/lib/review-prompt.mjs)

Safety

Verification

Scope boundary

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

sonarqubecloud Bot commented Jun 25, 2026

Quality Gate passed

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Detectors tightened (`scripts/lib/review-prompt.mjs`)