feat: make the inner debug agent effective for web QA + run controls#9
Conversation
Driving the loop end-to-end against a real app surfaced several blockers that
unit tests couldn't (the agent never converged to a report). Fixed the loop so a
fast model (glm-5.2) can actually drive a web app and return useful findings, and
added the run-control surface.
Inner-loop fixes:
- act: flat input schema instead of z.discriminatedUnion — anyOf JSON-Schema made
glm emit empty {} args, so act never executed. Per-action requirements now
enforced at runtime.
- browser adapter: normalizeQuery maps LLM-natural targets (role+name, plain text)
onto Playwright role=/text= engines instead of only raw CSS.
- observe(tree): returns a ready-to-use `target` per node (role+name or text, with
>> nth= for duplicates) so the model copies a working selector instead of guessing.
- findings-store: cap screenshot-name slug at 60 chars (a long look question hit
ENAMETOOLONG and killed look).
- loop: step-budget nudge so the driver reports before the 30-step cap; agent.log
step + tool-error logging for observability.
Run controls:
- start_debug: always-on wall-clock timeout (default 300s, overridable via `timeout`).
- CLI `status` / `stop` over a new state.json breadcrumb; SIGTERM/SIGINT end the run
gracefully (abort loop, close browser, free profile).
Docs + fixtures:
- move idea/ -> docs/idea/; add docs/claude/SKILL.md (driving claude headless).
- rewrite README (MCP tools, CLI, timeout, skill); exclude dummy/ from biome.
- dummy/web: deliberately-buggy React+SCSS QA fixture (answer key in BUGS.md),
used to validate the loop with a tool-scoped `claude -p` black-box QA run.
291 tests pass; typecheck + lint clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Warning Review limit reachedYou’ve reached a temporary PR review limit under our Fair Usage Limits Policy. Next review available in: 33 minutes Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available. How can I continue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews. How do review limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please refer docs for additional details. Review details⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (17)
📝 WalkthroughWalkthroughAdds cross-process run state persistence ( ChangesCore source changes
Dummy web QA fixture
Sequence Diagram(s)sequenceDiagram
participant Client as MCP Client
participant Tool as start_debug tool
participant DS as DebugService
participant SP as FileStatePort
participant Timer as Wall-clock timer
Client->>Tool: start_debug({ target, goal, timeout })
Tool->>DS: start({ target, goal, timeoutMs })
DS->>SP: record({ sessionId, pid, target, goal })
DS->>Timer: armTimeout(timeoutMs)
Note over DS,Timer: timer fires → endActive()
Timer-->>DS: endActive()
DS->>SP: clear() → markStatus('ended')
DS->>DS: clearTimeout()
sequenceDiagram
participant CLI as cli status/stop
participant Control as runStatus / runStop
participant FS as state.json / findings.json
participant Proc as Server PID
CLI->>Control: runStatus(cwd)
Control->>FS: readState(statePath)
Control->>Proc: kill(pid, 0) — liveness check
Control->>FS: readFindings (live count)
Control-->>CLI: print summary
CLI->>Control: runStop(cwd)
Control->>FS: readState(statePath)
Control->>Proc: SIGTERM (if alive)
Control->>FS: markStatus('stopped')
Control-->>CLI: log result
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 13
🧹 Nitpick comments (3)
src/agent/belt/act.ts (1)
48-76: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winMake the act input schema strict at the MCP boundary.
z.object(...)will silently drop unknown operands here, so payloads like{ action: 'navigate', url: '...' }degrade into a laterrequires 'target'runtime error instead of failing at parse time.z.strictObject(...)preserves the flat shape while keeping the boundary fail-fast.As per coding guidelines, "Use Zod at every boundary, including config, MCP inputs, and findings."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/agent/belt/act.ts` around lines 48 - 76, The ActInputSchema in act.ts is too permissive because z.object(...) drops unknown fields, so invalid MCP payloads can pass parsing and fail later with misleading runtime errors. Make the boundary fail-fast by switching ActInputSchema to a strict Zod object shape, keeping the same action/target/text/key/direction/amount/networkIdle/timeout fields while rejecting unknown operands like url at parse time.Source: Coding guidelines
src/cli/control.test.ts (1)
59-69: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winCover the live-PID stop path too.
These tests only exercise the missing-state and dead-PID branches. Please add one test that stubs
process.killas alive and asserts both theSIGTERMattempt and thestoppedstate update.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cli/control.test.ts` around lines 59 - 69, Add a new stop-path test in control.test.ts that covers the live PID branch in runStop: stub process.kill so the target PID appears alive, then assert that the stop flow attempts SIGTERM and updates the persisted state to stopped. Reuse the existing runStop, seedState, and readState helpers so the new test sits alongside the current no-state and dead-server cases and validates the alive-process behavior end to end.src/main.ts (1)
17-29: 📐 Maintainability & Code Quality | 🔵 Trivial | 🏗️ Heavy liftMove CLI subcommand dispatch out of
src/main.ts.This file now owns both stdio-server bootstrap and control-CLI routing. Keeping
init/status/stopin a dedicated entrypoint preserves the repo’s intended seam and keepssrc/main.tssingle-purpose.As per coding guidelines, "
src/main.tsshould only boot the stdio MCP server."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/main.ts` around lines 17 - 29, Move the subcommand routing logic out of main and keep src/main.ts focused only on stdio MCP server startup. Extract the init/status/stop dispatch currently in main() into a dedicated CLI entrypoint and leave src/main.ts with just the server bootstrap path. Use the existing main() function and runInit/runStatus/runStop symbols to relocate the control-CLI handling without changing behavior.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@dummy/web/BUGS.md`:
- Around line 11-13: Clarify the React 18 StrictMode note in BUGS.md so it only
refers to duplicated mount effects/remounts in dev, not click handlers. Update
the wording near the StrictMode note to avoid implying the product-4 click error
is a StrictMode duplicate, and keep the guidance scoped to the behavior
triggered by mount effects/remounts.
In `@dummy/web/src/components/ProductCard.module.scss`:
- Line 62: The inline comment in ProductCard.module.scss violates
scss/double-slash-comment-empty-line-before because it is not preceded by a
blank line. Update the SCSS around the affected comment so there is an empty
line before the double-slash comment, keeping the existing styling intent
intact.
In `@dummy/web/src/components/ProductGrid.module.scss`:
- Line 23: The SCSS comment in ProductGrid.module.scss is violating stylelint
due to inconsistent spacing and missing the required empty line before a
double-slash comment. Update the comment formatting near the affected rule so it
has a blank line before the comment and uses consistent spacing, keeping the fix
localized to the stylesheet comment block.
In `@README.md`:
- Around line 116-117: The README still references the old command name, so
update the CLI usage text to use the actual binary name consistently. Search the
CLI intro and related sections around the ui-debugger init wording, and replace
those references with ui-debugger-mcp so the documented command matches the new
binary name everywhere it appears.
In `@src/adapters/browser/query.ts`:
- Around line 20-27: The pre-`text=` detection in `ROLE_NAME` and `CSS_LIKE` is
too permissive and is misclassifying normal labels before the plain-text
fallback in `query.ts`. Narrow the heuristics in the query parsing path so only
clearly intended role-name or CSS selectors are matched, and ensure ambiguous
inputs like quoted phrases or labels containing `>`/`~` still fall through to
text handling. Update the relevant selector classification logic around
`ROLE_NAME` and `CSS_LIKE` so `Sale "50% off"` and similar cases are not treated
as `role=` or CSS.
In `@src/agent/belt/act.ts`:
- Around line 140-143: The `type` branch in `act.ts` should fail fast on missing
input by validating `input.text` with `required()` before calling `resolve()`,
so a bad selector does not mask the real `requires 'text'` error and an
unnecessary adapter lookup is avoided. Update the `case 'type'` flow to check
`text` first, then resolve `input.target` and call `adapter.type(node, text)`
only after both required values are present.
In `@src/agent/belt/observe.ts`:
- Around line 144-152: withTargets() is generating target selectors from scoped
read results, but observe/act are not preserving that scope when replaying the
target. Update the observe.ts flow around withTargets(), readState(), and act so
that scoped reads (within/filters) either do not emit target or include enough
scope information for act to resolve the same element instead of using an
unscoped find({ query: target }).
In `@src/agent/prompts/web-addendum.ts`:
- Around line 53-59: The guidance in web-addendum.ts incorrectly says every node
from observe({kind:"tree"}) has a target, but withTargets() can leave target
undefined for unnamed, non-semantic nodes. Update the wording around the
observe({kind:"tree"})/act({action, target}) instructions to refer only to
actionable nodes, and keep the fallback sentence for nodes without target
consistent with that behavior.
In `@src/cli/control.ts`:
- Around line 25-32: The stop flow in runStop() should not trust state.pid
alone, because it may signal a reused PID and then incorrectly mark the run
stopped even if teardown failed. Persist an additional process identity in
state.json from the server startup path and verify it before calling
process.kill in isAlive() and runStop(); if the identity does not match, skip
signaling and treat it as a stale record. Update the error handling so only the
ESRCH race is considered benign, and make sure the stop status is only written
after a verified successful signal or a confirmed already-dead process.
In `@src/config/load.ts`:
- Around line 111-117: The loadWorkspaceDir helper is swallowing all config
parse/validation errors and falling back to DEFAULT_WORKSPACE, which hides
invalid .ui-debugger-mcp.json data. Keep the existing existsSync/missing-file
fallback, but change the try/catch around parseProject(readFileSync(...)) so
only a missing workspace field defaults while other config errors are surfaced
(either by validating just workspace here or by rethrowing the parseProject
error). Use loadWorkspaceDir and parseProject as the key symbols to update.
In `@src/mcp/tools/start-debug.ts`:
- Around line 48-63: The start-debug tool currently accepts any positive timeout
and converts it to milliseconds in the async handler, which can overflow Node’s
maximum timer delay. Add an upper bound to the Zod schema for timeout in
start-debug.ts so the value stays within safe seconds before it reaches
service.start and gets multiplied into timeoutMs. Use the existing timeout field
definition in the tool schema to enforce the cap at validation time.
In `@src/services/session-builder.ts`:
- Around line 167-169: The crash-path logging is still fire-and-forget, so the
failure details can be lost before agent.log is flushed. Update run() to await
the start/end/error appendLog writes instead of dropping them via logAgent(),
and make sure the error path waits for those writes before rethrowing or
returning. Use the existing logAgent helper and the run() try/catch/finally flow
to locate and replace the best-effort logging.
In `@src/session/state-file.ts`:
- Around line 113-120: The markStatus function is overwriting an existing
terminal state, causing a prior stopped breadcrumb to be replaced by ended
during later shutdown. Update markStatus so that when it is called with status
"ended" and readState(path) already returns a prior record with status
"stopped", it preserves the existing stopped value instead of writing ended.
Keep the current writeState flow for other cases, and use the existing
readState/markStatus/FileStatePort.clear path to locate the fix.
---
Nitpick comments:
In `@src/agent/belt/act.ts`:
- Around line 48-76: The ActInputSchema in act.ts is too permissive because
z.object(...) drops unknown fields, so invalid MCP payloads can pass parsing and
fail later with misleading runtime errors. Make the boundary fail-fast by
switching ActInputSchema to a strict Zod object shape, keeping the same
action/target/text/key/direction/amount/networkIdle/timeout fields while
rejecting unknown operands like url at parse time.
In `@src/cli/control.test.ts`:
- Around line 59-69: Add a new stop-path test in control.test.ts that covers the
live PID branch in runStop: stub process.kill so the target PID appears alive,
then assert that the stop flow attempts SIGTERM and updates the persisted state
to stopped. Reuse the existing runStop, seedState, and readState helpers so the
new test sits alongside the current no-state and dead-server cases and validates
the alive-process behavior end to end.
In `@src/main.ts`:
- Around line 17-29: Move the subcommand routing logic out of main and keep
src/main.ts focused only on stdio MCP server startup. Extract the
init/status/stop dispatch currently in main() into a dedicated CLI entrypoint
and leave src/main.ts with just the server bootstrap path. Use the existing
main() function and runInit/runStatus/runStop symbols to relocate the
control-CLI handling without changing behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: b8d94e92-3117-40eb-a8f8-98236c253e5e
⛔ Files ignored due to path filters (4)
dummy/web/bun.lockis excluded by!**/*.lockdummy/web/public/images/product-1.pngis excluded by!**/*.pngdummy/web/public/images/product-2.pngis excluded by!**/*.pngdummy/web/public/images/product-4.pngis excluded by!**/*.png
📒 Files selected for processing (59)
CLAUDE.mdREADME.mdbiome.jsondocs/claude/SKILL.mddocs/idea/adapters.mddocs/idea/agent-loop.mddocs/idea/architecture.mddocs/idea/config.mddocs/idea/desktop-control.mddocs/idea/mcp-tools.mddocs/idea/models.mddocs/idea/overview.mddocs/idea/workspace.mddummy/web/.ui-debugger-mcp.jsondummy/web/BUGS.mddummy/web/index.htmldummy/web/package.jsondummy/web/src/App.tsxdummy/web/src/components/Footer.module.scssdummy/web/src/components/Footer.tsxdummy/web/src/components/Header.module.scssdummy/web/src/components/Header.tsxdummy/web/src/components/Hero.module.scssdummy/web/src/components/Hero.tsxdummy/web/src/components/Newsletter.module.scssdummy/web/src/components/Newsletter.tsxdummy/web/src/components/ProductCard.module.scssdummy/web/src/components/ProductCard.tsxdummy/web/src/components/ProductGrid.module.scssdummy/web/src/components/ProductGrid.tsxdummy/web/src/main.tsxdummy/web/src/products.tsdummy/web/src/styles/global.scssdummy/web/src/styles/vars.scssdummy/web/src/vite-env.d.tsdummy/web/tsconfig.jsondummy/web/vite.config.tssrc/adapters/browser/browser-adapter.tssrc/adapters/browser/query.test.tssrc/adapters/browser/query.tssrc/agent/belt/act.test.tssrc/agent/belt/act.tssrc/agent/belt/observe.test.tssrc/agent/belt/observe.tssrc/agent/loop.test.tssrc/agent/loop.tssrc/agent/prompts/web-addendum.tssrc/cli/control.test.tssrc/cli/control.tssrc/config/load.tssrc/main.tssrc/mcp/tools/start-debug.tssrc/services/debug-service.test.tssrc/services/debug-service.tssrc/services/session-builder.tssrc/session/findings-store.test.tssrc/session/findings-store.tssrc/session/state-file.test.tssrc/session/state-file.ts
- query.ts: gate role= shorthand behind ARIA-role allow-list; narrow CSS combinator detection so labels like `Next >` / `A ~ B` stay text - observe.ts: omit `target` for scoped (within/filters) reads — act's unscoped replay could resolve a different element - act.ts: validate `text` before resolving the type target - control.ts: mark run stopped only after a verified SIGTERM; only the ESRCH race is benign - config/load.ts: surface invalid-config errors in loadWorkspaceDir, keep the missing-file fallback - start-debug.ts: cap timeout at 2_147_483s (Node timer ceiling) - session-builder.ts: await crash-path log writes so failures flush - state-file.ts: preserve terminal `stopped` over a later `ended` - web-addendum.ts + BUGS.md + README + SCSS comment/lint + docs Co-Authored-By: Claude <noreply@anthropic.com>
What & why
Driving the debug loop end-to-end against a real app (a planted-bug React fixture) surfaced blockers that unit tests couldn't — the inner agent never converged to a
report. This makes the fast-model driver (glm-5.2) actually work for web QA, and adds the run-control surface.Validated by a tool-scoped
claude -pblack-box QA run: a second Claude, allowed only themcp__ui-debugger__*tools (no source access), drove the debugger and returned a correct, prioritized fix plan — catching the dead buttons, no-op newsletter,/api/featured404 +App.tsx:31JSONSyntaxError, and the brokenlogo.png/product-3.png.Inner-loop fixes
act: flat input schema instead ofz.discriminatedUnion— theanyOfJSON-Schema made glm emit empty{}args, soactnever executed. Per-action requirements now enforced at runtime. (the make-or-break fix)normalizeQueryin the browser adapter maps LLM-natural targets (button "Add to cart", plain text) onto Playwrightrole=/text=engines, not just raw CSS.observe(tree)returns a readytargetper node (role+name or text,>> nth=for dupes) so the model copies a working selector — act success ~0% → ~100%.lookquestion hitENAMETOOLONGand killedlook.agent.logstep + tool-error logging.Run controls
start_debugtimeout — always capped (default 300s, overridable viatimeout); auto-ends and frees the profile.status/stopover a newstate.jsonbreadcrumb; SIGTERM/SIGINT end the run gracefully.Docs + fixtures
idea/→docs/idea/; adddocs/claude/SKILL.md(drivingclaudeheadless).dummy/from biome.dummy/web: deliberately-buggy React+SCSS QA fixture (answer key inBUGS.md).Known gap
The vision pass misses subtle pure-CSS issues (white-on-white invisible text, low-contrast) — a screenshot can't reveal invisible text; would need DOM computed-style contrast analysis (follow-up).
291 tests pass; typecheck + lint clean.
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Bug Fixes
Documentation