
feat(cli): drive + agent + queue wait-for — automate boxes from a stateless CLI#10

Merged
madarco merged 3 commits into main from feat/drive-agent-automation on May 27, 2026

@madarco madarco commented May 27, 2026

Summary

Three new command families so one Claude Code (or any script) can pilot another running inside an agentbox sandbox — like agent-browser / playwright-cli / the in-repo pnpm drive harness, but targeting the in-box tmux session and the in-box Claude Code state rather than a freshly spawned PTY.

  • agentbox drive {snapshot, keypress, send-text, prompt, wait, resize} — provider-uniform tmux capture-pane / send-keys / resize-window via Provider.exec. Zero in-box daemon code; the docker / daytona / hetzner providers all light up through the existing exec primitive. The keystroke DSL was promoted out of apps/cli/test/_harness/keys.ts to apps/cli/src/lib/drive/keys.ts so the runtime CLI and the test harness share one implementation.
  • agentbox agent {state, wait-for, get-plan-question} — surfaces Claude Code's plan-mode-end and AskUserQuestion prompts. New PreToolUse:ExitPlanMode and PreToolUse:AskUserQuestion matchers in the baked claude-managed-settings.json pipe the hook payload to agentbox-ctl claude-state --payload-stdin; the supervisor stores it; the host reads it back from ~/.agentbox/boxes/<id>/status.json. Sticky-state semantics in the reporter swallow the catchall PreToolUse working race and preserve the plan/question payload through the AskUserQuestion-triggered Notification:permission_prompt → waiting flicker. The matching PostToolUse hook (--clear-pending) is the only legitimate cleanup.
  • agentbox queue wait-for <event> — block on new-box | empty-queue | box-paused | box-running | box-stopped | job-done <id>. Polls state / queue manifests directly; no new relay endpoint.
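
A minimal sketch of how the drive subcommands could map onto tmux argv. `ExecFn` is a hypothetical stand-in for `Provider.exec` (its real signature is not shown in this PR), and the function names and session argument are illustrative only. The tmux subcommands and flags themselves (`capture-pane -p`, `send-keys -l`, `resize-window -x/-y`) are standard tmux.

```typescript
// Hypothetical stand-in for Provider.exec; the real signature is not shown here.
type ExecFn = (argv: string[]) => Promise<void>;

// Print the visible pane contents to stdout (-p) for the target session (-t).
function snapshotArgv(session: string): string[] {
  return ["tmux", "capture-pane", "-p", "-t", session];
}

// Send one named key, e.g. "C-c" or "Enter".
function keypressArgv(session: string, key: string): string[] {
  return ["tmux", "send-keys", "-t", session, key];
}

// Send literal text: -l disables key-name lookup, and "--" ends option
// parsing so text that starts with "-" is not mistaken for a flag.
function sendTextArgv(session: string, text: string): string[] {
  return ["tmux", "send-keys", "-t", session, "-l", "--", text];
}

// Resize the window to the given number of columns and rows.
function resizeArgv(session: string, cols: number, rows: number): string[] {
  return ["tmux", "resize-window", "-t", session, "-x", String(cols), "-y", String(rows)];
}

// "prompt" is described in the summary as text followed by Enter.
async function sendPrompt(exec: ExecFn, session: string, text: string): Promise<void> {
  await exec(sendTextArgv(session, text));
  await exec(keypressArgv(session, "Enter"));
}
```

Because each helper only builds an argv array, the shape can be asserted against a stub `exec` without a running tmux, which is the approach the drive-tmux tests in the test plan describe.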

All new commands default to human text and accept --json for automation consumers (matches the existing CLI convention).

The full design is in ~/.claude/plans/implement-automation-commands-in-harmonic-aho.md — written incrementally during plan mode and approved before implementation.

Test plan

  • pnpm -r typecheck clean
  • pnpm lint clean
  • pnpm -r test — 322 passing + 1 skipped (cloud-e2e); 26 new unit tests:
    • apps/cli/test/drive-keys.test.ts (8) — DSL round-trip table
    • apps/cli/test/drive-tmux.test.ts (17) — tmux argv shape via stub Provider.exec
    • apps/cli/test/agent-state.test.ts (15) — prompt-ready / waiting-flicker matchers
    • packages/ctl/test/status-reporter.test.ts (6) — sticky-state semantics, payload persistence through question → waiting, --clear-pending cleanup
  • Manual end-to-end on docker:
    • drive snapshot / keypress <C-c> / prompt / wait --text "391" round-trip against a running claude TUI
    • agent wait-for question matched in <1s after drive prompt-ing Claude to use AskUserQuestion; get-plan-question printed the full question + options while Claude was parked at the picker (state showed waiting underneath but the payload survived)
    • Plan-mode round-trip: <shift+tab> to plan mode, prompted for a plan, agent wait-for end-plan matched, get-plan-question printed the actual markdown plan body, Down+Enter to accept → state cleared via PostToolUse
    • queue wait-for empty-queue / box-running / box-stopped all matched correctly
  • Manual on daytona / hetzner — deferred; drive should work the same since Provider.exec is uniform; agent requires the new managed-settings to be baked into those providers' images (same claude-managed-settings.json, but each provider's prepare --provider needs to be re-run after merge)

Note

Medium Risk
New in-box tmux control and Claude hook/state paths affect automation reliability; changes are additive with unit tests but depend on relay/events and image-baked hooks for full agent behavior.

Overview
Adds CLI automation so scripts can operate sandboxes without a new daemon: agentbox drive drives in-box tmux (snapshot, keystroke DSL, send text, prompt+Enter, wait for screen text, resize) via Provider.exec; agentbox agent reads/waits on Claude Code activity (state, wait-for, get-plan-question) using box status and relay box-status events; agentbox queue wait-for blocks on queue/box lifecycle events via polling.
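
The polling behaviour of `queue wait-for` can be sketched as a predicate over two snapshots plus a loop. Everything here is an assumption made for illustration: the `QueueSnapshot` shape, the rule that `box-*` events match when any box is in that state, and the use of a baseline snapshot to detect `new-box`. The real state and manifest formats are not shown in this PR.

```typescript
// Hypothetical snapshot of host-side state; the real manifest format is not shown here.
type BoxState = "running" | "paused" | "stopped";
interface QueueSnapshot {
  queuedJobs: string[];
  doneJobs: string[];
  boxes: Record<string, BoxState>;
}

type WaitEvent =
  | { kind: "new-box" }
  | { kind: "empty-queue" }
  | { kind: "box-paused" | "box-running" | "box-stopped" }
  | { kind: "job-done"; jobId: string };

// Decide whether the event holds, comparing against the snapshot taken at start.
function eventMatches(event: WaitEvent, baseline: QueueSnapshot, now: QueueSnapshot): boolean {
  switch (event.kind) {
    case "new-box":
      // A box counts as new only if it was absent from the baseline.
      return Object.keys(now.boxes).some((id) => !(id in baseline.boxes));
    case "empty-queue":
      return now.queuedJobs.length === 0;
    case "job-done":
      return now.doneJobs.includes(event.jobId);
    case "box-paused":
    case "box-running":
    case "box-stopped": {
      // Assumption: match when any box is in the requested state.
      const wanted = event.kind.slice("box-".length) as BoxState;
      return Object.values(now.boxes).some((state) => state === wanted);
    }
  }
}

// Poll a snapshot reader until the event matches or the timeout elapses.
async function waitForEvent(
  read: () => Promise<QueueSnapshot>,
  event: WaitEvent,
  intervalMs: number,
  timeoutMs: number,
): Promise<boolean> {
  const baseline = await read();
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const now = await read();
    if (eventMatches(event, baseline, now)) return true;
    if (Date.now() >= deadline) return false;
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```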

Ctl / hooks: extends Claude activity with end-plan / question, captures ExitPlanMode plan and AskUserQuestion payloads from hook stdin (--payload-stdin), and uses sticky reporter semantics so catchall working and waiting flickers do not drop pending plan/question data until PostToolUse clears with --clear-pending. Baked claude-managed-settings.json gains matched Pre/PostToolUse hooks for those tools.
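
The sticky reporter semantics can be illustrated with a small reducer. This is a sketch under stated assumptions, not the code in `packages/ctl`: the `BoxStatus` and `HookReport` shapes are hypothetical, and falling back to `working` after `--clear-pending` is a guess made for the example.

```typescript
type Activity = "idle" | "working" | "waiting" | "end-plan" | "question";

// Hypothetical status shape; the real status.json layout is not shown here.
interface BoxStatus {
  activity: Activity;
  pending?: { kind: "end-plan" | "question"; payload: unknown };
}

// Hypothetical report shape: either an activity update or a --clear-pending signal.
type HookReport =
  | { type: "activity"; activity: Activity; payload?: unknown }
  | { type: "clear-pending" };

function applyReport(prev: BoxStatus, report: HookReport): BoxStatus {
  if (report.type === "clear-pending") {
    // PostToolUse is the only path that drops the pending payload.
    // Assumption: a parked end-plan/question activity falls back to "working".
    const parked = prev.activity === "end-plan" || prev.activity === "question";
    return { activity: parked ? "working" : prev.activity };
  }
  const next = report.activity;
  if (next === "end-plan" || next === "question") {
    // Matched PreToolUse hook: record the activity together with its payload.
    return { activity: next, pending: { kind: next, payload: report.payload } };
  }
  if (prev.pending !== undefined) {
    // Swallow the catchall "working" report that races the matched hook.
    if (next === "working") return prev;
    // Other flickers (e.g. "waiting") change the activity but keep the payload.
    return { activity: next, pending: prev.pending };
  }
  return { activity: next };
}
```

Modelling the reporter as a pure function keeps the ordering cases (question, then working, then waiting, then clear) easy to enumerate in a table-driven test.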

Keystroke DSL moves from the test harness into lib/drive/keys.ts (shared with tests). New commands support --json for automation; help registers drive under Access and agent under Inspect.
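
For readers unfamiliar with this kind of DSL, here is a rough sketch of a parser that splits an input such as `<C-c>hello<shift+tab>` into named keys and literal text. The grammar, the `KeyToken` type, and the alias table are assumptions for illustration; the actual grammar in `lib/drive/keys.ts` is not shown in this PR. `BTab` is tmux's key name for shift+tab.

```typescript
// Hypothetical token type: a named key or a run of literal text.
type KeyToken = { kind: "key"; name: string } | { kind: "text"; value: string };

// Assumed aliases from DSL spellings to tmux key names.
const KEY_ALIASES = new Map<string, string>([
  ["shift+tab", "BTab"],
  ["enter", "Enter"],
  ["esc", "Escape"],
  ["down", "Down"],
  ["up", "Up"],
]);

// Split the input on <...> groups; text outside the brackets stays literal.
function parseKeys(input: string): KeyToken[] {
  const tokens: KeyToken[] = [];
  const pattern = /<([^<>]+)>/g;
  let last = 0;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(input)) !== null) {
    if (match.index > last) {
      tokens.push({ kind: "text", value: input.slice(last, match.index) });
    }
    const raw = match[1] ?? "";
    // Unknown names such as "C-c" pass through unchanged.
    tokens.push({ kind: "key", name: KEY_ALIASES.get(raw.toLowerCase()) ?? raw });
    last = pattern.lastIndex;
  }
  if (last < input.length) {
    tokens.push({ kind: "text", value: input.slice(last) });
  }
  return tokens;
}
```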

Reviewed by Cursor Bugbot for commit 5c6b446.

…box from a stateless CLI

Add three command families so one Claude Code (or any script) can pilot
another running inside an agentbox sandbox:

- `agentbox drive {snapshot, keypress, send-text, prompt, wait, resize}` —
  provider-uniform tmux capture-pane / send-keys / resize-window via
  `Provider.exec`. Zero in-box daemon work; works on docker / daytona /
  hetzner identically. Reuses the keystroke DSL promoted out of the test
  harness (`apps/cli/test/_harness/keys.ts` now re-exports from
  `apps/cli/src/lib/drive/keys.ts`).

- `agentbox agent {state, wait-for, get-plan-question}` — surfaces Claude
  Code's plan-mode-end and AskUserQuestion prompts. New `PreToolUse`
  matchers in the baked managed-settings pipe the hook payload to
  `agentbox-ctl claude-state --payload-stdin`, the reporter stores it, and
  the host reads it back from `status.json`. Sticky-state semantics in the
  reporter swallow the catchall PreToolUse `working` race and preserve
  the plan/question payload through the AskUserQuestion-triggered
  `Notification:permission_prompt` → `waiting` flicker; the matching
  PostToolUse hook (`--clear-pending`) is the only legitimate cleanup.

- `agentbox queue wait-for <event>` — block on `new-box`, `empty-queue`,
  `box-paused`, `box-running`, `box-stopped`, or `job-done <id>`. Polls
  state/manifests directly; no new relay endpoint.

All new commands default to human text and accept `--json` for
automation consumers (matches existing CLI convention).

Verified end-to-end on a fresh docker box: drive snapshot/prompt/wait,
AskUserQuestion captured with full options payload while Claude was
parked at the picker, plan-mode round-trip with the actual plan body
read back via `agent get-plan-question`. 26 new unit tests; full
suite passes (322 passing, 1 skipped).

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Comment thread apps/cli/src/lib/wait/events.ts Outdated
*/
async function currentHeadCursor(relayUrl: string, boxId: string | undefined): Promise<number> {
  const events = await fetchEvents(relayUrl, 0, boxId);
  return events.length > 0 ? events[events.length - 1]!.id : 0;

Race condition between fast-path check and event subscription

Medium Severity

currentHeadCursor fetches all buffered events (since=0) and returns the last event's id as the starting cursor, but never passes those events through the caller's predicate. In agent wait-for, a state transition that fires between the readBoxStatus fast-path check and the currentHeadCursor call will be included in the fetched batch, its id used as the cursor, and thus skipped — never evaluated by the predicate. The command then hangs until the next periodic status push (~15 seconds) emits a duplicate event.


madarco added 2 commits May 27, 2026 23:48
…eview)

Bugbot caught a real race in `apps/cli/src/lib/wait/events.ts`: between the
`readBoxStatus` fast-path read and the head-cursor capture, the relay could
broadcast a `box-status` event whose atomic `status.json` write was still
in flight. The cursor was advanced past that event without ever evaluating
the predicate, so `agent wait-for` would hang ~15s until the next periodic
status push duplicated the transition.

Fix: drop the standalone `currentHeadCursor` helper. The first sweep now
fetches `since=0`, evaluates the predicate against the LATEST event (so an
already-broadcast transition is caught even if `status.json` hasn't been
flushed yet), advances the cursor past it, then long-polls for further
transitions. Older buffered events are intentionally skipped so a stale
historical match doesn't trigger a false return — `wait-for` still means
"matches now or in the future", not "matches anywhere in the recent past".
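
The fixed first sweep described above can be sketched roughly as follows. The `RelayEvent` shape and the `FetchEvents` callback are hypothetical stand-ins for the relay client, and a bounded `maxPolls` replaces the real timeout to keep the example short.

```typescript
// Hypothetical event shape and fetcher; the real relay client is not shown here.
interface RelayEvent {
  id: number;
  type: string;
}
type FetchEvents = (since: number) => Promise<RelayEvent[]>;

async function waitForMatch(
  fetchEvents: FetchEvents,
  predicate: (event: RelayEvent) => boolean,
  maxPolls: number,
): Promise<RelayEvent | null> {
  let cursor = 0;

  // First sweep: fetch everything buffered, but evaluate only the latest event.
  // Older events are skipped so a stale historical match cannot return early.
  const buffered = await fetchEvents(0);
  const head = buffered[buffered.length - 1];
  if (head !== undefined) {
    if (predicate(head)) return head;
    cursor = head.id;
  }

  // Later sweeps: every event past the cursor goes through the predicate.
  for (let i = 0; i < maxPolls; i++) {
    const batch = await fetchEvents(cursor);
    for (const event of batch) {
      if (predicate(event)) return event;
      cursor = Math.max(cursor, event.id);
    }
  }
  return null; // stands in for the timeout path
}
```

The key difference from the removed `currentHeadCursor` helper is that the head event is passed through the predicate before the cursor moves past it.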

Also fix the CI typecheck failure: `Parameters<typeof Class>` returns `never`
for a class constructor under strict TS — use `ConstructorParameters` instead.
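
As a short illustration of the typecheck fix: `Parameters` is defined for function types, while `ConstructorParameters` is the built-in utility for constructor types. The `StatusReporter` class and `makeReporter` helper below are hypothetical and exist only for the example.

```typescript
// Hypothetical class, used only to illustrate the utility type.
class StatusReporter {
  constructor(
    public readonly relayUrl: string,
    public readonly debounceMs: number,
  ) {}
}

// Tuple of the constructor's parameter types: [relayUrl: string, debounceMs: number].
type StatusReporterArgs = ConstructorParameters<typeof StatusReporter>;

// Forward the constructor arguments without restating their types.
function makeReporter(...args: StatusReporterArgs): StatusReporter {
  return new StatusReporter(...args);
}
```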

3 new vitest cases in `apps/cli/test/wait-events.test.ts` cover the head
race, the stale-historical skip, and the timeout path.

The status-reporter test's `flushDebounce` slept a fixed 50ms after
`reporter.flush()`, betting the async snapshot would resolve in time.
The snapshot awaits three `probeAgentSession` calls — each spawns
`tmux has-session`. On CI runners with no tmux installed, ENOENT can
take 100ms+ to surface, so the assertion ran before the relay stub
recorded any post.

Poll until `relay.posted.length` grows (2s deadline) instead of guessing
a sleep duration. Same idea as the existing `waitFor` helper in
`packages/ctl/test/supervisor.test.ts`.
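
A poll-until helper of this kind might look like the sketch below. It is not the helper from `supervisor.test.ts`, whose source is not shown here; the name, defaults, and error message are assumptions.

```typescript
// Poll a condition until it holds or the deadline passes, instead of sleeping
// for a fixed duration and hoping the async work has finished.
async function waitFor(
  condition: () => boolean,
  deadlineMs: number = 2000,
  intervalMs: number = 10,
): Promise<void> {
  const deadline = Date.now() + deadlineMs;
  while (!condition()) {
    if (Date.now() > deadline) {
      throw new Error(`condition not met within ${deadlineMs}ms`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

In a test this would be used as `await waitFor(() => relay.posted.length > before)`, so a slow runner only makes the test slower rather than flaky.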
@madarco madarco merged commit 9f1b279 into main May 27, 2026
1 check passed
madarco added a commit that referenced this pull request May 28, 2026
prepare left the builder for Vercel's reaper out of caution that delete might
cascade to the snapshot. Verified live that it doesn't: a snapshot stays
status:created (256MB) and boots a fresh sandbox after its source is deleted.
prepare.ts now deletes the builder (step 8) best-effort, after the snapshot id
is persisted, so a delete failure leaves at most a lingering sandbox, never a
broken bake.

Closes backlog #10.