feat(cli): `drive` + `agent` + `queue wait-for` — automate boxes from a stateless CLI by madarco · Pull Request #10 · madarco/agentbox

madarco · 2026-05-27T21:40:52Z

Summary

Three new command families so one Claude Code (or any script) can pilot another running inside an agentbox sandbox — like agent-browser / playwright-cli / the in-repo pnpm drive harness, but targeting the in-box tmux session and the in-box Claude Code state rather than a freshly spawned PTY.

- `agentbox drive {snapshot, keypress, send-text, prompt, wait, resize}`: provider-uniform tmux capture-pane / send-keys / resize-window via `Provider.exec`. Zero in-box daemon code; the docker / daytona / hetzner providers all light up through the existing exec primitive. The keystroke DSL was promoted out of `apps/cli/test/_harness/keys.ts` to `apps/cli/src/lib/drive/keys.ts` so the runtime CLI and the test harness share one implementation.
- `agentbox agent {state, wait-for, get-plan-question}`: surfaces Claude Code's plan-mode-end and AskUserQuestion prompts. New `PreToolUse:ExitPlanMode` and `PreToolUse:AskUserQuestion` matchers in the baked `claude-managed-settings.json` pipe the hook payload to `agentbox-ctl claude-state --payload-stdin`; the supervisor stores it; the host reads it back from `~/.agentbox/boxes/<id>/status.json`. Sticky-state semantics in the reporter swallow the catchall PreToolUse `working` race AND preserve the plan/question payload through the AskUserQuestion-triggered `Notification:permission_prompt` → `waiting` flicker. The matching PostToolUse hook (`--clear-pending`) is the only legitimate cleanup.
- `agentbox queue wait-for <event>`: block on `new-box` | `empty-queue` | `box-paused` | `box-running` | `box-stopped` | `job-done <id>`. Polls state / queue manifests directly; no new relay endpoint.

All new commands default to human text and accept --json for automation consumers (matches the existing CLI convention).

The full design is in ~/.claude/plans/implement-automation-commands-in-harmonic-aho.md — written incrementally during plan mode and approved before implementation.

Test plan

- `pnpm -r typecheck` clean
- `pnpm lint` clean
- `pnpm -r test`: 322 passing + 1 skipped (cloud-e2e); 26 new unit tests:
  - `apps/cli/test/drive-keys.test.ts` (8): DSL round-trip table
  - `apps/cli/test/drive-tmux.test.ts` (17): tmux argv shape via stub `Provider.exec`
  - `apps/cli/test/agent-state.test.ts` (15): prompt-ready / waiting-flicker matchers
  - `packages/ctl/test/status-reporter.test.ts` (6): sticky-state semantics, payload persistence through question → waiting, `--clear-pending` cleanup
- Manual end-to-end on docker:
  - `drive snapshot` / `keypress <C-c>` / `prompt` / `wait --text "391"` round-trip against a running claude TUI
  - `agent wait-for question` matched in <1s after using `drive prompt` to make Claude call AskUserQuestion; `get-plan-question` printed the full question + options while Claude was parked at the picker (state showed `waiting` underneath but the payload survived)
  - Plan-mode round-trip: `<shift+tab>` to enter plan mode, prompted for a plan, `agent wait-for end-plan` matched, `get-plan-question` printed the actual markdown plan body, Down+Enter to accept → state cleared via PostToolUse
  - `queue wait-for empty-queue` / `box-running` / `box-stopped` all matched correctly
- Manual on daytona / hetzner: deferred. `drive` should work the same since `Provider.exec` is uniform; `agent` requires the new managed-settings to be baked into those providers' images (same `claude-managed-settings.json`, but each provider's `prepare --provider` needs to be re-run after merge)

Note

Medium Risk
New in-box tmux control and Claude hook/state paths affect automation reliability; changes are additive with unit tests but depend on relay/events and image-baked hooks for full agent behavior.

Overview
Adds CLI automation so scripts can operate sandboxes without a new daemon: agentbox drive drives in-box tmux (snapshot, keystroke DSL, send text, prompt+Enter, wait for screen text, resize) via Provider.exec; agentbox agent reads/waits on Claude Code activity (state, wait-for, get-plan-question) using box status and relay box-status events; agentbox queue wait-for blocks on queue/box lifecycle events via polling.
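To make the "tmux via an exec primitive" idea concrete, here is a minimal sketch of how the `drive` subcommands could reduce to plain tmux argv arrays. The `Exec` type, the function names, and the session name used in the usage note are assumptions for illustration and are not taken from the PR's code; only the tmux subcommands and flags (`capture-pane -p`, `send-keys -l`, `resize-window -x/-y`) are standard tmux.

```typescript
// Hypothetical exec primitive: runs argv inside the box and resolves with stdout.
type Exec = (argv: string[]) => Promise<string>;

// `capture-pane -p` prints the pane contents to stdout instead of a paste buffer.
function snapshotArgv(session: string): string[] {
  return ["tmux", "capture-pane", "-p", "-t", session];
}

// `send-keys -l` sends the text literally; `--` stops option parsing so text
// that starts with "-" is not mistaken for a flag.
function sendTextArgv(session: string, text: string): string[] {
  return ["tmux", "send-keys", "-t", session, "-l", "--", text];
}

// Without `-l`, each argument is looked up as a tmux key name ("C-c", "Enter", ...).
function keypressArgv(session: string, keys: string[]): string[] {
  return ["tmux", "send-keys", "-t", session, ...keys];
}

function resizeArgv(session: string, cols: number, rows: number): string[] {
  return ["tmux", "resize-window", "-t", session, "-x", String(cols), "-y", String(rows)];
}

// Poll the captured screen until it contains `text` or the deadline passes.
async function waitForText(
  exec: Exec,
  session: string,
  text: string,
  timeoutMs = 5000,
  intervalMs = 100,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const screen = await exec(snapshotArgv(session));
    if (screen.includes(text)) return true;
    if (Date.now() >= deadline) return false;
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Because every helper only builds an argv array, a stub `Exec` is enough to unit-test the argv shape without a running tmux, which is the same approach the PR's test plan describes for `drive-tmux.test.ts`.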

Ctl / hooks: extends Claude activity with end-plan / question, captures ExitPlanMode plan and AskUserQuestion payloads from hook stdin (--payload-stdin), and uses sticky reporter semantics so catchall working and waiting flickers do not drop pending plan/question data until PostToolUse clears with --clear-pending. Baked claude-managed-settings.json gains matched Pre/PostToolUse hooks for those tools.
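The sticky semantics described above can be modelled as a small state reducer. This is a sketch under stated assumptions rather than the reporter's actual code: the `ReporterState` and `Report` shapes, the `applyReport` name, and the choice to fall back to `working` after a clear are all hypothetical.

```typescript
// Hypothetical activity names; "end-plan" and "question" carry a payload.
type Activity = "idle" | "working" | "waiting" | "end-plan" | "question";
type PendingKind = "end-plan" | "question";

interface ReporterState {
  activity: Activity;
  pending?: { kind: PendingKind; payload: unknown };
}

type Report =
  | { type: "activity"; activity: "idle" | "working" | "waiting" }
  | { type: "pending"; kind: PendingKind; payload: unknown } // --payload-stdin
  | { type: "clear-pending" };                               // --clear-pending

function applyReport(state: ReporterState, report: Report): ReporterState {
  switch (report.type) {
    case "pending":
      // A matched PreToolUse hook stores the plan or question payload.
      return {
        activity: report.kind,
        pending: { kind: report.kind, payload: report.payload },
      };
    case "clear-pending":
      // Only the matching PostToolUse hook drops the payload.
      // Assumption: the agent is considered "working" again afterwards.
      return { activity: "working" };
    case "activity":
      if (state.pending === undefined) return { activity: report.activity };
      // Sticky: the catchall PreToolUse "working" report is swallowed.
      if (report.activity === "working") return state;
      // Flicker: the visible activity may change, but the payload survives.
      return { activity: report.activity, pending: state.pending };
  }
}
```

With this shape, a `question` report followed by a catchall `working` and then a `waiting` flicker still leaves the question payload readable, which matches the behaviour noted in the manual test where the state showed `waiting` while the payload survived.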

Keystroke DSL moves from the test harness into lib/drive/keys.ts (shared with tests). New commands support --json for automation; help registers drive under Access and agent under Inspect.

Reviewed by Cursor Bugbot for commit 5c6b446.

…box from a stateless CLI

Add three command families so one Claude Code (or any script) can pilot another running inside an agentbox sandbox:

- `agentbox drive {snapshot, keypress, send-text, prompt, wait, resize}`: provider-uniform tmux capture-pane / send-keys / resize-window via `Provider.exec`. Zero in-box daemon work; works on docker / daytona / hetzner identically. Reuses the keystroke DSL promoted out of the test harness (`apps/cli/test/_harness/keys.ts` now re-exports from `apps/cli/src/lib/drive/keys.ts`).
- `agentbox agent {state, wait-for, get-plan-question}`: surfaces Claude Code's plan-mode-end and AskUserQuestion prompts. New `PreToolUse` matchers in the baked managed-settings pipe the hook payload to `agentbox-ctl claude-state --payload-stdin`, the reporter stores it, and the host reads it back from `status.json`. Sticky-state semantics in the reporter swallow the catchall PreToolUse `working` race and preserve the plan/question payload through the AskUserQuestion-triggered `Notification:permission_prompt` → `waiting` flicker; the matching PostToolUse hook (`--clear-pending`) is the only legitimate cleanup.
- `agentbox queue wait-for <event>`: block on `new-box`, `empty-queue`, `box-paused`, `box-running`, `box-stopped`, or `job-done <id>`. Polls state/manifests directly; no new relay endpoint.

All new commands default to human text and accept `--json` for automation consumers (matches existing CLI convention).

Verified end-to-end on a fresh docker box: drive snapshot/prompt/wait, AskUserQuestion captured with full options payload while Claude was parked at the picker, plan-mode round-trip with the actual plan body read back via `agent get-plan-question`.

26 new unit tests; full suite passes (322/323 + 1 skip).

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.


cursor · 2026-05-27T21:45:05Z

+ */
+async function currentHeadCursor(relayUrl: string, boxId: string | undefined): Promise<number> {
+  const events = await fetchEvents(relayUrl, 0, boxId);
+  return events.length > 0 ? events[events.length - 1]!.id : 0;


Race condition between fast-path check and event subscription

Medium Severity

currentHeadCursor fetches all buffered events (since=0) and returns the last event's id as the starting cursor, but never passes those events through the caller's predicate. In agent wait-for, a state transition that fires between the readBoxStatus fast-path check and the currentHeadCursor call will be included in the fetched batch, its id used as the cursor, and thus skipped — never evaluated by the predicate. The command then hangs until the next periodic status push (~15 seconds) emits a duplicate event.

Additional Locations (1)

apps/cli/src/commands/agent.ts#L75-L93


…eview)

Bugbot caught a real race in `apps/cli/src/lib/wait/events.ts`: between the `readBoxStatus` fast-path read and the head-cursor capture, the relay could broadcast a `box-status` event whose atomic `status.json` write was still in flight. The cursor was advanced past that event without ever evaluating the predicate, so `agent wait-for` would hang ~15s until the next periodic status push duplicated the transition.

Fix: drop the standalone `currentHeadCursor` helper. The first sweep now fetches `since=0`, evaluates the predicate against the LATEST event (so an already-broadcast transition is caught even if `status.json` hasn't been flushed yet), advances the cursor past it, then long-polls for further transitions. Older buffered events are intentionally skipped so a stale historical match doesn't trigger a false return: `wait-for` still means "matches now or in the future", not "matches anywhere in the recent past".

Also fix the CI typecheck failure: `Parameters<typeof Class>` returns `never` for a class constructor under strict TS; use `ConstructorParameters` instead.

3 new vitest cases in `apps/cli/test/wait-events.test.ts` cover the head race, the stale-historical skip, and the timeout path.
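The first-sweep behaviour in that fix can be sketched as follows. This is an illustration under assumptions, not the code in `apps/cli/src/lib/wait/events.ts`: the `RelayEvent` shape, the `FetchEvents` signature, the `waitForMatch` name, and the use of a poll count in place of a wall-clock timeout are all hypothetical.

```typescript
// Hypothetical event shape; only `id` is known from the reviewed snippet.
interface RelayEvent {
  id: number;
  state: string;
}

// Hypothetical fetcher: resolves with events whose id > since, oldest first.
// A real long-poll would block until events arrive or the poll times out.
type FetchEvents = (since: number) => Promise<RelayEvent[]>;

async function waitForMatch(
  fetchEvents: FetchEvents,
  predicate: (event: RelayEvent) => boolean,
  maxPolls: number,
): Promise<RelayEvent | undefined> {
  // First sweep: look at everything buffered, but evaluate only the latest
  // event so an already-broadcast transition is not skipped.
  const buffered = await fetchEvents(0);
  let cursor = 0;
  if (buffered.length > 0) {
    const latest = buffered[buffered.length - 1]!;
    if (predicate(latest)) return latest;
    // Older events are deliberately ignored: they are history, not "now".
    cursor = latest.id;
  }

  // Subsequent sweeps: evaluate every new event in order.
  for (let poll = 0; poll < maxPolls; poll++) {
    const fresh = await fetchEvents(cursor);
    for (const event of fresh) {
      if (predicate(event)) return event;
      cursor = event.id;
    }
  }
  return undefined; // treated as a timeout by the caller
}
```

The key difference from the original helper is that the head event is passed through the predicate before the cursor moves past it, so there is no window in which a transition is both fetched and never evaluated.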

The status-reporter test's `flushDebounce` slept a fixed 50ms after `reporter.flush()`, betting the async snapshot would resolve in time. The snapshot awaits three `probeAgentSession` calls — each spawns `tmux has-session`. On CI runners with no tmux installed, ENOENT can take 100ms+ to surface, so the assertion ran before the relay stub recorded any post. Poll until `relay.posted.length` grows (2s deadline) instead of guessing a sleep duration. Same idea as the existing `waitFor` helper in `packages/ctl/test/supervisor.test.ts`.
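The "poll until the condition holds, with a deadline" idea from that commit can be sketched like this. The `pollUntil` name and its default intervals are assumptions for illustration; it is not the `waitFor` helper from `packages/ctl/test/supervisor.test.ts`, only a helper in the same spirit.

```typescript
// Re-check `condition` every `intervalMs` until it holds; reject if it still
// does not hold once `timeoutMs` has elapsed.
async function pollUntil(
  condition: () => boolean,
  timeoutMs = 2000,
  intervalMs = 10,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (!condition()) {
    if (Date.now() >= deadline) {
      throw new Error(`condition not met within ${timeoutMs}ms`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

In a test this would replace a fixed sleep with something like `await pollUntil(() => relay.posted.length > before)`, so a slow runner costs extra time instead of a flaky failure, while a fast runner returns as soon as the post is recorded.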

prepare left the builder for Vercel's reaper out of caution that delete might cascade to the snapshot. Verified live that it doesn't: a snapshot stays status:created (256MB) and boots a fresh sandbox after its source is deleted. prepare.ts now deletes the builder (step 8) best-effort, after the snapshot id is persisted, so a delete failure leaves at most a lingering sandbox, never a broken bake. Closes backlog #10.

cursor Bot reviewed May 27, 2026


madarco added 2 commits May 27, 2026 23:48

madarco merged commit 9f1b279 into main May 27, 2026
1 check passed

madarco mentioned this pull request May 27, 2026

feat(state): complete activity-state coverage for Claude / Codex / OpenCode #11

Merged

6 tasks

madarco mentioned this pull request May 29, 2026

Vercel provider backlog: 9 items done + plans for the rest #22

Merged

3 tasks

madarco merged 3 commits into main from feat/drive-agent-automation
