feat(cli): drive + agent + queue wait-for — automate boxes from a stateless CLI#10
Add three command families so one Claude Code (or any script) can pilot
another running inside an agentbox sandbox:
- `agentbox drive {snapshot, keypress, send-text, prompt, wait, resize}` —
provider-uniform tmux capture-pane / send-keys / resize-window via
`Provider.exec`. Zero in-box daemon work; works on docker / daytona /
hetzner identically. Reuses the keystroke DSL promoted out of the test
harness (`apps/cli/test/_harness/keys.ts` now re-exports from
`apps/cli/src/lib/drive/keys.ts`).
- `agentbox agent {state, wait-for, get-plan-question}` — surfaces Claude
Code's plan-mode-end and AskUserQuestion prompts. New `PreToolUse`
matchers in the baked managed-settings pipe the hook payload to
`agentbox-ctl claude-state --payload-stdin`, the reporter stores it, and
the host reads it back from `status.json`. Sticky-state semantics in the
reporter swallow the catchall PreToolUse `working` race and preserve
the plan/question payload through the AskUserQuestion-triggered
`Notification:permission_prompt` → `waiting` flicker; the matching
PostToolUse hook (`--clear-pending`) is the only legitimate cleanup.
- `agentbox queue wait-for <event>` — block on `new-box`, `empty-queue`,
`box-paused`, `box-running`, `box-stopped`, or `job-done <id>`. Polls
state/manifests directly; no new relay endpoint.
All new commands default to human text and accept `--json` for
automation consumers (matches existing CLI convention).
Verified end-to-end on a fresh docker box: drive snapshot/prompt/wait,
AskUserQuestion captured with full options payload while Claude was
parked at the picker, plan-mode round-trip with the actual plan body
read back via `agent get-plan-question`. 26 new unit tests; full
suite passes (322/323 + 1 skip).
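The `drive` family reduces to building tmux argument vectors and handing them to the provider's exec primitive. A minimal sketch of that mapping follows; `DriveAction` and `tmuxArgv` are hypothetical names chosen for illustration and are not the PR's actual API. Only the tmux subcommands and flags (`capture-pane -p`, `send-keys -l`, `resize-window -x/-y`) are real.

```typescript
// Hypothetical action type and argv builder (illustrative only; the real
// implementation in the PR may be shaped differently).
type DriveAction =
  | { kind: "snapshot" }
  | { kind: "keypress"; key: string }
  | { kind: "send-text"; text: string }
  | { kind: "resize"; cols: number; rows: number };

function tmuxArgv(session: string, action: DriveAction): string[] {
  switch (action.kind) {
    case "snapshot":
      // -p prints the pane contents to stdout instead of a paste buffer.
      return ["tmux", "capture-pane", "-p", "-t", session];
    case "keypress":
      // Key names such as C-c or Enter are interpreted by tmux itself.
      return ["tmux", "send-keys", "-t", session, action.key];
    case "send-text":
      // -l sends the string literally, so "Enter" is typed rather than pressed.
      return ["tmux", "send-keys", "-t", session, "-l", action.text];
    case "resize":
      return [
        "tmux", "resize-window", "-t", session,
        "-x", String(action.cols), "-y", String(action.rows),
      ];
  }
}
```

Because the result is a plain argv array, the same value could be passed to any provider's exec call, which is what would make the command provider-uniform under this assumption.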
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 5c6b446. Configure here.
 */
async function currentHeadCursor(relayUrl: string, boxId: string | undefined): Promise<number> {
  const events = await fetchEvents(relayUrl, 0, boxId);
  return events.length > 0 ? events[events.length - 1]!.id : 0;
Race condition between fast-path check and event subscription
Medium Severity
currentHeadCursor fetches all buffered events (since=0) and returns the last event's id as the starting cursor, but never passes those events through the caller's predicate. In agent wait-for, a state transition that fires between the readBoxStatus fast-path check and the currentHeadCursor call will be included in the fetched batch, its id used as the cursor, and thus skipped — never evaluated by the predicate. The command then hangs until the next periodic status push (~15 seconds) emits a duplicate event.
…eview)
Bugbot caught a real race in `apps/cli/src/lib/wait/events.ts`: between the `readBoxStatus` fast-path read and the head-cursor capture, the relay could broadcast a `box-status` event whose atomic `status.json` write was still in flight. The cursor was advanced past that event without ever evaluating the predicate, so `agent wait-for` would hang ~15s until the next periodic status push duplicated the transition.
Fix: drop the standalone `currentHeadCursor` helper. The first sweep now fetches `since=0`, evaluates the predicate against the LATEST event (so an already-broadcast transition is caught even if `status.json` hasn't been flushed yet), advances the cursor past it, then long-polls for further transitions. Older buffered events are intentionally skipped so a stale historical match doesn't trigger a false return: `wait-for` still means "matches now or in the future", not "matches anywhere in the recent past".
Also fix the CI typecheck failure: `Parameters<typeof Class>` returns `never` for a class constructor under strict TS; use `ConstructorParameters` instead.
3 new vitest cases in `apps/cli/test/wait-events.test.ts` cover the head race, the stale-historical skip, and the timeout path.
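The first-sweep rule described in this fix can be sketched as a pure function. This is a minimal illustration under stated assumptions: `RelayEvent`, `SweepResult`, and `firstSweep` are hypothetical names, not the code in `apps/cli/src/lib/wait/events.ts`, and the event shape is reduced to an id and a state string.

```typescript
// Hypothetical, simplified event shape (the real relay event carries more).
interface RelayEvent {
  id: number;
  state: string;
}

interface SweepResult {
  matched: RelayEvent | null; // the latest event, if it satisfies the predicate
  cursor: number;             // where the subsequent long-poll should resume
}

// Evaluate only the LATEST buffered event, then advance the cursor past it.
// Older events are skipped on purpose so a stale match cannot fire.
function firstSweep(
  events: RelayEvent[],
  predicate: (event: RelayEvent) => boolean,
): SweepResult {
  if (events.length === 0) {
    return { matched: null, cursor: 0 };
  }
  const latest = events[events.length - 1]!;
  return {
    matched: predicate(latest) ? latest : null,
    cursor: latest.id,
  };
}
```

With this shape, a transition that was broadcast just before the sweep is still seen because it is the latest event, while an older matching event followed by a newer non-matching one is ignored.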
The status-reporter test's `flushDebounce` slept a fixed 50ms after `reporter.flush()`, betting the async snapshot would resolve in time. The snapshot awaits three `probeAgentSession` calls — each spawns `tmux has-session`. On CI runners with no tmux installed, ENOENT can take 100ms+ to surface, so the assertion ran before the relay stub recorded any post. Poll until `relay.posted.length` grows (2s deadline) instead of guessing a sleep duration. Same idea as the existing `waitFor` helper in `packages/ctl/test/supervisor.test.ts`.
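The poll-with-deadline idea can be sketched as a small helper. This is an assumption-laden illustration: `pollUntil` and its parameters are hypothetical, and the repository's `waitFor` helper may differ in name, signature, and defaults.

```typescript
// Hypothetical helper: re-check a condition until it holds or a deadline
// passes, instead of sleeping for a fixed, guessed duration.
async function pollUntil(
  check: () => boolean,
  deadlineMs = 2000,
  intervalMs = 10,
): Promise<void> {
  const deadline = Date.now() + deadlineMs;
  while (!check()) {
    if (Date.now() > deadline) {
      throw new Error(`condition not met within ${deadlineMs}ms`);
    }
    // Yield to the event loop so pending async work (e.g. a post) can land.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

In a test, this would be used along the lines of `await pollUntil(() => relay.posted.length > before)`, so a slow runner only makes the test slower rather than making it fail.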
prepare left the builder for Vercel's reaper out of caution that delete might cascade to the snapshot. Verified live that it doesn't: a snapshot stays status:created (256MB) and boots a fresh sandbox after its source is deleted. prepare.ts now deletes the builder (step 8) best-effort, after the snapshot id is persisted, so a delete failure leaves at most a lingering sandbox, never a broken bake. Closes backlog #10.
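The ordering described here (persist first, then clean up best-effort) might look like the sketch below. All names are hypothetical; `finishPrepare` and its callback parameters stand in for whatever `prepare.ts` actually does.

```typescript
// Hypothetical sketch of the step ordering; not the actual prepare.ts code.
async function finishPrepare(
  persistSnapshotId: () => Promise<void>,
  deleteBuilder: () => Promise<void>,
  warn: (message: string) => void,
): Promise<void> {
  // Persist first: once this resolves, the bake is usable regardless of cleanup.
  await persistSnapshotId();
  try {
    await deleteBuilder();
  } catch (err) {
    // Best-effort: a failed delete only leaves a lingering sandbox behind.
    const reason = err instanceof Error ? err.message : String(err);
    warn(`builder delete failed: ${reason}`);
  }
}
```

A failure in `persistSnapshotId` still propagates, since that would be a genuinely broken bake; only the delete is swallowed.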


Summary
Three new command families so one Claude Code (or any script) can pilot another running inside an agentbox sandbox, like `agent-browser` / `playwright-cli` / the in-repo `pnpm drive` harness, but targeting the in-box tmux session and the in-box Claude Code state rather than a freshly spawned PTY.

- `agentbox drive {snapshot, keypress, send-text, prompt, wait, resize}`: provider-uniform `tmux capture-pane` / `send-keys` / `resize-window` via `Provider.exec`. Zero in-box daemon code; the docker / daytona / hetzner providers all light up through the existing exec primitive. The keystroke DSL was promoted out of `apps/cli/test/_harness/keys.ts` to `apps/cli/src/lib/drive/keys.ts` so the runtime CLI and the test harness share one implementation.
- `agentbox agent {state, wait-for, get-plan-question}`: surfaces Claude Code's plan-mode-end and AskUserQuestion prompts. New `PreToolUse:ExitPlanMode` and `PreToolUse:AskUserQuestion` matchers in the baked `claude-managed-settings.json` pipe the hook payload to `agentbox-ctl claude-state --payload-stdin`; the supervisor stores it; the host reads it back from `~/.agentbox/boxes/<id>/status.json`. Sticky-state semantics in the reporter swallow the catchall PreToolUse `working` race AND preserve the plan/question payload through the AskUserQuestion-triggered `Notification:permission_prompt` → `waiting` flicker. The matching `PostToolUse` hook (`--clear-pending`) is the only legitimate cleanup.
- `agentbox queue wait-for <event>`: block on `new-box | empty-queue | box-paused | box-running | box-stopped | job-done <id>`. Polls state / queue manifests directly; no new relay endpoint.

All new commands default to human text and accept `--json` for automation consumers (matches the existing CLI convention).

The full design is in `~/.claude/plans/implement-automation-commands-in-harmonic-aho.md`, written incrementally during plan mode and approved before implementation.

Test plan
- `pnpm -r typecheck` clean
- `pnpm lint` clean
- `pnpm -r test`: 322 passing + 1 skipped (cloud-e2e); 26 new unit tests:
  - `apps/cli/test/drive-keys.test.ts` (8): DSL round-trip table
  - `apps/cli/test/drive-tmux.test.ts` (17): tmux argv shape via stub `Provider.exec`
  - `apps/cli/test/agent-state.test.ts` (15): prompt-ready / waiting-flicker matchers
  - `packages/ctl/test/status-reporter.test.ts` (6): sticky-state semantics, payload persistence through `question → waiting`, `--clear-pending` cleanup
- `drive snapshot` / `keypress <C-c>` / `prompt` / `wait --text "391"` round-trip against a running claude TUI
- `agent wait-for question` matched in <1s after `drive prompt`-ing Claude to use `AskUserQuestion`; `get-plan-question` printed the full question + options while Claude was parked at the picker (state showed `waiting` underneath but the payload survived)
- `<shift+tab>` to plan mode, prompted for a plan, `agent wait-for end-plan` matched, `get-plan-question` printed the actual markdown plan body, Down+Enter to accept → state cleared via PostToolUse
- `queue wait-for empty-queue / box-running / box-stopped` all matched correctly
- `drive` should work the same since `Provider.exec` is uniform; `agent` requires the new managed-settings to be baked into those providers' images (same `claude-managed-settings.json`, but each provider's `prepare --provider` needs to be re-run after merge)

Note
Medium Risk
New in-box tmux control and Claude hook/state paths affect automation reliability; changes are additive with unit tests but depend on relay/events and image-baked hooks for full `agent` behavior.

Overview
Adds CLI automation so scripts can operate sandboxes without a new daemon: `agentbox drive` drives in-box tmux (snapshot, keystroke DSL, send text, prompt+Enter, wait for screen text, resize) via `Provider.exec`; `agentbox agent` reads/waits on Claude Code activity (`state`, `wait-for`, `get-plan-question`) using box status and relay `box-status` events; `agentbox queue wait-for` blocks on queue/box lifecycle events via polling.

Ctl / hooks: extends Claude activity with `end-plan`/`question`, captures ExitPlanMode plan and AskUserQuestion payloads from hook stdin (`--payload-stdin`), and uses sticky reporter semantics so catchall `working` and `waiting` flickers do not drop pending plan/question data until `PostToolUse` clears with `--clear-pending`. Baked `claude-managed-settings.json` gains matched Pre/PostToolUse hooks for those tools.

Keystroke DSL moves from the test harness into `lib/drive/keys.ts` (shared with tests). New commands support `--json` for automation; help registers `drive` under Access and `agent` under Inspect.
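The sticky reporter semantics summarized above can be modelled as a small state reducer. This is a sketch under assumptions: the type names, the `applyReport` function, and the exact rules are inferred from the prose in this PR and are not taken from `packages/ctl`; the real reporter may track more states and fields.

```typescript
// Hypothetical model of sticky plan/question state (illustrative only).
type Activity = "working" | "waiting" | "end-plan" | "question";

interface Pending {
  kind: "end-plan" | "question";
  payload: unknown;
}

interface ClaudeState {
  activity: Activity;
  pending: Pending | null;
}

interface Report {
  activity: Activity;
  payload?: unknown;
  clearPending?: boolean;
}

function applyReport(current: ClaudeState, report: Report): ClaudeState {
  // PostToolUse with --clear-pending is the only path that drops the payload.
  if (report.clearPending) {
    return { activity: report.activity, pending: null };
  }
  // A plan/question hook records its payload and becomes the visible state.
  if (report.activity === "end-plan" || report.activity === "question") {
    return {
      activity: report.activity,
      pending: { kind: report.activity, payload: report.payload },
    };
  }
  if (current.pending !== null) {
    // Swallow the catchall PreToolUse `working` report entirely.
    if (report.activity === "working") return current;
    // Let the `waiting` flicker show, but keep the payload underneath.
    return { activity: report.activity, pending: current.pending };
  }
  return { activity: report.activity, pending: null };
}
```

Under this model, a `question` followed by `working` and then `waiting` ends in `waiting` with the question payload intact, which matches the behavior reported in the test plan.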