Feature: Replace Alpine with Cloudflare OpenCode Orchestrator#13
Feature: Replace Alpine with Cloudflare OpenCode Orchestrator#13mkrueger12 wants to merge 4 commits into
Conversation
There was a problem hiding this comment.
🔴 YAML partial override of sandbox section silently zeroes auto_teardown default from true to false
When an alpine.yaml file specifies the sandbox: section with only some fields (e.g., just image_profile), yaml.Unmarshal replaces the entire SandboxConfig struct, zeroing the AutoTeardown field from its default true to false.
Root Cause and Impact
In cmd/alpine/config.go:57-96, loadConfig() initializes defaults including AutoTeardown: true, then calls yaml.Unmarshal(data, cfg). When the YAML contains a sandbox: key, the YAML library replaces the entire SandboxConfig struct value. Fields not present in the YAML get their zero values: AutoTeardown becomes false (bool zero), CompletedRetentionMins becomes 0 (int zero).
For example, this alpine.yaml:
sandbox:
image_profile: heavy
image_profiles:
heavy: opencode-heavy
web_base_url: https://sandbox.example.comwould silently set AutoTeardown = false and CompletedRetentionMins = 0, even though the user never explicitly disabled auto-teardown. The validation at cmd/alpine/config.go:124 only checks CompletedRetentionMins < 0, so 0 passes. There is no validation for AutoTeardown since false is a valid value.
Impact: Users who partially configure the sandbox section without explicitly setting auto_teardown: true will silently lose the auto-teardown behavior, which is the documented default. This is a surprising behavior change that could lead to sandbox resource leaks.
(Refers to lines 88-89)
Prompt for agents
In cmd/alpine/config.go, the loadConfig() function initializes defaults in the Config struct and then calls yaml.Unmarshal which replaces entire sub-structs when the corresponding YAML key is present. This causes boolean defaults like AutoTeardown (default true) to silently become false when the user only partially specifies the sandbox section. To fix this, either: (1) Apply defaults AFTER unmarshaling by checking zero values and setting defaults for fields like AutoTeardown and CompletedRetentionMins, or (2) Use a two-pass approach where you unmarshal into a separate struct with pointer fields to distinguish between 'not set' and 'set to zero/false', then merge with defaults. The simplest approach is option 1: after yaml.Unmarshal, re-apply defaults for fields that have zero values but should have non-zero defaults.
Was this helpful? React with 👍 or 👎 to provide feedback.
| rec.ExportLock = true | ||
| rec.OperationLock = o.newOperationLock("export") | ||
| rec.LastExportBranch = branch | ||
| rec.LastExportAt = now | ||
| rec.LastActivityAt = now | ||
| rec.UpdatedAt = now | ||
| rec.ExportLock = false |
There was a problem hiding this comment.
🔴 ExportLock is set to true then immediately set to false before persisting, making it ineffective
In the export method, rec.ExportLock is set to true on line 365 and then immediately set back to false on line 371, before saveStore is called on line 385. The lock is never actually persisted to the state file.
Root Cause and Impact
The code at cmd/alpine/orchestrator.go:365-371:
rec.ExportLock = true
rec.OperationLock = o.newOperationLock("export")
rec.LastExportBranch = branch
rec.LastExportAt = now
rec.LastActivityAt = now
rec.UpdatedAt = now
rec.ExportLock = falseThe ExportLock field is documented as a precondition for auto-teardown (the plan states: "Auto-teardown preconditions: completion signal detected, durable save verified, no active export lock"). Since the lock is never actually persisted as true, it cannot serve as a guard against concurrent teardown during export. The auto-teardown code at cmd/alpine/orchestrator.go:374 checks rec.State == stateCompleted && o.config.Sandbox.AutoTeardown but the export lock that should protect against premature teardown is always false.
Impact: The export lock mechanism is completely broken. Any concurrent operation or auto-teardown policy that checks ExportLock will never see it as true, defeating the purpose of the lock.
Prompt for agents
In cmd/alpine/orchestrator.go, the export method at lines 365-371 sets ExportLock = true and then immediately sets it back to false before saveStore is called. The lock should be set to true before the export operation begins, persisted, and then set to false after the export completes (or at least after the auto-teardown decision is made). Move the ExportLock = false assignment to after the auto-teardown logic block (after line 383), so the lock is actually effective during the export window. Alternatively, if the intent is to hold the lock only during the export operation, consider saving the store twice: once with ExportLock=true before the export, and once with ExportLock=false after.
Was this helpful? React with 👍 or 👎 to provide feedback.
| rec.OperationLock = o.newOperationLock("teardown") | ||
| rec.UpdatedAt = now | ||
| rec.LastActivityAt = now | ||
| rec.State = stateSaving |
There was a problem hiding this comment.
🟡 Teardown bypasses lifecycle state machine validation when transitioning to stateSaving
The teardown method at line 431 directly sets rec.State = stateSaving without calling transitionLifecycle(), allowing invalid state transitions that violate the defined lifecycle state machine.
Detailed Explanation
The lifecycle state machine in cmd/alpine/state.go:18-55 defines valid transitions. Several states that can reach line 431 in teardown do NOT have stateSaving as a valid target:
stateCompleted→stateSaving: not valid (onlystateRunning,stateTearingDown,stateErrorare valid)stateError→stateSaving: not valid (onlystateRunningis valid)stateTearingDown→stateSaving: not valid (onlystateDestroyed,stateErrorare valid)
The guard at cmd/alpine/orchestrator.go:423 only requires --force for stateRunning, stateProvisioning, and stateNew. States like stateCompleted, stateError, and stateTearingDown pass through without any force check and hit the invalid transition at line 431.
For example, tearing down a stateCompleted sandbox (which doesn't require --force) sets it to stateSaving — a transition the state machine explicitly does not allow. This undermines the deterministic lifecycle guarantees described in the PR.
Impact: The lifecycle state machine is defined but not enforced during teardown, allowing sandboxes to enter states through transitions that the state machine considers invalid.
Prompt for agents
In cmd/alpine/orchestrator.go, the teardown method at line 431 sets rec.State = stateSaving without calling transitionLifecycle(). Either: (1) add transitionLifecycle validation before the state change, or (2) for states like stateCompleted that should skip the saving phase, transition directly to stateTearingDown instead of stateSaving. The valid transition table in cmd/alpine/state.go should also be updated if stateCompleted -> stateSaving is intended to be valid. The same pattern applies to stateError and stateTearingDown reaching this code path.
Was this helpful? React with 👍 or 👎 to provide feedback.
- Add Worker scaffold under cloud/opencode-worker with Durable Object bindings - Implement control plane endpoints for sandbox lifecycle (launch/list/status/open/teardown) - Add auth sync script that extracts OpenAI-only credentials from local auth.json - Wire CLI commands to control plane client when control_plane_url is configured - Add Playwright smoke tests for local and live UI verification - Add entrypoint script for auth materialization in sandbox runtime - Update Makefile with worker-dev, worker-deploy, sync-auth, smoke-ui targets - Document deploy vs launch separation in README operational runbook - Add comprehensive test coverage for control plane client and integration
| chown opencode:opencode "$AUTH_DIR_USER/$AUTH_FILE" | ||
| log "Wrote auth to $AUTH_DIR_USER/$AUTH_FILE" | ||
|
|
||
| mkdir -p "$AUTH_DIR_ROOT" | ||
| echo "$auth_json" > "$AUTH_DIR_ROOT/$AUTH_FILE" | ||
| chmod 600 "$AUTH_DIR_ROOT/$AUTH_FILE" | ||
| log "Wrote auth to $AUTH_DIR_ROOT/$AUTH_FILE (compat fallback)" |
There was a problem hiding this comment.
🔴 Entrypoint runs as non-root user but performs root-only operations (chown, write to /root/)
The entrypoint.sh will crash on startup because the Dockerfile sets USER opencode at cloud/opencode-worker/Dockerfile:19, but the script calls chown and writes to /root/ which both require root privileges.
Root Cause and Impact
In setup_auth(), line 36 calls chown opencode:opencode "$AUTH_DIR_USER/$AUTH_FILE" — a non-root user cannot chown files. Line 39 calls mkdir -p "$AUTH_DIR_ROOT" where AUTH_DIR_ROOT="/root/.local/share/opencode" — the opencode user has no write access to /root/. Lines 40-41 also write to /root/. Similarly, bootstrap_opencode_config() at line 74 calls chown opencode:opencode "$config_dir/config.json".
Since the script uses set -euo pipefail, the first failing command (chown at line 36) will cause the container to exit immediately with a non-zero status, preventing OpenCode from ever starting.
Impact: The sandbox container will fail to start every time, making the entire Worker + Sandbox deployment non-functional.
Prompt for agents
In cloud/opencode-worker/entrypoint.sh, the script runs as the non-root 'opencode' user (set by USER opencode in the Dockerfile at line 19), but performs several root-only operations:
1. Lines 36 and 74: chown calls - remove these since the file is already owned by the current user (opencode).
2. Lines 39-42: Writing to /root/.local/share/opencode/ - remove the entire root fallback block since the opencode user cannot write to /root/. If a root compat fallback is truly needed, the Dockerfile should not set USER opencode, or the entrypoint should be restructured to only write to the user's own home directory.
Specifically:
- Remove line 36: chown opencode:opencode "$AUTH_DIR_USER/$AUTH_FILE"
- Remove lines 38-42: the entire block that writes to AUTH_DIR_ROOT
- Remove line 74: chown opencode:opencode "$config_dir/config.json"
Alternatively, if root operations are needed, remove 'USER opencode' from the Dockerfile and handle user switching within the entrypoint using gosu or su-exec.
Was this helpful? React with 👍 or 👎 to provide feedback.
Summary
launch,list,status,open,export, andteardown, while preserving structured error contracts and JSON output.repo,sandbox,durability,github), updated examples/plan artifacts, and added broad test coverage to maintain the repository's 97% threshold.Testing
make testmake lintcmd/alpine/*_test.goPost-Deploy Monitoring & Validation
reason_code,retryable) on launch/export/open failures.make testmake lintalpine launch demo --repo <repo> --jsonalpine status demo --jsonalpine export demo --jsonalpine teardown demo --force --jsonreason_codeon retryable flows, or checkpoint validation not blocking unsafe resume/teardown.