Skip to content

Feature: Replace Alpine with Cloudflare OpenCode Orchestrator#13

Closed
mkrueger12 wants to merge 4 commits into
mainfrom
feat/cloudflare-opencode-orchestrator
Closed

Feature: Replace Alpine with Cloudflare OpenCode Orchestrator#13
mkrueger12 wants to merge 4 commits into
mainfrom
feat/cloudflare-opencode-orchestrator

Conversation

@mkrueger12

@mkrueger12 mkrueger12 commented Feb 20, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Replaced the primary CLI workflow with cloud-orchestrator commands: launch, list, status, open, export, and teardown, while preserving structured error contracts and JSON output.
  • Added a lifecycle/state orchestration layer with immutable sandbox identity, operation locks, checkpoint validation, resume behavior, export gating, and teardown safety semantics.
  • Migrated configuration/docs to a cloud-first schema (repo, sandbox, durability, github), updated examples/plan artifacts, and added broad test coverage to maintain the repository's 97% threshold.

Testing

  • make test
  • make lint
  • Added/updated command, config, lifecycle, orchestrator, and output contract tests across cmd/alpine/*_test.go

Post-Deploy Monitoring & Validation

  • What to monitor
    • Logs: CLI structured stderr output for reason codes (reason_code, retryable) on launch/export/open failures.
    • Metrics/Dashboards: N/A (no service deployment in this PR).
  • Validation checks:
    • make test
    • make lint
    • alpine launch demo --repo <repo> --json
    • alpine status demo --json
    • alpine export demo --json
    • alpine teardown demo --force --json
  • Expected healthy behavior
    • Lifecycle transitions are deterministic, retries are idempotent, and JSON contracts remain stable.
  • Failure signal(s) / rollback trigger
    • Unexpected exit code category changes, missing reason_code on retryable flows, or checkpoint validation not blocking unsafe resume/teardown.
  • Validation window and owner
    • Validation window: immediately after merge and during first operator workflow.
    • Owner: PR author.
  • If no runtime impact:
    • No additional operational monitoring required: this PR changes a local CLI binary and test/docs assets only (no production service rollout).

Open with Devin

@devin-ai-integration devin-ai-integration Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 3 potential issues.

View 8 additional findings in Devin Review.

Open in Devin Review

Comment thread cmd/alpine/config.go

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 YAML partial override of sandbox section silently zeroes auto_teardown default from true to false

When an alpine.yaml file specifies the sandbox: section with only some fields (e.g., just image_profile), yaml.Unmarshal replaces the entire SandboxConfig struct, zeroing the AutoTeardown field from its default true to false.

Root Cause and Impact

In cmd/alpine/config.go:57-96, loadConfig() initializes defaults including AutoTeardown: true, then calls yaml.Unmarshal(data, cfg). When the YAML contains a sandbox: key, the YAML library replaces the entire SandboxConfig struct value. Fields not present in the YAML get their zero values: AutoTeardown becomes false (bool zero), CompletedRetentionMins becomes 0 (int zero).

For example, this alpine.yaml:

sandbox:
  image_profile: heavy
  image_profiles:
    heavy: opencode-heavy
  web_base_url: https://sandbox.example.com

would silently set AutoTeardown = false and CompletedRetentionMins = 0, even though the user never explicitly disabled auto-teardown. The validation at cmd/alpine/config.go:124 only checks CompletedRetentionMins < 0, so 0 passes. There is no validation for AutoTeardown since false is a valid value.

Impact: Users who partially configure the sandbox section without explicitly setting auto_teardown: true will silently lose the auto-teardown behavior, which is the documented default. This is a surprising behavior change that could lead to sandbox resource leaks.

(Refers to lines 88-89)

Prompt for agents
In cmd/alpine/config.go, the loadConfig() function initializes defaults in the Config struct and then calls yaml.Unmarshal which replaces entire sub-structs when the corresponding YAML key is present. This causes boolean defaults like AutoTeardown (default true) to silently become false when the user only partially specifies the sandbox section. To fix this, either: (1) Apply defaults AFTER unmarshaling by checking zero values and setting defaults for fields like AutoTeardown and CompletedRetentionMins, or (2) Use a two-pass approach where you unmarshal into a separate struct with pointer fields to distinguish between 'not set' and 'set to zero/false', then merge with defaults. The simplest approach is option 1: after yaml.Unmarshal, re-apply defaults for fields that have zero values but should have non-zero defaults.
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Comment on lines +365 to +371
rec.ExportLock = true
rec.OperationLock = o.newOperationLock("export")
rec.LastExportBranch = branch
rec.LastExportAt = now
rec.LastActivityAt = now
rec.UpdatedAt = now
rec.ExportLock = false

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 ExportLock is set to true then immediately set to false before persisting, making it ineffective

In the export method, rec.ExportLock is set to true on line 365 and then immediately set back to false on line 371, before saveStore is called on line 385. The lock is never actually persisted to the state file.

Root Cause and Impact

The code at cmd/alpine/orchestrator.go:365-371:

rec.ExportLock = true
rec.OperationLock = o.newOperationLock("export")
rec.LastExportBranch = branch
rec.LastExportAt = now
rec.LastActivityAt = now
rec.UpdatedAt = now
rec.ExportLock = false

The ExportLock field is documented as a precondition for auto-teardown (the plan states: "Auto-teardown preconditions: completion signal detected, durable save verified, no active export lock"). Since the lock is never actually persisted as true, it cannot serve as a guard against concurrent teardown during export. The auto-teardown code at cmd/alpine/orchestrator.go:374 checks rec.State == stateCompleted && o.config.Sandbox.AutoTeardown but the export lock that should protect against premature teardown is always false.

Impact: The export lock mechanism is completely broken. Any concurrent operation or auto-teardown policy that checks ExportLock will never see it as true, defeating the purpose of the lock.

Prompt for agents
In cmd/alpine/orchestrator.go, the export method at lines 365-371 sets ExportLock = true and then immediately sets it back to false before saveStore is called. The lock should be set to true before the export operation begins, persisted, and then set to false after the export completes (or at least after the auto-teardown decision is made). Move the ExportLock = false assignment to after the auto-teardown logic block (after line 383), so the lock is actually effective during the export window. Alternatively, if the intent is to hold the lock only during the export operation, consider saving the store twice: once with ExportLock=true before the export, and once with ExportLock=false after.
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

rec.OperationLock = o.newOperationLock("teardown")
rec.UpdatedAt = now
rec.LastActivityAt = now
rec.State = stateSaving

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Teardown bypasses lifecycle state machine validation when transitioning to stateSaving

The teardown method at line 431 directly sets rec.State = stateSaving without calling transitionLifecycle(), allowing invalid state transitions that violate the defined lifecycle state machine.

Detailed Explanation

The lifecycle state machine in cmd/alpine/state.go:18-55 defines valid transitions. Several states that can reach line 431 in teardown do NOT have stateSaving as a valid target:

  • stateCompletedstateSaving: not valid (only stateRunning, stateTearingDown, stateError are valid)
  • stateErrorstateSaving: not valid (only stateRunning is valid)
  • stateTearingDownstateSaving: not valid (only stateDestroyed, stateError are valid)

The guard at cmd/alpine/orchestrator.go:423 only requires --force for stateRunning, stateProvisioning, and stateNew. States like stateCompleted, stateError, and stateTearingDown pass through without any force check and hit the invalid transition at line 431.

For example, tearing down a stateCompleted sandbox (which doesn't require --force) sets it to stateSaving — a transition the state machine explicitly does not allow. This undermines the deterministic lifecycle guarantees described in the PR.

Impact: The lifecycle state machine is defined but not enforced during teardown, allowing sandboxes to enter states through transitions that the state machine considers invalid.

Prompt for agents
In cmd/alpine/orchestrator.go, the teardown method at line 431 sets rec.State = stateSaving without calling transitionLifecycle(). Either: (1) add transitionLifecycle validation before the state change, or (2) for states like stateCompleted that should skip the saving phase, transition directly to stateTearingDown instead of stateSaving. The valid transition table in cmd/alpine/state.go should also be updated if stateCompleted -> stateSaving is intended to be valid. The same pattern applies to stateError and stateTearingDown reaching this code path.
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

- Add Worker scaffold under cloud/opencode-worker with Durable Object bindings
- Implement control plane endpoints for sandbox lifecycle (launch/list/status/open/teardown)
- Add auth sync script that extracts OpenAI-only credentials from local auth.json
- Wire CLI commands to control plane client when control_plane_url is configured
- Add Playwright smoke tests for local and live UI verification
- Add entrypoint script for auth materialization in sandbox runtime
- Update Makefile with worker-dev, worker-deploy, sync-auth, smoke-ui targets
- Document deploy vs launch separation in README operational runbook
- Add comprehensive test coverage for control plane client and integration

@devin-ai-integration devin-ai-integration Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 new potential issue.

View 13 additional findings in Devin Review.

Open in Devin Review

Comment on lines +36 to +42
chown opencode:opencode "$AUTH_DIR_USER/$AUTH_FILE"
log "Wrote auth to $AUTH_DIR_USER/$AUTH_FILE"

mkdir -p "$AUTH_DIR_ROOT"
echo "$auth_json" > "$AUTH_DIR_ROOT/$AUTH_FILE"
chmod 600 "$AUTH_DIR_ROOT/$AUTH_FILE"
log "Wrote auth to $AUTH_DIR_ROOT/$AUTH_FILE (compat fallback)"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Entrypoint runs as non-root user but performs root-only operations (chown, write to /root/)

The entrypoint.sh will crash on startup because the Dockerfile sets USER opencode at cloud/opencode-worker/Dockerfile:19, but the script calls chown and writes to /root/ which both require root privileges.

Root Cause and Impact

In setup_auth(), line 36 calls chown opencode:opencode "$AUTH_DIR_USER/$AUTH_FILE" — a non-root user cannot chown files. Line 39 calls mkdir -p "$AUTH_DIR_ROOT" where AUTH_DIR_ROOT="/root/.local/share/opencode" — the opencode user has no write access to /root/. Lines 40-41 also write to /root/. Similarly, bootstrap_opencode_config() at line 74 calls chown opencode:opencode "$config_dir/config.json".

Since the script uses set -euo pipefail, the first failing command (chown at line 36) will cause the container to exit immediately with a non-zero status, preventing OpenCode from ever starting.

Impact: The sandbox container will fail to start every time, making the entire Worker + Sandbox deployment non-functional.

Prompt for agents
In cloud/opencode-worker/entrypoint.sh, the script runs as the non-root 'opencode' user (set by USER opencode in the Dockerfile at line 19), but performs several root-only operations:

1. Lines 36 and 74: chown calls - remove these since the file is already owned by the current user (opencode).
2. Lines 39-42: Writing to /root/.local/share/opencode/ - remove the entire root fallback block since the opencode user cannot write to /root/. If a root compat fallback is truly needed, the Dockerfile should not set USER opencode, or the entrypoint should be restructured to only write to the user's own home directory.

Specifically:
- Remove line 36: chown opencode:opencode "$AUTH_DIR_USER/$AUTH_FILE"
- Remove lines 38-42: the entire block that writes to AUTH_DIR_ROOT
- Remove line 74: chown opencode:opencode "$config_dir/config.json"

Alternatively, if root operations are needed, remove 'USER opencode' from the Dockerfile and handle user switching within the entrypoint using gosu or su-exec.
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

@mkrueger12 mkrueger12 closed this Feb 21, 2026
@mkrueger12 mkrueger12 deleted the feat/cloudflare-opencode-orchestrator branch February 21, 2026 21:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant