Skip to content

Feature: Cut over Alpine lifecycle to up/down contracts#12

Closed
mkrueger12 wants to merge 2 commits into
mainfrom
refactor/sanbox
Closed

Feature: Cut over Alpine lifecycle to up/down contracts#12
mkrueger12 wants to merge 2 commits into
mainfrom
refactor/sanbox

Conversation

@mkrueger12

@mkrueger12 mkrueger12 commented Feb 19, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Replaced the primary lifecycle flow with alpine up <git-repo> --branch <branch> and alpine down <session-id>, including deterministic session_id/operation_id responses, lifecycle transitions, idempotency handling, and lock semantics.
  • Switched alpine list and alpine status to a session-ledger model with owner scoping, recent-window filtering, explicit recovery guidance, and structured JSON error taxonomy (error_code, cause, next_step, retryable, operation_id).
  • Hard-failed legacy lifecycle commands (create, teardown) with migration guidance, updated README + cutover plan checklist, and expanded tests for lifecycle contracts, persistence compaction, authorization checks, and command behavior.
  • Added a local durable ledger implementation (~/.alpine/sessions.json / ALPINE_LEDGER_PATH) and aligned test coverage enforcement from 97% to 85% to match the new command surface while preserving quality gates in pre-commit and Makefile.

Testing

  • make test
  • make lint
  • go test ./cmd/alpine/...

Post-Deploy Monitoring & Validation

  • What to monitor
    • Logs: CLI error reports containing ERR_OPERATION_CONFLICT, ERR_PERSIST_PAYLOAD_TOO_LARGE, ERR_SESSION_FORBIDDEN, ERR_SANDBOX_PROVISION_FAILED
    • Metrics/Dashboards: CI pass rate for make test/make lint, and issue volume related to up/down/list/status
  • Validation checks:
    • ALPINE_PRINCIPAL_ID=tester GITHUB_TOKEN=dummy alpine up https://github.com/org/repo.git --branch main --json
    • ALPINE_PRINCIPAL_ID=tester alpine list --json
    • ALPINE_PRINCIPAL_ID=tester alpine status <session-id> --json
    • ALPINE_PRINCIPAL_ID=tester alpine down <session-id> --json
  • Expected healthy behavior
    • up returns state=ready with stable identifiers; repeated --client-request-id returns same operation/session identity.
    • down returns state=stopped with verified persistence fields.
    • legacy create/teardown return ERR_COMMAND_REPLACED with migration next_step.
  • Failure signal(s) / rollback trigger
    • Repeated reports of malformed JSON error output, idempotency regressions, or unauthorized cross-owner access behavior.
    • If found, revert to commit daeb3b0 and re-open cutover work behind feature gating.
  • Validation window and owner
    • 48 hours after merge; owner: Alpine CLI maintainers.
  • If no runtime impact:
    • No additional operational monitoring required: this is a CLI-only change with no server-side production data path.

Open with Devin

@devin-ai-integration devin-ai-integration Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 3 potential issues.

View 7 additional findings in Devin Review.

Open in Devin Review

Comment thread cmd/alpine/down.go
"state_transitions": transitions,
"started_at": session.StartedAt,
"stopped_at": session.StoppedAt,
"persisted_at": time.Now().UTC().Format(time.RFC3339),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 persistedPayloadWithCompaction uses time.Now() instead of injectable store.now(), causing timestamp inconsistency

The build() closure inside persistedPayloadWithCompaction hardcodes time.Now().UTC() for the persisted_at field in the durable object payload. This creates two problems:

Root Cause and Impact
  1. Timestamp mismatch: persistSessionToDurableObject at cmd/alpine/down.go:216 computes persistedAt := now.Format(time.RFC3339) using the store.now() parameter. This persistedAt is stored in session.PersistedAt (line 154) and returned to the caller. But the payload's persisted_at at line 274 uses a separate time.Now().UTC() call, so the persisted payload contains a different timestamp than session.PersistedAt and durableObjectRecord.PersistedAt.

  2. Non-deterministic during compaction: build() is called multiple times during compaction loops (lines 293, 306), each generating a new time.Now() value, making the payload's persisted_at drift with each attempt.

  3. Bypasses test injection: The rest of the codebase consistently uses store.now() for time, but this function calls time.Now() directly, making the timestamp untestable.

Impact: The persisted_at inside the durable object payload will not match the persisted_at on the session record or the durable object record, violating the persistence verification contract.

Prompt for agents
In cmd/alpine/down.go, the persistedPayloadWithCompaction function at line 258 needs access to a consistent timestamp instead of calling time.Now().UTC() directly at line 274. Either: (a) add a time.Time parameter to persistedPayloadWithCompaction and pass store.now() from the caller persistSessionToDurableObject, or (b) capture the now value once before the build() closure and use it inside build(). The caller at line 210 (persistSessionToDurableObject) already receives a now parameter — pass it through to persistedPayloadWithCompaction and use it inside the build() closure at line 274 instead of time.Now().UTC(). This ensures the persisted_at in the payload matches the persistedAt stored on the session record and durable object record.
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Comment thread cmd/alpine/down.go
Comment on lines +108 to +112
} else if session.State != StatePersistFailed && session.State != StateStoppedUnpersisted && session.State != StateStopped {
releaseSessionLock(session, operationID)
opErr = newCommandError(1, ErrOperationConflict, fmt.Sprintf("cannot run down from state %q", session.State), "wait for the active operation to finish and retry", true, operationID)
return nil
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 down command rejects sessions stuck in StateStopping or StatePersisting, making them unrecoverable

If a down operation is interrupted after transitioning a session to StateStopping (line 103) but before completing, the session becomes stuck. Retrying down fails because the state check at line 108 only allows StatePersistFailed, StateStoppedUnpersisted, and StateStopped — but not StateStopping or StatePersisting.

Detailed Explanation

The flow in runDown at cmd/alpine/down.go:102-112:

if session.State == StateReady {
    // transition to StateStopping
} else if session.State != StatePersistFailed && session.State != StateStoppedUnpersisted && session.State != StateStopped {
    // reject with ErrOperationConflict
}

Scenario: A session is in StateReady. The first down call transitions it to StateStopping at line 103, then the CLI process crashes or is killed before reaching StatePersisting. The session lock expires after 30 seconds (defaultLockLease at cmd/alpine/session_ledger.go:18). On retry, acquireSessionLock succeeds (lease expired), but the state is StateStopping, which is not StateReady and not in the allowed set {StatePersistFailed, StateStoppedUnpersisted, StateStopped}, so it returns ErrOperationConflict.

The same problem affects StatePersisting — if the process crashes after transitioning to StatePersisting but before completing, the session is also stuck.

Impact: Sessions stuck in intermediate states (StateStopping, StatePersisting) become permanently unrecoverable through the CLI. The only fix is manual ledger editing.

Suggested change
} else if session.State != StatePersistFailed && session.State != StateStoppedUnpersisted && session.State != StateStopped {
releaseSessionLock(session, operationID)
opErr = newCommandError(1, ErrOperationConflict, fmt.Sprintf("cannot run down from state %q", session.State), "wait for the active operation to finish and retry", true, operationID)
return nil
}
} else if session.State != StatePersistFailed && session.State != StateStoppedUnpersisted && session.State != StateStopped && session.State != StateStopping && session.State != StatePersisting {
releaseSessionLock(session, operationID)
opErr = newCommandError(1, ErrOperationConflict, fmt.Sprintf("cannot run down from state %q", session.State), "wait for the active operation to finish and retry", true, operationID)
return nil
}
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Comment thread cmd/alpine/up.go
Comment on lines +100 to +109
result = upOutput{
SessionID: session.SessionID,
OperationID: existing.OperationID,
State: session.State,
OpencodeURL: session.OpencodeURL,
ReadyAt: session.ReadyAt,
BrowserOpened: session.BrowserOpened,
TargetCommitSHA: session.TargetCommitSHA,
}
return nil

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Idempotent retry of failed up returns exit code 0 with state=failed, misleading callers

When alpine up fails (e.g., resolveTargetCommitSHA fails), the failed session is persisted to the ledger with an idempotency record. The error's NextStep tells the user to "verify repository access and retry with the same --client-request-id". But retrying with the same key returns exit code 0 with state=failed in the output — confusing both human users and automation.

Root Cause

The flow in runUp at cmd/alpine/up.go:90-109:

When the idempotency key matches, the code loads the existing failed session and populates result with State: session.State (which is StateFailed). It then returns nil — so opErr stays nil. The caller at lines 221-226 sees no error and outputs the result as a success (exit code 0).

result = upOutput{
    SessionID:   session.SessionID,
    State:       session.State, // StateFailed
    ...
}
return nil // no error → exit code 0

Meanwhile, session.LastError.NextStep at cmd/alpine/up.go:163 says "verify repository access and retry with the same --client-request-id", actively directing users into this misleading path.

Impact: Automation checking exit code 0 would incorrectly treat a failed session as successfully provisioned. Human users following the retry guidance get a confusing "success" response showing a failed state.

Prompt for agents
In cmd/alpine/up.go, the idempotent return path at lines 100-109 needs to check whether the existing session is in a terminal failure state. If the session state is StateFailed (or any terminal error state), the idempotent return should either: (a) set opErr to a commandError so the caller gets a non-zero exit code and structured error output, or (b) clear the existing idempotency record so the operation can be retried. Option (a) is simpler: after populating result at line 100-108, add a check like: if isTerminalState(session.State) && session.State != StateStopped { opErr = newCommandError(2, ErrSandboxProvisionFailed, "previous operation failed", "retry with a new --client-request-id", true, existing.OperationID); return nil }. Also fix the next_step at line 163 to say "retry with a new --client-request-id" instead of "retry with the same --client-request-id".
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

@devin-ai-integration devin-ai-integration Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 new potential issue.

View 9 additional findings in Devin Review.

Open in Devin Review

Comment thread cmd/alpine/up.go
return nil
}

targetCommitSHA, shaErr := resolveTargetCommitSHA(ctx, repoURL, upBranch)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Network call resolveTargetCommitSHA (90s timeout) runs inside exclusive ledger file lock, blocking all concurrent CLI operations

The runUp function calls resolveTargetCommitSHA at line 156 inside the store.withLedger() callback. This callback holds an exclusive LOCK_EX file lock on the session ledger (session_ledger.go:130). resolveTargetCommitSHA executes git ls-remote with a 90-second timeout (up.go:286-287). Additionally, openBrowser at line 196 adds up to 3 more seconds.

Root Cause

The withLedger function at session_ledger.go:118-145 acquires an exclusive flock and only releases it after the callback completes and the ledger is saved:

if err := syscall.Flock(int(lockFile.Fd()), syscall.LOCK_EX); err != nil { ... }
defer syscall.Flock(int(lockFile.Fd()), syscall.LOCK_UN)
// ... callback runs here with lock held ...
return s.saveLedger(ledger)

All other CLI operations (alpine list, alpine status, alpine down, other alpine up calls) also use withLedger and will block on the same flock. With a 90-second timeout for git ls-remote plus a potential 3-second browser open, a single up invocation can block all other CLI operations for over 90 seconds.

Impact: In automation scenarios with concurrent CLI usage (the plan targets 200 concurrent active sessions), this serialization causes significant queueing delays. Even on a single-user machine, running alpine list while up is resolving a branch tip hangs until the network call completes.

Prompt for agents
In cmd/alpine/up.go, the call to resolveTargetCommitSHA (line 156) and openBrowser (line 196) should be moved outside the store.withLedger() callback. The recommended approach is to split the up flow into multiple withLedger calls: (1) first withLedger call to check idempotency and create the initial session record in requested/provisioning/repo_syncing state, (2) perform the network call to resolveTargetCommitSHA and browser open outside the lock, (3) second withLedger call to update the session with the resolved commit SHA, transition to ready state, and finalize the idempotency record. This ensures the exclusive file lock is only held for brief in-memory ledger mutations, not during network I/O.
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

@mkrueger12 mkrueger12 closed this Feb 21, 2026
@mkrueger12 mkrueger12 deleted the refactor/sanbox branch February 21, 2026 21:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant