Feature: Cut over Alpine lifecycle to up/down contracts#12
Conversation
| "state_transitions": transitions, | ||
| "started_at": session.StartedAt, | ||
| "stopped_at": session.StoppedAt, | ||
| "persisted_at": time.Now().UTC().Format(time.RFC3339), |
There was a problem hiding this comment.
🟡 persistedPayloadWithCompaction uses time.Now() instead of injectable store.now(), causing timestamp inconsistency
The build() closure inside persistedPayloadWithCompaction hardcodes time.Now().UTC() for the persisted_at field in the durable object payload. This creates two problems:
Root Cause and Impact
-
Timestamp mismatch:
persistSessionToDurableObjectatcmd/alpine/down.go:216computespersistedAt := now.Format(time.RFC3339)using thestore.now()parameter. ThispersistedAtis stored insession.PersistedAt(line 154) and returned to the caller. But the payload'spersisted_atat line 274 uses a separatetime.Now().UTC()call, so the persisted payload contains a different timestamp thansession.PersistedAtanddurableObjectRecord.PersistedAt. -
Non-deterministic during compaction:
build()is called multiple times during compaction loops (lines 293, 306), each generating a newtime.Now()value, making the payload'spersisted_atdrift with each attempt. -
Bypasses test injection: The rest of the codebase consistently uses
store.now()for time, but this function callstime.Now()directly, making the timestamp untestable.
Impact: The persisted_at inside the durable object payload will not match the persisted_at on the session record or the durable object record, violating the persistence verification contract.
Prompt for agents
In cmd/alpine/down.go, the persistedPayloadWithCompaction function at line 258 needs access to a consistent timestamp instead of calling time.Now().UTC() directly at line 274. Either: (a) add a time.Time parameter to persistedPayloadWithCompaction and pass store.now() from the caller persistSessionToDurableObject, or (b) capture the now value once before the build() closure and use it inside build(). The caller at line 210 (persistSessionToDurableObject) already receives a now parameter — pass it through to persistedPayloadWithCompaction and use it inside the build() closure at line 274 instead of time.Now().UTC(). This ensures the persisted_at in the payload matches the persistedAt stored on the session record and durable object record.
Was this helpful? React with 👍 or 👎 to provide feedback.
| } else if session.State != StatePersistFailed && session.State != StateStoppedUnpersisted && session.State != StateStopped { | ||
| releaseSessionLock(session, operationID) | ||
| opErr = newCommandError(1, ErrOperationConflict, fmt.Sprintf("cannot run down from state %q", session.State), "wait for the active operation to finish and retry", true, operationID) | ||
| return nil | ||
| } |
There was a problem hiding this comment.
🔴 down command rejects sessions stuck in StateStopping or StatePersisting, making them unrecoverable
If a down operation is interrupted after transitioning a session to StateStopping (line 103) but before completing, the session becomes stuck. Retrying down fails because the state check at line 108 only allows StatePersistFailed, StateStoppedUnpersisted, and StateStopped — but not StateStopping or StatePersisting.
Detailed Explanation
The flow in runDown at cmd/alpine/down.go:102-112:
if session.State == StateReady {
// transition to StateStopping
} else if session.State != StatePersistFailed && session.State != StateStoppedUnpersisted && session.State != StateStopped {
// reject with ErrOperationConflict
}Scenario: A session is in StateReady. The first down call transitions it to StateStopping at line 103, then the CLI process crashes or is killed before reaching StatePersisting. The session lock expires after 30 seconds (defaultLockLease at cmd/alpine/session_ledger.go:18). On retry, acquireSessionLock succeeds (lease expired), but the state is StateStopping, which is not StateReady and not in the allowed set {StatePersistFailed, StateStoppedUnpersisted, StateStopped}, so it returns ErrOperationConflict.
The same problem affects StatePersisting — if the process crashes after transitioning to StatePersisting but before completing, the session is also stuck.
Impact: Sessions stuck in intermediate states (StateStopping, StatePersisting) become permanently unrecoverable through the CLI. The only fix is manual ledger editing.
| } else if session.State != StatePersistFailed && session.State != StateStoppedUnpersisted && session.State != StateStopped { | |
| releaseSessionLock(session, operationID) | |
| opErr = newCommandError(1, ErrOperationConflict, fmt.Sprintf("cannot run down from state %q", session.State), "wait for the active operation to finish and retry", true, operationID) | |
| return nil | |
| } | |
| } else if session.State != StatePersistFailed && session.State != StateStoppedUnpersisted && session.State != StateStopped && session.State != StateStopping && session.State != StatePersisting { | |
| releaseSessionLock(session, operationID) | |
| opErr = newCommandError(1, ErrOperationConflict, fmt.Sprintf("cannot run down from state %q", session.State), "wait for the active operation to finish and retry", true, operationID) | |
| return nil | |
| } |
Was this helpful? React with 👍 or 👎 to provide feedback.
| result = upOutput{ | ||
| SessionID: session.SessionID, | ||
| OperationID: existing.OperationID, | ||
| State: session.State, | ||
| OpencodeURL: session.OpencodeURL, | ||
| ReadyAt: session.ReadyAt, | ||
| BrowserOpened: session.BrowserOpened, | ||
| TargetCommitSHA: session.TargetCommitSHA, | ||
| } | ||
| return nil |
There was a problem hiding this comment.
🔴 Idempotent retry of failed up returns exit code 0 with state=failed, misleading callers
When alpine up fails (e.g., resolveTargetCommitSHA fails), the failed session is persisted to the ledger with an idempotency record. The error's NextStep tells the user to "verify repository access and retry with the same --client-request-id". But retrying with the same key returns exit code 0 with state=failed in the output — confusing both human users and automation.
Root Cause
The flow in runUp at cmd/alpine/up.go:90-109:
When the idempotency key matches, the code loads the existing failed session and populates result with State: session.State (which is StateFailed). It then returns nil — so opErr stays nil. The caller at lines 221-226 sees no error and outputs the result as a success (exit code 0).
result = upOutput{
SessionID: session.SessionID,
State: session.State, // StateFailed
...
}
return nil // no error → exit code 0Meanwhile, session.LastError.NextStep at cmd/alpine/up.go:163 says "verify repository access and retry with the same --client-request-id", actively directing users into this misleading path.
Impact: Automation checking exit code 0 would incorrectly treat a failed session as successfully provisioned. Human users following the retry guidance get a confusing "success" response showing a failed state.
Prompt for agents
In cmd/alpine/up.go, the idempotent return path at lines 100-109 needs to check whether the existing session is in a terminal failure state. If the session state is StateFailed (or any terminal error state), the idempotent return should either: (a) set opErr to a commandError so the caller gets a non-zero exit code and structured error output, or (b) clear the existing idempotency record so the operation can be retried. Option (a) is simpler: after populating result at line 100-108, add a check like: if isTerminalState(session.State) && session.State != StateStopped { opErr = newCommandError(2, ErrSandboxProvisionFailed, "previous operation failed", "retry with a new --client-request-id", true, existing.OperationID); return nil }. Also fix the next_step at line 163 to say "retry with a new --client-request-id" instead of "retry with the same --client-request-id".
Was this helpful? React with 👍 or 👎 to provide feedback.
| return nil | ||
| } | ||
|
|
||
| targetCommitSHA, shaErr := resolveTargetCommitSHA(ctx, repoURL, upBranch) |
There was a problem hiding this comment.
🔴 Network call resolveTargetCommitSHA (90s timeout) runs inside exclusive ledger file lock, blocking all concurrent CLI operations
The runUp function calls resolveTargetCommitSHA at line 156 inside the store.withLedger() callback. This callback holds an exclusive LOCK_EX file lock on the session ledger (session_ledger.go:130). resolveTargetCommitSHA executes git ls-remote with a 90-second timeout (up.go:286-287). Additionally, openBrowser at line 196 adds up to 3 more seconds.
Root Cause
The withLedger function at session_ledger.go:118-145 acquires an exclusive flock and only releases it after the callback completes and the ledger is saved:
if err := syscall.Flock(int(lockFile.Fd()), syscall.LOCK_EX); err != nil { ... }
defer syscall.Flock(int(lockFile.Fd()), syscall.LOCK_UN)
// ... callback runs here with lock held ...
return s.saveLedger(ledger)All other CLI operations (alpine list, alpine status, alpine down, other alpine up calls) also use withLedger and will block on the same flock. With a 90-second timeout for git ls-remote plus a potential 3-second browser open, a single up invocation can block all other CLI operations for over 90 seconds.
Impact: In automation scenarios with concurrent CLI usage (the plan targets 200 concurrent active sessions), this serialization causes significant queueing delays. Even on a single-user machine, running alpine list while up is resolving a branch tip hangs until the network call completes.
Prompt for agents
In cmd/alpine/up.go, the call to resolveTargetCommitSHA (line 156) and openBrowser (line 196) should be moved outside the store.withLedger() callback. The recommended approach is to split the up flow into multiple withLedger calls: (1) first withLedger call to check idempotency and create the initial session record in requested/provisioning/repo_syncing state, (2) perform the network call to resolveTargetCommitSHA and browser open outside the lock, (3) second withLedger call to update the session with the resolved commit SHA, transition to ready state, and finalize the idempotency record. This ensures the exclusive file lock is only held for brief in-memory ledger mutations, not during network I/O.
Was this helpful? React with 👍 or 👎 to provide feedback.
Summary
alpine up <git-repo> --branch <branch>andalpine down <session-id>, including deterministicsession_id/operation_idresponses, lifecycle transitions, idempotency handling, and lock semantics.alpine listandalpine statusto a session-ledger model with owner scoping, recent-window filtering, explicit recovery guidance, and structured JSON error taxonomy (error_code,cause,next_step,retryable,operation_id).create,teardown) with migration guidance, updated README + cutover plan checklist, and expanded tests for lifecycle contracts, persistence compaction, authorization checks, and command behavior.~/.alpine/sessions.json/ALPINE_LEDGER_PATH) and aligned test coverage enforcement from 97% to 85% to match the new command surface while preserving quality gates in pre-commit and Makefile.Testing
make testmake lintgo test ./cmd/alpine/...Post-Deploy Monitoring & Validation
ERR_OPERATION_CONFLICT,ERR_PERSIST_PAYLOAD_TOO_LARGE,ERR_SESSION_FORBIDDEN,ERR_SANDBOX_PROVISION_FAILEDmake test/make lint, and issue volume related toup/down/list/statusALPINE_PRINCIPAL_ID=tester GITHUB_TOKEN=dummy alpine up https://github.com/org/repo.git --branch main --jsonALPINE_PRINCIPAL_ID=tester alpine list --jsonALPINE_PRINCIPAL_ID=tester alpine status <session-id> --jsonALPINE_PRINCIPAL_ID=tester alpine down <session-id> --jsonupreturnsstate=readywith stable identifiers; repeated--client-request-idreturns same operation/session identity.downreturnsstate=stoppedwith verified persistence fields.create/teardownreturnERR_COMMAND_REPLACEDwith migrationnext_step.daeb3b0and re-open cutover work behind feature gating.