Implement: Helm base chart for the reference deployment (#171) by ealt · Pull Request #298 · ealt/eden

ealt · 2026-06-09T14:53:13Z

Summary

Phase 13a: a Helm chart at reference/helm/eden/ that deploys the EDEN reference services on any conforming Kubernetes cluster (1.27+), parallel to the Compose stack. This PR ships the chart ported to the post-#128 opaque-id identity model, with the default being the single-experiment path that compose-smoke validates.

Post-#128 identity model (#128: Disambiguate user-facing names from system-generated ids for experiments, workers, groups). setup-experiment-helm.sh mints the opaque exp_* id + reserved admins/orchestrators groups + one wkr_* worker per service against the task-store-server (POST {"name": ...} → worker_id + one-time registration_token), mirroring the canonical Compose setup-experiment.sh. Each pod gets its minted --worker-id and a per-service identity Secret whose token an initContainer installs at the path the service's credential bootstrap reads (worker hosts: /var/lib/eden/credentials/<worker_id>.token; web-ui: .../<experiment_id>/<worker_id>.token). The pod verifies the token at startup via /whoami, so no admin reissue is needed. Three-phase bring-up (infra → task-store-server → mint identities → app tier) staged by two render gates (experiment.baseCommitSha, identity.<svc>.workerId+token).
Lease-driven HA is opt-in (orchestrator.leaseMode.enabled, default false) and deferred behind #281 (Reconcile #128 opaque-id minting with #147's lease-mode orchestrator; multi-experiment smoke gap). The default deploys no control plane, sets no EDEN_CONTROL_PLANE_URL, and runs one orchestrator.
Bundled orchestrator change (Decision 9): --max-quiescent-iterations 0 is a "never exit on quiescence" sentinel (for Kubernetes restartPolicy: Always); single-experiment mode honors 0 as a substrate override of the #157 config-wins behavior (#157: Audit deployment CLI flags for promotion to experiment-config fields).
Reviewed to convergence over three codex:rescue rounds (record under docs/plans/review/eden-phase-13a-helm-base-chart/impl/).

Advances Phase 13a (first Kubernetes-substrate chunk). Spec ↔ impl: no spec change — the protocol is unchanged; only the deployment substrate is new. The one impl change (orchestrator quiescence sentinel) is a substrate accommodation, not a protocol change.

What this does NOT cover

Lease-driven HA (multi-replica orchestrators + control plane). Wired as an opt-in toggle but deferred and unvalidated behind #281 (reconcile #128 opaque-id minting with #147's lease-mode orchestrator self-registration). Re-enable once #281 lands. Only the single-experiment default is exercised by helm-smoke.
helm-upgrade-smoke CI job (plan §6.3): deferred to #284.
Cross-service artifact serving on Kubernetes (needs RWX / external blob backend): deferred to #285.
helm-lint/helm-smoke required-status bump: deferred to #286 (newly-added-smoke-job convention: bump after ~2 weeks clean on main).
Kubernetes-native --mode subprocess + DooD worker hosts: 13a ships only --mode scripted; deferred to #291.
Subsequent Phase 13 substrate work (GPU executor Job 13b, managed Postgres 13c, S3/GCS 13d, Forgejo auth 13e) is pre-planned roadmap scope, not a gap.

Fresh-operator walkthrough

This PR changes operator-facing surfaces (the chart templates, setup-experiment-helm.sh, the web-ui --credential-dir flag).

The fresh-operator walkthrough is the helm-smoke CI job: it spins up a kind cluster, builds + loads the image, runs setup-experiment-helm.sh against the fixture exactly as a fresh operator would, and asserts the integration end-state (≥3 variant.integrated, ≥9 task.completed, ≥3 ideation task.completed). The local equivalent is bash reference/helm/eden/ci-smoke.sh.
Notes: A full local kind deploy was not run (kind not installed on the dev host); the operator path is validated by helm-smoke in CI. Locally validated: helm lint, helm template for both the single-experiment and lease-mode stacks (rendered manifests parse as valid Kubernetes objects), the schema fail-closed checks (empty image, half-set identity), and the per-service token-install paths against what each service's credential resolver reads. Watching helm-smoke through to green is the gating walkthrough for this PR.
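The end-state thresholds that helm-smoke asserts can be written out as a small predicate. This is a minimal sketch only: the event dictionaries and their `type` / `kind` field names are hypothetical stand-ins, not the actual EDEN event schema or the real ci-smoke.sh logic.

```python
from collections import Counter


def smoke_end_state_ok(events):
    """Return True when the event log meets the helm-smoke thresholds.

    `events` is a list of dicts; the "type" and "kind" keys are
    assumptions made for this illustration.
    """
    by_type = Counter(event["type"] for event in events)
    ideation_completed = sum(
        1
        for event in events
        if event["type"] == "task.completed" and event.get("kind") == "ideation"
    )
    return (
        by_type["variant.integrated"] >= 3
        and by_type["task.completed"] >= 9
        and ideation_completed >= 3
    )
```

Note that the ideation completions count toward the overall task.completed total, so three ideation tasks plus six others is the smallest passing log under these assumptions.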

Test plan

Related issues

Closes #171: Phase 13a, base Helm chart for the reference services
Refs #281: lease-mode HA reconciliation (deferred; opt-in toggle wired)
Refs #284 / #285 / #286 / #291: deferred follow-ups enumerated above
Triage: see docs/triage.md.

🤖 Generated with Claude Code

Add a Helm chart (reference/helm/eden/) deploying the nine reference services on Kubernetes 1.27+, parallel to the Compose stack. StatefulSets for clone-holding services (orchestrator, executor/evaluator hosts, web-ui) + Postgres/Forgejo; Deployments for the stateless services. Bundles the orchestrator --max-quiescent-iterations 0 "never exit" sentinel (Decision 9) so the lease-driven orchestrator runs forever on restartPolicy: Always. Two-phase bootstrap via setup-experiment-helm.sh (infra → seed → app tier with the seed SHA), helm-lint + helm-smoke CI jobs, operator docs. Deferrals: helm-upgrade-smoke (#284), cross-service artifact serving (#285), required-status bump (#286). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…me derivation

- Split postgres/forgejo pullPolicy from image.pullPolicy so the kind-loaded app image (Never) doesn't block upstream dependency image pulls.
- setup-experiment-helm.sh seeds the task-store orchestrators group with the orchestrator pod worker_ids (lease-driven mode self-joins only the control-plane group; #254 tracks the in-orchestrator fix).
- Derive resource names from the helm fullname + honor secrets.existingSecret; parse image refs on the last colon (registry ports); propagate imagePullSecrets to the repo-init Job.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
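Splitting an image reference on the last colon matters because a registry host may carry a port. A minimal Python sketch of the idea follows; the real logic lives in the bash setup script, and `split_image_ref` is a hypothetical name. Digest references (`@sha256:...`) are deliberately out of scope here.

```python
def split_image_ref(ref: str) -> tuple[str, str]:
    """Split an image reference into (repository, tag) on the LAST colon.

    Returns an empty tag when the reference has none. A "/" after the
    last colon means that colon belonged to a registry port, not a tag.
    """
    repository, sep, tag = ref.rpartition(":")
    if not sep or "/" in tag:
        return ref, ""
    return repository, tag
```

With this, `localhost:5000/eden:dev` splits into `localhost:5000/eden` and `dev`, whereas splitting on the first colon would cut the reference at the registry port.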

…L dep, CI bucket

- Seed task-store groups (incl. orchestrator pods) BEFORE registering the experiment with the control plane, so baseline creation isn't 403'd in the lease-acquisition window.
- --set-string for image.repository/tag so numeric/bool-looking tags survive.
- Pass the experiment config verbatim via --set-file experiment.configRaw (new chart value); drops the script's PyYAML dependency entirely (and the pyyaml/setup-python CI steps).
- Add reference/compose/Dockerfile + .dockerignore to the helm changes bucket so helm-smoke runs when the image under test changes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…docs

- Mount clone/artifacts PVCs via subPath so the clone target is an empty subdir: a fresh ext4 PVC root carries lost+found, and git clone --bare refuses a non-empty destination (orchestrator/executor/evaluator/web-ui).
- Derive a DNS-safe repo-init Job name (sanitize + sha1 suffix) so experiment ids with underscores/uppercase don't get rejected by kubectl.
- values.schema.json caps replicas.webUi at 1 (operator-singleton in v0).
- docs/deployment/helm.md: direct second experiments to a separate release/namespace instead of rewriting the release-wide experiment.id.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…rship gap)

kubelet creates subPath dirs root-owned AFTER the fsGroup pass, so rootless pods can't write subPath mounts on a fresh PVC. Replace the round-3 subPath mounts:

- Repo clones (orchestrator/executor/evaluator/web-ui): mount the PVC at a dedicated parent (/var/lib/eden/clone) and clone into the child repo dir (--repo-path .../clone/repo). fsGroup makes the volume root writable and git creates the empty child, which solves both lost+found and the ownership gap with no initContainer.
- web-ui artifacts: mount at the volume root (fsGroup-writable; stray lost+found is harmless to artifact serving).
- Forgejo: persist only /var/lib/gitea (data PVC, no subPath); /etc/gitea is a fresh emptyDir regenerated from the FORGEJO__* env each boot (SECRET_KEY / INTERNAL_TOKEN come from the persistent Secret, so it stays consistent).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- Register config_uri as file:///etc/eden/experiment-config.yaml (the mounted config resource per chapter 11), not the Forgejo git remote.
- docs/deployment/helm.md: reframe plain `helm install` as chart-resources-only (NOT a complete bootstrap) and point operators at setup-experiment-helm.sh.
- eden.fullname truncates to 40 (not 63) to reserve headroom for the longest per-resource suffix (-git-credential-helper, 22 chars), keeping every name within the 63-char DNS label limit; the setup script mirrors the truncation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
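The name budget behind the 40-character truncation can be checked with a few lines of arithmetic. This is an illustrative check only; the constant names are made up for the example, while the numbers come from the commit message.

```python
DNS_LABEL_MAX = 63                          # Kubernetes DNS label limit
FULLNAME_MAX = 40                           # eden.fullname truncation length
LONGEST_SUFFIX = "-git-credential-helper"   # longest per-resource suffix

# Worst case: a fullname at the cap plus the longest suffix.
worst_case_length = FULLNAME_MAX + len(LONGEST_SUFFIX)
headroom = DNS_LABEL_MAX - worst_case_length
```

The suffix is 22 characters, so the worst case is 62 characters, one under the limit.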

…ating

- setup-experiment-helm.sh preserves the existing baseCommitSha on rerun, so phase 1 no longer tears the live app tier down (repo-init is idempotent and phase 2 re-applies the same SHA).
- Ideator host (a Deployment) uses a fixed config.ideatorWorkerId (ideator-1) instead of $(POD_NAME): Deployment pod names can exceed the 64-char worker_id grammar for long release names. Executor/evaluator StatefulSets keep their bounded POD_NAME ids.
- Add tests/fixtures/** to the helm changes bucket so a fixture change runs the helm jobs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- repo-init Job seeds into /var/lib/eden/seed/repo (a child of the staging mount) so the repo dir is eden-owned and `git push` doesn't trip git's dubious-ownership check (fsGroup leaves the mount root root-owned). Mirrors the StatefulSet clone pattern.
- values.schema.json caps replicas.ideatorHost at 1 (operator-singleton; fixed worker id), matching replicas.webUi.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ap Job name

- Add checksum/secret + checksum/{experiment-config,git-credential-helper} rollout annotations to all 9 pod templates, so a re-run that rotates a secret or edits the config forces a rolling restart instead of leaving pods with stale envFrom values.
- Drop nameOverride support from the helpers (the setup script derives fullname in bash; a values-only override would silently diverge).
- Cap the repo-init Job name at 63 chars while preserving the hash suffix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
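The Job-name derivation (sanitize, append a sha1 suffix, cap at 63 characters while keeping the suffix) could look roughly like the following. This is a sketch under assumptions: the function name, the `eden-repo-init` prefix used in the usage note, and the 8-character hash length are all hypothetical, and the prefix is assumed to already be lowercase and DNS-safe.

```python
import hashlib
import re


def repo_init_job_name(prefix: str, experiment_id: str,
                       limit: int = 63, hash_len: int = 8) -> str:
    """Build a DNS-safe Job name that always ends in a hash of the raw id."""
    # Hash the ORIGINAL id so ids that sanitize to the same slug stay distinct.
    digest = hashlib.sha1(experiment_id.encode("utf-8")).hexdigest()[:hash_len]
    # Lowercase, then collapse anything outside [a-z0-9-] into hyphens.
    slug = re.sub(r"[^a-z0-9-]+", "-", experiment_id.lower()).strip("-")
    head = f"{prefix}-{slug}".strip("-")
    # Truncate the head only, reserving room for "-<digest>".
    head = head[: limit - hash_len - 1].rstrip("-")
    return f"{head}-{digest}" if head else digest
```

For example, `repo_init_job_name("eden-repo-init", "exp_ABC_123")` yields a lowercase, hyphen-separated name; truncating the head rather than the whole string is what keeps the hash suffix intact for very long ids.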

- Ideator Deployment uses strategy: Recreate so a rolling update doesn't run overlapping old+new pods sharing the one fixed --worker-id (credential race).
- CHANGELOG: link the subprocess/DooD deferral to #291 and reframe the 13b-13e roadmap phases as pre-planned scope (not deferrals introduced by this chunk), satisfying the deferral-tracking discipline.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ll hazard

- ci-smoke.sh only deletes the kind cluster it created (refuses to reuse/delete a pre-existing cluster of the same name).
- Document the chart-managed-Secret vs retained-PVC reinstall hazard (regenerated POSTGRES_PASSWORD can't auth against the retained data dir) in secret.yaml, the chart README, and docs/deployment/helm.md §5; steer reinstallable deployments to secrets.existingSecret.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The setup script's fullname trim used s/-*$// (strips a hyphen run); Helm's trimSuffix "-" strips a single trailing hyphen. For custom release names whose first 40 chars end in multiple hyphens the script would target names Helm never rendered. Use s/-$// to match exactly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
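The mismatch is easy to reproduce outside Helm. The sketch below models both trims in Python, assuming the fullname is produced by truncating to 40 characters and then trimming; the function names are illustrative and the real implementations are a Helm template helper and a sed expression.

```python
def helm_fullname(name: str, limit: int = 40) -> str:
    """Model of truncate-then-trimSuffix "-": drop at most ONE trailing hyphen."""
    truncated = name[:limit]
    return truncated[:-1] if truncated.endswith("-") else truncated


def greedy_fullname(name: str, limit: int = 40) -> str:
    """Model of the old script behavior (s/-*$//): drop the whole hyphen run."""
    return name[:limit].rstrip("-")
```

The two agree whenever the truncated name ends in zero or one hyphen and diverge only when it ends in two or more, which is why the bug was limited to unusual custom release names.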

…ale-up note

- README: a pre-seeded plain `helm install` only satisfies the app-tier render gate; it still needs experiment registration + group bootstrap. Point to setup-experiment-helm.sh.
- Document that scaling replicas.orchestrator up requires re-running setup-experiment-helm.sh to add the new pod ids to the task-store orchestrators group (until the in-orchestrator self-join, #254); noted in docs/deployment/helm.md §5 and the script's success summary.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The out-of-tree repo-init Job now renders nodeSelector/tolerations/affinity from the release values (as JSON, which is valid YAML — keeps the script YAML-library-free), so the seed Job schedules on the same constrained/tainted nodes the chart pods target instead of going Pending. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… order

Three fixes surfaced by rebasing the 13a branch onto current main and running the full validation suite (none were caught by the prior 13 codex rounds because they only emerged post-rebase / were never run):

- loop.py: the Decision-9 docstring + inline sentinel comments pushed run_orchestrator_loop to 105 lines, tripping the complexity gate (>100). Extract a named `_quiescence_exit_reached(max, quiescent)` predicate and co-locate the sentinel docstring with it. No behavior change; the function is back under threshold and the sentinel logic is now self-documenting (no inline comments needed).
- test_loop_unit.py: `test_loop_never_exits_on_quiescence_when_max_is_zero` used `updated_by="orchestrator"` / `terminated_by="orchestrator"`, which the #128 actor-id grammar (admin|wkr_*), already in main when the branch was authored, rejects. The test was never green. Switch to `admin` (matching every other test in the file). Add a focused unit test for the extracted predicate.
- CHANGELOG.md: the rebase placed the 13a entry below the #131 auto-checkpointing entry (with no separating blank line, a markdownlint MD032/MD022 failure). Move 13a to the top of [Unreleased] (it is the in-flight chunk per AGENTS.md) and restore the blank line after #131's deferral list.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The Phase 13a chart predated #128 (opaque-id minting) and forced lease-driven multi-experiment mode. This ports it to the canonical single-experiment path that compose-smoke validates:

- setup-experiment-helm.sh mints the opaque exp_* id + reserved groups (admins/orchestrators) + one wkr_* worker per service via the task-store-server (POST {"name": ...}), capturing each minted worker_id + registration_token, mirroring the Compose setup-experiment.sh. Three-phase bring-up: infra -> task-store-server -> mint identities -> app tier.
- The chart provisions each pod its minted --worker-id (was POD_NAME / literal ids) + a per-service identity Secret whose token an initContainer installs at /var/lib/eden/credentials/<worker_id>.token; the pod verifies it via /whoami. Each identity-consuming service gates on its identity.<svc>.workerId being set.
- Lease mode is now opt-in: orchestrator.leaseMode.enabled (default false). Only then is the control-plane deployed, EDEN_CONTROL_PLANE_URL set, and the experiment registered. Multi-replica HA + opaque-id reconciliation deferred behind #281. replicas.orchestrator default 1.
- Single-experiment mode now HONORS --max-quiescent-iterations 0 as a substrate override of the #157 config-wins behavior (else the orchestrator CrashLoopBackOffs on k8s).
- ci-values/ci-smoke use a valid opaque exp_* id (exp-1 is rejected by the event model post-#128). helm-lint renders both the single-exp and lease-mode stacks. helm-smoke stays enabled against the default path.
- README / docs/deployment/helm.md / CHANGELOG updated to the ported model.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Codex round on the ported chart surfaced that eden.identityEnabled gated an identity-consuming workload on workerId ALONE, while the per-service identity Secret renders only when BOTH workerId and token are set. A partial/manual values state (workerId set, token empty) would therefore render a pod referencing a Secret that never rendered. Tighten the gate to require both, matching the Secret's render condition, so the two stay consistent. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
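The inconsistency between the two render conditions can be modeled in a few lines. This is a Python illustration of the template logic, not the Helm helper itself; the function names are hypothetical.

```python
def secret_renders(worker_id: str, token: str) -> bool:
    """The identity Secret renders only when both values are set."""
    return bool(worker_id) and bool(token)


def workload_gate_old(worker_id: str, token: str) -> bool:
    """Pre-fix gate: keyed on workerId alone."""
    return bool(worker_id)


def workload_gate_new(worker_id: str, token: str) -> bool:
    """Post-fix gate: identical to the Secret's render condition."""
    return secret_renders(worker_id, token)


def dangling_secret_ref(gate, worker_id: str, token: str) -> bool:
    """True when a pod would render while the Secret it references does not."""
    return gate(worker_id, token) and not secret_renders(worker_id, token)
```

With a workerId set and an empty token, the old gate produces a dangling reference and the new gate does not, since it now renders nothing for that half-set state.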

Codex round 2 converged (no smoke-blockers) but flagged:

1. web-ui credential path mismatch. web-ui resolves its per-experiment worker token through BearerCache at <credential-dir>/<experiment_id>/<worker_id>.token (distinct from the worker hosts' flat <dir>/<worker_id>.token), and the chart passed no --credential-dir, so it used the XDG default and never saw the provisioned token (surviving only via admin-reissue fallback). Fix: the identity initContainer now installs the web-ui token at the experiment-namespaced path (eden.identityTokenPath keys on the service), and the web-ui pod gets --credential-dir /var/lib/eden/credentials. Worker hosts keep the flat path.
2. values.yaml + schema described the identity gate as workerId-only; updated to "both workerId and token", and values.schema.json now constrains each serviceIdentity to both-empty-or-both-set (oneOf), so a half-set hand-written values file fails at lint/template time.
3. Extended values.schema.json to cover experiment.configRaw, postgres.*, forgejo.*, and top-level resources/nodeSelector/tolerations/affinity.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
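The two token layouts and the both-or-neither rule can be summarized in a short Python sketch. The function names and the `"web-ui"` service key are assumptions made for illustration; in the chart the path logic is a template helper and the rule is a JSON Schema oneOf.

```python
from pathlib import PurePosixPath
from typing import Optional

CREDENTIAL_DIR = PurePosixPath("/var/lib/eden/credentials")


def identity_token_path(service: str, worker_id: str,
                        experiment_id: Optional[str] = None) -> PurePosixPath:
    """Where the initContainer would install a service's token."""
    if service == "web-ui":
        # web-ui reads an experiment-namespaced path.
        if not experiment_id:
            raise ValueError("web-ui needs an experiment id for its token path")
        return CREDENTIAL_DIR / experiment_id / f"{worker_id}.token"
    # Worker hosts read the flat layout.
    return CREDENTIAL_DIR / f"{worker_id}.token"


def service_identity_valid(worker_id: str, token: str) -> bool:
    """Both-empty-or-both-set, mirroring the schema constraint."""
    return bool(worker_id) == bool(token)
```

Keying the path on the service keeps one helper responsible for both layouts, so the install path and the path each service reads cannot drift apart independently.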


ealt enabled auto-merge (squash) June 9, 2026 14:53

ealt and others added 19 commits June 9, 2026 07:55

Record impl-stage codex rescue rounds for the #128 port

92ea7a9

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

ealt force-pushed the impl/issue-171-helm-base-chart branch from 44e4c6f to 92ea7a9 on June 9, 2026 14:58

ealt merged commit 07482b6 into main Jun 9, 2026
23 checks passed

This was referenced Jun 10, 2026

Implement: Managed (external) Postgres mode for the Helm chart (#173) #305 (Merged)

Web-UI external access: Ingress + TLS for the Helm chart #308 (Open)

Idempotent AWS provisioning script for the EKS MVP (setup-aws.sh) #309 (Closed)

ealt merged 19 commits into main from impl/issue-171-helm-base-chart
