Implement: Helm base chart for the reference deployment (#171) #298
Merged
Add a Helm chart (reference/helm/eden/) deploying the nine reference services on Kubernetes 1.27+, parallel to the Compose stack. StatefulSets for clone-holding services (orchestrator, executor/evaluator hosts, web-ui) + Postgres/Forgejo; Deployments for the stateless services. Bundles the orchestrator --max-quiescent-iterations 0 "never exit" sentinel (Decision 9) so the lease-driven orchestrator runs forever on restartPolicy: Always. Two-phase bootstrap via setup-experiment-helm.sh (infra → seed → app tier with the seed SHA), helm-lint + helm-smoke CI jobs, operator docs. Deferrals: helm-upgrade-smoke (#284), cross-service artifact serving (#285), required-status bump (#286). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…me derivation
- Split postgres/forgejo pullPolicy from image.pullPolicy so the kind-loaded app image (Never) doesn't block upstream dependency image pulls.
- setup-experiment-helm.sh seeds the task-store orchestrators group with the orchestrator pod worker_ids (lease-driven mode self-joins only the control-plane group; #254 tracks the in-orchestrator fix).
- Derive resource names from the helm fullname + honor secrets.existingSecret; parse image refs on the last colon (registry ports); propagate imagePullSecrets to the repo-init Job.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
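The "parse image refs on the last colon" fix can be illustrated with a small sketch. The setup script itself is bash; this Python version only demonstrates the rule. The function name and the example refs are hypothetical, and the extra check for a ref that has a registry port but no tag is an assumption of this sketch, not something the commit states. Digest refs (`@sha256:...`) are out of scope here.

```python
def split_image_ref(ref: str) -> tuple[str, str]:
    """Split an image ref into (repository, tag) on the LAST colon.

    Splitting on the first colon would cut 'localhost:5000/eden:dev' at the
    registry port; splitting on the last colon keeps the port in the repository.
    """
    repo, sep, tag = ref.rpartition(":")
    # Assumption: if there is no colon, or the text after the last colon still
    # contains '/', the colon was a registry port and the ref carries no tag.
    if not sep or "/" in tag:
        return ref, ""
    return repo, tag


print(split_image_ref("localhost:5000/eden/reference:dev"))
print(split_image_ref("eden/reference:1.0"))
print(split_image_ref("localhost:5000/eden/reference"))
```

The two halves would then be passed separately, which fits the later commit that switches to `--set-string` for `image.repository`/`image.tag`.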
…L dep, CI bucket
- Seed task-store groups (incl. orchestrator pods) BEFORE registering the experiment with the control plane, so baseline creation isn't 403'd in the lease-acquisition window.
- --set-string for image.repository/tag so numeric/bool-looking tags survive.
- Pass the experiment config verbatim via --set-file experiment.configRaw (new chart value); drops the script's PyYAML dependency entirely (and the pyyaml/setup-python CI steps).
- Add reference/compose/Dockerfile + .dockerignore to the helm changes bucket so helm-smoke runs when the image under test changes.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…docs
- Mount clone/artifacts PVCs via subPath so the clone target is an empty subdir: a fresh ext4 PVC root carries lost+found and git clone --bare refuses a non-empty destination (orchestrator/executor/evaluator/web-ui).
- Derive a DNS-safe repo-init Job name (sanitize + sha1 suffix) so experiment ids with underscores/uppercase don't get rejected by kubectl.
- values.schema.json caps replicas.webUi at 1 (operator-singleton in v0).
- docs/deployment/helm.md: direct second experiments to a separate release/namespace instead of rewriting the release-wide experiment.id.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
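A minimal sketch of the "sanitize + sha1 suffix" idea for the repo-init Job name, written in Python for illustration (the real derivation lives in the bash setup script). The `eden-repo-init` prefix, the 8-character digest length, and the function name are assumptions. The 63-character cap that keeps the hash suffix intact follows a later commit in this PR.

```python
import hashlib
import re

DNS_LABEL_LIMIT = 63  # maximum length of a DNS label


def dns_safe_job_name(prefix: str, experiment_id: str, limit: int = DNS_LABEL_LIMIT) -> str:
    """Build a lowercase, hyphen-only name that stays unique per experiment id."""
    # The digest is taken from the ORIGINAL id, so ids that sanitize to the
    # same slug (e.g. 'exp_a' and 'exp-a') still get different names.
    digest = hashlib.sha1(experiment_id.encode("utf-8")).hexdigest()[:8]
    # Lowercase, then collapse every run of disallowed characters to one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", experiment_id.lower()).strip("-")
    head = f"{prefix}-{slug}" if slug else prefix
    # Truncate the head only, so the hash suffix always survives the cap.
    head = head[: limit - len(digest) - 1].rstrip("-")
    return f"{head}-{digest}"


print(dns_safe_job_name("eden-repo-init", "exp_My_Run"))
print(len(dns_safe_job_name("eden-repo-init", "exp_" + "X" * 100)))
```

Truncating before appending the digest is the design point: cutting the finished string instead could remove the part that makes the name unique.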
…rship gap)
kubelet creates subPath dirs root-owned AFTER the fsGroup pass, so rootless pods can't write subPath mounts on a fresh PVC. Replace the round-3 subPath mounts:
- Repo clones (orchestrator/executor/evaluator/web-ui): mount the PVC at a dedicated parent (/var/lib/eden/clone) and clone into the child repo dir (--repo-path .../clone/repo). fsGroup makes the volume root writable; git creates the empty child, solving both lost+found and the ownership gap with no initContainer.
- web-ui artifacts: mount at the volume root (fsGroup-writable; stray lost+found is harmless to artifact serving).
- Forgejo: persist only /var/lib/gitea (data PVC, no subPath); /etc/gitea is a fresh emptyDir regenerated from the FORGEJO__* env each boot (SECRET_KEY / INTERNAL_TOKEN come from the persistent Secret, so it stays consistent).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Register config_uri as file:///etc/eden/experiment-config.yaml (the mounted config resource per chapter 11), not the Forgejo git remote.
- docs/deployment/helm.md: reframe plain `helm install` as chart-resources-only (NOT a complete bootstrap) and point operators at setup-experiment-helm.sh.
- eden.fullname truncates to 40 (not 63) to reserve headroom for the longest per-resource suffix (-git-credential-helper, 22 chars), keeping every name within the 63-char DNS label limit; the setup script mirrors the truncation.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ating
- setup-experiment-helm.sh preserves the existing baseCommitSha on rerun, so phase 1 no longer tears the live app tier down (repo-init is idempotent and phase 2 re-applies the same SHA).
- Ideator host (a Deployment) uses a fixed config.ideatorWorkerId (ideator-1) instead of $(POD_NAME): Deployment pod names can exceed the 64-char worker_id grammar for long release names. Executor/evaluator StatefulSets keep their bounded POD_NAME ids.
- Add tests/fixtures/** to the helm changes bucket so a fixture change runs the helm jobs.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- repo-init Job seeds into /var/lib/eden/seed/repo (a child of the staging mount) so the repo dir is eden-owned and `git push` doesn't trip git's dubious-ownership check (fsGroup leaves the mount root root-owned). Mirrors the StatefulSet clone pattern.
- values.schema.json caps replicas.ideatorHost at 1 (operator-singleton; fixed worker id), matching replicas.webUi.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ap Job name
- Add checksum/secret + checksum/{experiment-config,git-credential-helper}
rollout annotations to all 9 pod templates, so a re-run that rotates a secret
or edits the config forces a rolling restart instead of leaving pods with
stale envFrom values.
- Drop nameOverride support from the helpers (the setup script derives fullname
in bash; a values-only override would silently diverge).
- Cap the repo-init Job name at 63 chars while preserving the hash suffix.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
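The checksum rollout annotations work because any change to a rendered Secret or config changes a hash stored in the pod template, and a changed pod template triggers a rolling restart. A rough Python sketch of that mechanism follows; the use of SHA-256, the function name, and the sample contents are assumptions for illustration, since the commit names the annotation keys but not the hash.

```python
import hashlib


def rollout_checksums(rendered: dict[str, str]) -> dict[str, str]:
    """Map each rendered resource body to a 'checksum/<name>' annotation value."""
    return {
        f"checksum/{name}": hashlib.sha256(body.encode("utf-8")).hexdigest()
        for name, body in rendered.items()
    }


# Hypothetical rendered bodies before and after a secret rotation.
before = rollout_checksums({"secret": "POSTGRES_PASSWORD: old", "experiment-config": "name: demo"})
after = rollout_checksums({"secret": "POSTGRES_PASSWORD: new", "experiment-config": "name: demo"})

# Only the annotation for the rotated secret differs, so only that change
# alters the pod template.
changed = sorted(key for key in before if before[key] != after[key])
print(changed)  # ['checksum/secret']
```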
- Ideator Deployment uses strategy: Recreate so a rolling update doesn't run overlapping old+new pods sharing the one fixed --worker-id (credential race).
- CHANGELOG: link the subprocess/DooD deferral to #291 and reframe the 13b-13e roadmap phases as pre-planned scope (not deferrals introduced by this chunk), satisfying the deferral-tracking discipline.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ll hazard
- ci-smoke.sh only deletes the kind cluster it created (refuses to reuse/delete a pre-existing cluster of the same name).
- Document the chart-managed-Secret vs retained-PVC reinstall hazard (regenerated POSTGRES_PASSWORD can't auth against the retained data dir) in secret.yaml, the chart README, and docs/deployment/helm.md §5; steer reinstallable deployments to secrets.existingSecret.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The setup script's fullname trim used s/-*$// (strips a hyphen run); Helm's trimSuffix "-" strips a single trailing hyphen. For custom release names whose first 40 chars end in multiple hyphens the script would target names Helm never rendered. Use s/-$// to match exactly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
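The mismatch is easy to reproduce. The sketch below models both behaviours in Python: "truncate to 40, then drop at most one trailing hyphen" (what the commit says Helm's `trimSuffix "-"` does) against "truncate to 40, then strip the whole hyphen run" (the old `s/-*$//`). The function names and the sample release name are hypothetical, and the exact template pipeline is assumed rather than quoted from the chart.

```python
FULLNAME_LIMIT = 40                       # eden.fullname truncation, per the earlier commit
LONGEST_SUFFIX = "-git-credential-helper" # longest per-resource suffix, 22 chars

# Name budget: 40 + 22 = 62, which fits the 63-char DNS label limit.
assert FULLNAME_LIMIT + len(LONGEST_SUFFIX) <= 63


def helm_fullname(name: str) -> str:
    """Truncate, then remove a single trailing hyphen (trimSuffix "-" semantics)."""
    cut = name[:FULLNAME_LIMIT]
    return cut[:-1] if cut.endswith("-") else cut


def old_script_fullname(name: str) -> str:
    """Truncate, then strip the entire trailing hyphen run (sed 's/-*$//')."""
    return name[:FULLNAME_LIMIT].rstrip("-")


release = "a" * 38 + "---tail"  # the first 40 chars end in two hyphens
print(helm_fullname(release))        # 38 a's followed by one hyphen
print(old_script_fullname(release))  # 38 a's, no hyphen
```

For names that end in zero or one hyphen after truncation the two functions agree, which is why the divergence only appears for unusual custom release names.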
…ale-up note
- README: a pre-seeded plain `helm install` only satisfies the app-tier render gate; it still needs experiment registration + group bootstrap. Point to setup-experiment-helm.sh.
- Document that scaling replicas.orchestrator up requires re-running setup-experiment-helm.sh to add the new pod ids to the task-store orchestrators group (until the in-orchestrator self-join, #254); noted in docs/deployment/helm.md §5 and the script's success summary.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The out-of-tree repo-init Job now renders nodeSelector/tolerations/affinity from the release values (as JSON, which is valid YAML — keeps the script YAML-library-free), so the seed Job schedules on the same constrained/tainted nodes the chart pods target instead of going Pending. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… order
Three fixes surfaced by rebasing the 13a branch onto current main and running the full validation suite (none were caught by the prior 13 codex rounds because they only emerged post-rebase / were never run):
- loop.py: the Decision-9 docstring + inline sentinel comments pushed run_orchestrator_loop to 105 lines, tripping the complexity gate (>100). Extract a named `_quiescence_exit_reached(max, quiescent)` predicate and co-locate the sentinel docstring with it. No behavior change; the function is back under threshold and the sentinel logic is now self-documenting (no inline comments needed).
- test_loop_unit.py: `test_loop_never_exits_on_quiescence_when_max_is_zero` used `updated_by="orchestrator"` / `terminated_by="orchestrator"`, which the #128 actor-id grammar (admin|wkr_*), already in main when the branch was authored, rejects. The test was never green. Switch to `admin` (matching every other test in the file). Add a focused unit test for the extracted predicate.
- CHANGELOG.md: the rebase placed the 13a entry below the #131 auto-checkpointing entry (with no separating blank line, a markdownlint MD032/MD022 failure). Move 13a to the top of [Unreleased] (it is the in-flight chunk per AGENTS.md) and restore the blank line after #131's deferral list.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The Phase 13a chart predated #128 (opaque-id minting) and forced lease-driven multi-experiment mode. This ports it to the canonical single-experiment path that compose-smoke validates:
- setup-experiment-helm.sh mints the opaque exp_* id + reserved groups (admins/orchestrators) + one wkr_* worker per service via the task-store-server (POST {"name": ...}), capturing each minted worker_id + registration_token, mirroring the Compose setup-experiment.sh. Three-phase bring-up: infra -> task-store-server -> mint identities -> app tier.
- The chart provisions each pod its minted --worker-id (was POD_NAME / literal ids) + a per-service identity Secret whose token an initContainer installs at /var/lib/eden/credentials/<worker_id>.token; the pod verifies it via /whoami. Each identity-consuming service gates on its identity.<svc>.workerId being set.
- Lease mode is now opt-in: orchestrator.leaseMode.enabled (default false). Only then is the control-plane deployed, EDEN_CONTROL_PLANE_URL set, and the experiment registered. Multi-replica HA + opaque-id reconciliation deferred behind #281. replicas.orchestrator default 1.
- Single-experiment mode now HONORS --max-quiescent-iterations 0 as a substrate override of the #157 config-wins behavior (else the orchestrator CrashLoopBackOffs on k8s).
- ci-values/ci-smoke use a valid opaque exp_* id (exp-1 is rejected by the event model post-#128). helm-lint renders both the single-exp and lease-mode stacks. helm-smoke stays enabled against the default path.
- README / docs/deployment/helm.md / CHANGELOG updated to the ported model.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Codex round on the ported chart surfaced that eden.identityEnabled gated an identity-consuming workload on workerId ALONE, while the per-service identity Secret renders only when BOTH workerId and token are set. A partial/manual values state (workerId set, token empty) would therefore render a pod referencing a Secret that never rendered. Tighten the gate to require both, matching the Secret's render condition, so the two stay consistent. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
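The inconsistency can be shown with two small predicates. This Python sketch models the render conditions described in the commit (the chart itself expresses them as Helm template helpers); all function names and the sample values are hypothetical.

```python
def secret_renders(worker_id: str, token: str) -> bool:
    """The identity Secret renders only when BOTH values are set."""
    return bool(worker_id) and bool(token)


def old_gate(worker_id: str, token: str) -> bool:
    """Earlier workload gate: workerId alone."""
    return bool(worker_id)


def new_gate(worker_id: str, token: str) -> bool:
    """Tightened gate: same condition as the Secret."""
    return bool(worker_id) and bool(token)


def dangling_secret_ref(gate, worker_id: str, token: str) -> bool:
    """True when a pod would render and reference a Secret that did not render."""
    return gate(worker_id, token) and not secret_renders(worker_id, token)


# The partial/manual state from the commit: workerId set, token empty.
print(dangling_secret_ref(old_gate, "wkr_demo", ""))  # True
print(dangling_secret_ref(new_gate, "wkr_demo", ""))  # False
```

With the tightened gate, no combination of the two values can produce a pod without its Secret, because both conditions are now the same expression.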
Codex round 2 converged (no smoke-blockers) but flagged:
1. web-ui credential path mismatch. web-ui resolves its per-experiment worker token through BearerCache at <credential-dir>/<experiment_id>/<worker_id>.token (distinct from the worker hosts' flat <dir>/<worker_id>.token), and the chart passed no --credential-dir, so it used the XDG default and never saw the provisioned token (surviving only via admin-reissue fallback). Fix: the identity initContainer now installs the web-ui token at the experiment-namespaced path (eden.identityTokenPath keys on the service), and the web-ui pod gets --credential-dir /var/lib/eden/credentials. Worker hosts keep the flat path.
2. values.yaml + schema described the identity gate as workerId-only; updated to "both workerId and token", and values.schema.json now constrains each serviceIdentity to both-empty-or-both-set (oneOf), so a half-set hand-written values file fails at lint/template time.
3. Extended values.schema.json to cover experiment.configRaw, postgres.*, forgejo.*, and top-level resources/nodeSelector/tolerations/affinity.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
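The two token layouts and the both-or-neither rule can be sketched as follows. This is an illustrative Python model, not the chart's `eden.identityTokenPath` helper or the JSON schema itself; the `"web-ui"` service key, the function names, and the example ids are assumptions.

```python
from pathlib import PurePosixPath
from typing import Optional

CREDENTIAL_DIR = PurePosixPath("/var/lib/eden/credentials")


def identity_token_path(service: str, worker_id: str, experiment_id: Optional[str] = None) -> PurePosixPath:
    """Return where the initContainer should install a service's token."""
    if service == "web-ui":
        # web-ui reads <credential-dir>/<experiment_id>/<worker_id>.token
        if not experiment_id:
            raise ValueError("web-ui token path needs an experiment id")
        return CREDENTIAL_DIR / experiment_id / f"{worker_id}.token"
    # Worker hosts read the flat <credential-dir>/<worker_id>.token
    return CREDENTIAL_DIR / f"{worker_id}.token"


def identity_is_consistent(worker_id: str, token: str) -> bool:
    """Both-empty-or-both-set, the condition the schema's oneOf enforces."""
    return bool(worker_id) == bool(token)


print(identity_token_path("executor", "wkr_demo"))
print(identity_token_path("web-ui", "wkr_demo", "exp_demo"))
print(identity_is_consistent("wkr_demo", ""))  # False: half-set values are rejected
```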
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
44e4c6f to 92ea7a9
This was referenced Jun 10, 2026
Summary
Phase 13a: a Helm chart at `reference/helm/eden/` that deploys the EDEN reference services on any conforming Kubernetes cluster (1.27+), parallel to the Compose stack. This PR ships the chart ported to the post-#128 opaque-id identity model, with the default being the single-experiment path that `compose-smoke` validates.
- `setup-experiment-helm.sh` mints the opaque `exp_*` id + reserved `admins`/`orchestrators` groups + one `wkr_*` worker per service against the task-store-server (`POST {"name": ...}` → `worker_id` + one-time `registration_token`), mirroring the canonical Compose `setup-experiment.sh`. Each pod gets its minted `--worker-id` and a per-service identity Secret whose token an initContainer installs at the path the service's credential bootstrap reads (worker hosts: `/var/lib/eden/credentials/<worker_id>.token`; web-ui: `.../<experiment_id>/<worker_id>.token`), verified at startup via `/whoami`, with no admin reissue. Three-phase bring-up (infra → task-store-server → mint identities → app tier) staged by two render gates (`experiment.baseCommitSha`, `identity.<svc>.workerId` + `token`).
- Lease mode is opt-in (`orchestrator.leaseMode.enabled`, default `false`) and deferred behind #281 ("Reconcile #128 opaque-id minting with #147's lease-mode orchestrator (multi-experiment smoke gap)"); the default deploys no control plane, sets no `EDEN_CONTROL_PLANE_URL`, and runs one orchestrator.
- Bundles the orchestrator `--max-quiescent-iterations 0` "never exit on quiescence" sentinel (Kubernetes `restartPolicy: Always`); single-experiment mode honors `0` as a substrate override of the #157 ("Audit deployment CLI flags for promotion to experiment-config fields") config-wins behavior.
- … `codex:rescue` rounds (record under `docs/plans/review/eden-phase-13a-helm-base-chart/impl/`).
Advances Phase 13a (first Kubernetes-substrate chunk). Spec ↔ impl: no spec change. The protocol is unchanged; only the deployment substrate is new. The one impl change (orchestrator quiescence sentinel) is a substrate accommodation, not a protocol change.
What this does NOT cover
- … `helm-smoke`.
- `helm-upgrade-smoke` CI job (plan §6.3): deferred to #284 ("helm-upgrade-smoke CI job (Phase 13a §6.3 deferral)").
- `helm-lint`/`helm-smoke` required-status bump: deferred to #286 ("Bump helm-lint + helm-smoke to required status checks"), following the newly-added-smoke-job convention: bump after ~2 weeks clean on main.
- `--mode subprocess` + DooD worker hosts: 13a ships only `--mode scripted`; deferred to #291 ("Kubernetes-native subprocess + DooD worker hosts").
Fresh-operator walkthrough
This PR changes operator-facing surfaces (the chart templates, `setup-experiment-helm.sh`, the web-ui `--credential-dir` flag).
- … `helm-smoke` CI job: it spins up a kind cluster, builds + loads the image, runs `setup-experiment-helm.sh` against the fixture exactly as a fresh operator would, and asserts the integration end-state (≥3 `variant.integrated`, ≥9 `task.completed`, ≥3 ideation `task.completed`). The local equivalent is `bash reference/helm/eden/ci-smoke.sh`.
- … `helm-smoke` in CI. Locally validated: `helm lint`, `helm template` for both the single-experiment and lease-mode stacks (rendered manifests parse as valid Kubernetes objects), the schema fail-closed checks (empty image, half-set identity), and the per-service token-install paths against what each service's credential resolver reads. Watching `helm-smoke` through to green is the gating walkthrough for this PR.
Test plan
- `python3 scripts/check-rename-discipline.py`: clean
- `python3 scripts/check-complexity.py`: 0 blocking
- `uv run ruff check .`: clean
- `uv run pyright`: 0 errors
- `setup-experiment-helm.sh` + `ci-smoke.sh`: clean
- `helm lint reference/helm/eden -f reference/helm/eden/ci-values.yaml`: passes
- `helm template` single-experiment + lease-mode → valid k8s; empty-image + half-set-identity correctly rejected
- `uv run pytest -q`: 2245 passed, 249 skipped; 1 failure is the known sandbox `os.killpg` EPERM artifact (`test_ideator_subprocess`), unrelated to this change
- `helm-smoke` CI job (the kind end-to-end): gating; watching to green
Related issues
🤖 Generated with Claude Code