
Implement: Helm base chart for the reference deployment (#171) #298

Merged
ealt merged 19 commits into main from impl/issue-171-helm-base-chart
Jun 9, 2026

Conversation

ealt commented Jun 9, 2026

Owner

Summary

Phase 13a: a Helm chart at reference/helm/eden/ that deploys the EDEN reference services on any conforming Kubernetes cluster (1.27+), parallel to the Compose stack. This PR ships the chart ported to the post-#128 opaque-id identity model, with the default being the single-experiment path that compose-smoke validates.

  • Post-#128 identity model (#128: disambiguate user-facing names from system-generated ids for experiments, workers, and groups). setup-experiment-helm.sh mints the opaque exp_* id, the reserved admins/orchestrators groups, and one wkr_* worker per service against the task-store-server (POST {"name": ...} → worker_id + one-time registration_token), mirroring the canonical Compose setup-experiment.sh. Each pod gets its minted --worker-id and a per-service identity Secret whose token an initContainer installs at the path the service's credential bootstrap reads (worker hosts: /var/lib/eden/credentials/<worker_id>.token; web-ui: .../<experiment_id>/<worker_id>.token). The token is verified at startup via /whoami, with no admin reissue. Three-phase bring-up (infra → task-store-server → mint identities → app tier) staged by two render gates (experiment.baseCommitSha, identity.<svc>.workerId+token).
  • Lease-driven HA is opt-in (orchestrator.leaseMode.enabled, default false) and deferred behind #281 (reconcile #128 opaque-id minting with #147's lease-mode orchestrator; the multi-experiment smoke gap). The default deploys no control plane, sets no EDEN_CONTROL_PLANE_URL, and runs one orchestrator.
  • Bundled orchestrator change (Decision 9): --max-quiescent-iterations 0 is a "never exit on quiescence" sentinel (for Kubernetes restartPolicy: Always); single-experiment mode honors 0 as a substrate override of the #157 config-wins behavior (#157: audit deployment CLI flags for promotion to experiment-config fields).
  • Reviewed to convergence over three codex:rescue rounds (record under docs/plans/review/eden-phase-13a-helm-base-chart/impl/).

Advances Phase 13a (first Kubernetes-substrate chunk). Spec ↔ impl: no spec change — the protocol is unchanged; only the deployment substrate is new. The one impl change (orchestrator quiescence sentinel) is a substrate accommodation, not a protocol change.

What this does NOT cover

Fresh-operator walkthrough

This PR changes operator-facing surfaces (the chart templates, setup-experiment-helm.sh, the web-ui --credential-dir flag).

  • The fresh-operator walkthrough is the helm-smoke CI job: it spins up a kind cluster, builds + loads the image, runs setup-experiment-helm.sh against the fixture exactly as a fresh operator would, and asserts the integration end-state (≥3 variant.integrated, ≥9 task.completed, ≥3 ideation task.completed). The local equivalent is bash reference/helm/eden/ci-smoke.sh.
  • Notes: A full local kind deploy was not run (kind not installed on the dev host); the operator path is validated by helm-smoke in CI. Locally validated: helm lint, helm template for both the single-experiment and lease-mode stacks (rendered manifests parse as valid Kubernetes objects), the schema fail-closed checks (empty image, half-set identity), and the per-service token-install paths against what each service's credential resolver reads. Watching helm-smoke through to green is the gating walkthrough for this PR.
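The end-state assertion that helm-smoke makes can be pictured with a minimal sketch. The thresholds come from the description above; the event shape (dicts with `type` and `kind` keys) is hypothetical, since the real event schema is not shown in this PR.

```python
from collections import Counter

# Thresholds quoted in the PR description for the helm-smoke end-state.
MIN_VARIANT_INTEGRATED = 3
MIN_TASK_COMPLETED = 9
MIN_IDEATION_COMPLETED = 3


def smoke_end_state_ok(events):
    """Return True when the event list meets the helm-smoke thresholds.

    `events` is a hypothetical list of dicts with a "type" key and, for task
    events, a "kind" key. The real schema may differ.
    """
    by_type = Counter(e["type"] for e in events)
    ideation_done = sum(
        1
        for e in events
        if e["type"] == "task.completed" and e.get("kind") == "ideation"
    )
    return (
        by_type["variant.integrated"] >= MIN_VARIANT_INTEGRATED
        and by_type["task.completed"] >= MIN_TASK_COMPLETED
        and ideation_done >= MIN_IDEATION_COMPLETED
    )
```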

Test plan

  • python3 scripts/check-rename-discipline.py — clean
  • python3 scripts/check-complexity.py — 0 blocking
  • uv run ruff check . — clean
  • uv run pyright — 0 errors
  • markdownlint (all tracked md) — 0 errors
  • shellcheck setup-experiment-helm.sh + ci-smoke.sh — clean
  • helm lint reference/helm/eden -f reference/helm/eden/ci-values.yaml — passes
  • helm template single-experiment + lease-mode → valid k8s; empty-image + half-set-identity correctly rejected
  • uv run pytest -q — 2245 passed, 249 skipped; 1 failure is the known sandbox os.killpg EPERM artifact (test_ideator_subprocess), unrelated to this change
  • helm-smoke CI job (the kind end-to-end) — gating; watching to green

Related issues

🤖 Generated with Claude Code

ealt enabled auto-merge (squash) June 9, 2026 14:53
ealt and others added 19 commits June 9, 2026 07:55
Add a Helm chart (reference/helm/eden/) deploying the nine reference
services on Kubernetes 1.27+, parallel to the Compose stack. StatefulSets
for clone-holding services (orchestrator, executor/evaluator hosts, web-ui)
+ Postgres/Forgejo; Deployments for the stateless services. Bundles the
orchestrator --max-quiescent-iterations 0 "never exit" sentinel (Decision 9)
so the lease-driven orchestrator runs forever on restartPolicy: Always.

Two-phase bootstrap via setup-experiment-helm.sh (infra → seed → app tier
with the seed SHA), helm-lint + helm-smoke CI jobs, operator docs.

Deferrals: helm-upgrade-smoke (#284), cross-service artifact serving (#285),
required-status bump (#286).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…me derivation

- Split postgres/forgejo pullPolicy from image.pullPolicy so kind-loaded
  app image (Never) doesn't block upstream dependency image pulls.
- setup-experiment-helm.sh seeds the task-store orchestrators group with the
  orchestrator pod worker_ids (lease-driven mode self-joins only the
  control-plane group; #254 tracks the in-orchestrator fix).
- Derive resource names from the helm fullname + honor secrets.existingSecret;
  parse image refs on the last colon (registry ports); propagate
  imagePullSecrets to the repo-init Job.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
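Splitting an image reference on the last colon, as the commit above describes, might look like the sketch below. The real logic lives in the chart and setup script; the function name and the no-tag guard are assumptions added for illustration.

```python
def split_image_ref(ref):
    """Split "repo:tag" on the LAST colon so a registry port stays in the repo."""
    repo, sep, tag = ref.rpartition(":")
    # Guard (an assumption, not stated in the PR): if the text after the last
    # colon contains "/", that colon belonged to a registry port and the
    # reference carries no tag at all.
    if not sep or "/" in tag:
        return ref, ""
    return repo, tag
```

Splitting on the first colon instead would turn `localhost:5000/eden:dev` into repository `localhost` and tag `5000/eden:dev`, which is the failure mode the commit avoids.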
…L dep, CI bucket

- Seed task-store groups (incl. orchestrator pods) BEFORE registering the
  experiment with the control plane, so baseline creation isn't 403'd in the
  lease-acquisition window.
- --set-string for image.repository/tag so numeric/bool-looking tags survive.
- Pass the experiment config verbatim via --set-file experiment.configRaw
  (new chart value); drops the script's PyYAML dependency entirely (and the
  pyyaml/setup-python CI steps).
- Add reference/compose/Dockerfile + .dockerignore to the helm changes bucket
  so helm-smoke runs when the image under test changes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…docs

- Mount clone/artifacts PVCs via subPath so the clone target is an empty
  subdir — a fresh ext4 PVC root carries lost+found and git clone --bare
  refuses a non-empty destination (orchestrator/executor/evaluator/web-ui).
- Derive a DNS-safe repo-init Job name (sanitize + sha1 suffix) so experiment
  ids with underscores/uppercase don't get rejected by kubectl.
- values.schema.json caps replicas.webUi at 1 (operator-singleton in v0).
- docs/deployment/helm.md: direct second experiments to a separate
  release/namespace instead of rewriting the release-wide experiment.id.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
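The "sanitize + sha1 suffix" derivation for the repo-init Job name can be sketched as follows, including the 63-character cap that a later commit adds. The `repo-init-` prefix, the 8-character hash length, and the function name are illustrative assumptions; the actual script is bash and may differ in detail.

```python
import hashlib
import re

DNS_LABEL_MAX = 63


def repo_init_job_name(experiment_id, prefix="repo-init-", hash_len=8):
    """Derive a DNS-safe Job name from an arbitrary experiment id."""
    # Hash the ORIGINAL id so ids that sanitize to the same slug stay distinct.
    digest = hashlib.sha1(experiment_id.encode("utf-8")).hexdigest()[:hash_len]
    # Lowercase, then collapse each run of disallowed characters to one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", experiment_id.lower()).strip("-")
    # Trim the readable part, never the hash, so the cap preserves uniqueness.
    room = DNS_LABEL_MAX - len(prefix) - len(digest) - 1
    slug = slug[:room].rstrip("-")
    return f"{prefix}{slug}-{digest}" if slug else f"{prefix}{digest}"
```

Hashing the unsanitized id is the design point: `exp_a` and `exp_A` sanitize to the same slug but still get different Job names.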
…rship gap)

kubelet creates subPath dirs root-owned AFTER the fsGroup pass, so rootless
pods can't write subPath mounts on a fresh PVC. Replace the round-3 subPath
mounts:

- Repo clones (orchestrator/executor/evaluator/web-ui): mount the PVC at a
  dedicated parent (/var/lib/eden/clone) and clone into the child repo dir
  (--repo-path .../clone/repo). fsGroup makes the volume root writable; git
  creates the empty child — solving both lost+found and the ownership gap with
  no initContainer.
- web-ui artifacts: mount at the volume root (fsGroup-writable; stray
  lost+found is harmless to artifact serving).
- Forgejo: persist only /var/lib/gitea (data PVC, no subPath); /etc/gitea is a
  fresh emptyDir regenerated from the FORGEJO__* env each boot (SECRET_KEY /
  INTERNAL_TOKEN come from the persistent Secret, so it stays consistent).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Register config_uri as file:///etc/eden/experiment-config.yaml (the mounted
  config resource per chapter 11), not the Forgejo git remote.
- docs/deployment/helm.md: reframe plain `helm install` as chart-resources-only
  (NOT a complete bootstrap) and point operators at setup-experiment-helm.sh.
- eden.fullname truncates to 40 (not 63) to reserve headroom for the longest
  per-resource suffix (-git-credential-helper, 22 chars), keeping every name
  within the 63-char DNS label limit; the setup script mirrors the truncation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
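The length budget behind the 40-character truncation is simple arithmetic, sketched here. The helper name is hypothetical, and trailing-hyphen trimming is left out for brevity.

```python
DNS_LABEL_MAX = 63   # Kubernetes DNS label limit
FULLNAME_MAX = 40    # eden.fullname truncation from the commit above
LONGEST_SUFFIX = "-git-credential-helper"  # 22 chars, the longest per-resource suffix


def resource_name(fullname, suffix):
    """Append a per-resource suffix to a fullname truncated to 40 chars."""
    return fullname[:FULLNAME_MAX] + suffix


# Worst case: a very long release name plus the longest suffix.
worst_case = resource_name("x" * 100, LONGEST_SUFFIX)
```

With these numbers the worst case is 40 + 22 = 62 characters, one under the limit. Truncating at 63 would let any suffixed name overflow.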
…ating

- setup-experiment-helm.sh preserves the existing baseCommitSha on rerun, so
  phase 1 no longer tears the live app tier down (repo-init is idempotent and
  phase 2 re-applies the same SHA).
- Ideator host (a Deployment) uses a fixed config.ideatorWorkerId (ideator-1)
  instead of $(POD_NAME): Deployment pod names can exceed the 64-char worker_id
  grammar for long release names. Executor/evaluator StatefulSets keep their
  bounded POD_NAME ids.
- Add tests/fixtures/** to the helm changes bucket so a fixture change runs the
  helm jobs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- repo-init Job seeds into /var/lib/eden/seed/repo (a child of the staging
  mount) so the repo dir is eden-owned and `git push` doesn't trip git's
  dubious-ownership check (fsGroup leaves the mount root root-owned). Mirrors
  the StatefulSet clone pattern.
- values.schema.json caps replicas.ideatorHost at 1 (operator-singleton; fixed
  worker id), matching replicas.webUi.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ap Job name

- Add checksum/secret + checksum/{experiment-config,git-credential-helper}
  rollout annotations to all 9 pod templates, so a re-run that rotates a secret
  or edits the config forces a rolling restart instead of leaving pods with
  stale envFrom values.
- Drop nameOverride support from the helpers (the setup script derives fullname
  in bash; a values-only override would silently diverge).
- Cap the repo-init Job name at 63 chars while preserving the hash suffix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
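A checksum rollout annotation works because any change to the hashed content changes the pod template, which triggers a rolling restart. The sketch below shows that property in isolation. Using SHA-256 mirrors the common Helm `sha256sum` idiom; whether this chart hashes exactly this way is an assumption.

```python
import hashlib


def checksum_annotation(rendered):
    """Hex SHA-256 of a rendered manifest, usable as an annotation value."""
    return hashlib.sha256(rendered.encode("utf-8")).hexdigest()


# Rotating a secret value changes the digest, and so the pod template.
before = checksum_annotation("POSTGRES_PASSWORD: old")
after = checksum_annotation("POSTGRES_PASSWORD: new")
```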
- Ideator Deployment uses strategy: Recreate so a rolling update doesn't run
  overlapping old+new pods sharing the one fixed --worker-id (credential race).
- CHANGELOG: link the subprocess/DooD deferral to #291 and reframe the 13b-13e
  roadmap phases as pre-planned scope (not deferrals introduced by this chunk),
  satisfying the deferral-tracking discipline.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ll hazard

- ci-smoke.sh only deletes the kind cluster it created (refuses to reuse/delete
  a pre-existing cluster of the same name).
- Document the chart-managed-Secret vs retained-PVC reinstall hazard (regenerated
  POSTGRES_PASSWORD can't auth against the retained data dir) in secret.yaml,
  the chart README, and docs/deployment/helm.md §5 — steer reinstallable
  deployments to secrets.existingSecret.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The setup script's fullname trim used s/-*$// (strips a hyphen run); Helm's
trimSuffix "-" strips a single trailing hyphen. For custom release names whose
first 40 chars end in multiple hyphens the script would target names Helm never
rendered. Use s/-$// to match exactly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
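The divergence this commit fixes is easy to reproduce in a small sketch. It assumes the Helm helper is equivalent to `trunc 40 | trimSuffix "-"`; both function names are hypothetical.

```python
FULLNAME_MAX = 40


def helm_fullname(name):
    """Mimic trunc 40 then trimSuffix "-": drop at most ONE trailing hyphen."""
    cut = name[:FULLNAME_MAX]
    return cut[:-1] if cut.endswith("-") else cut


def old_script_fullname(name):
    """Pre-fix script behaviour (sed 's/-*$//'): strip the whole hyphen run."""
    return name[:FULLNAME_MAX].rstrip("-")


# A release name whose first 40 characters end in two hyphens.
release = "a" * 38 + "--tail"
```

For `release`, `helm_fullname` keeps one trailing hyphen while `old_script_fullname` removes both, so the script would have addressed resources Helm never rendered.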
…ale-up note

- README: a pre-seeded plain `helm install` only satisfies the app-tier render
  gate — it still needs experiment registration + group bootstrap; point to
  setup-experiment-helm.sh.
- Document that scaling replicas.orchestrator up requires re-running
  setup-experiment-helm.sh to add the new pod ids to the task-store
  orchestrators group (until the in-orchestrator self-join, #254); noted in
  docs/deployment/helm.md §5 and the script's success summary.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The out-of-tree repo-init Job now renders nodeSelector/tolerations/affinity
from the release values (as JSON, which is valid YAML — keeps the script
YAML-library-free), so the seed Job schedules on the same constrained/tainted
nodes the chart pods target instead of going Pending.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… order

Three fixes surfaced by rebasing the 13a branch onto current main and
running the full validation suite (none were caught by the prior 13
codex rounds because they only emerged post-rebase / were never run):

- loop.py: the Decision-9 docstring + inline sentinel comments pushed
  run_orchestrator_loop to 105 lines, tripping the complexity gate
  (>100). Extract a named `_quiescence_exit_reached(max, quiescent)`
  predicate and co-locate the sentinel docstring with it. No behavior
  change; the function is back under threshold and the sentinel logic
  is now self-documenting (no inline comments needed).

- test_loop_unit.py: `test_loop_never_exits_on_quiescence_when_max_is_zero`
  used `updated_by="orchestrator"` / `terminated_by="orchestrator"`,
  which the #128 actor-id grammar (admin|wkr_*) — already in main when
  the branch was authored — rejects. The test was never green. Switch
  to `admin` (matching every other test in the file). Add a focused
  unit test for the extracted predicate.

- CHANGELOG.md: the rebase placed the 13a entry below the #131
  auto-checkpointing entry (with no separating blank line, a
  markdownlint MD032/MD022 failure). Move 13a to the top of
  [Unreleased] (it is the in-flight chunk per AGENTS.md) and restore
  the blank line after #131's deferral list.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The Phase 13a chart predated #128 (opaque-id minting) and forced
lease-driven multi-experiment mode. This ports it to the canonical
single-experiment path that compose-smoke validates:

- setup-experiment-helm.sh mints the opaque exp_* id + reserved groups
  (admins/orchestrators) + one wkr_* worker per service via the
  task-store-server (POST {"name": ...}), capturing each minted
  worker_id + registration_token, mirroring the Compose
  setup-experiment.sh. Three-phase bring-up: infra -> task-store-server
  -> mint identities -> app tier.
- The chart provisions each pod its minted --worker-id (was POD_NAME /
  literal ids) + a per-service identity Secret whose token an
  initContainer installs at /var/lib/eden/credentials/<worker_id>.token;
  the pod verifies it via /whoami. Each identity-consuming service gates
  on its identity.<svc>.workerId being set.
- Lease mode is now opt-in: orchestrator.leaseMode.enabled (default
  false). Only then is the control-plane deployed, EDEN_CONTROL_PLANE_URL
  set, and the experiment registered. Multi-replica HA + opaque-id
  reconciliation deferred behind #281. replicas.orchestrator default 1.
- Single-experiment mode now HONORS --max-quiescent-iterations 0 as a
  substrate override of the #157 config-wins behavior (else the
  orchestrator CrashLoopBackOffs on k8s).
- ci-values/ci-smoke use a valid opaque exp_* id (exp-1 is rejected by
  the event model post-#128). helm-lint renders both the single-exp and
  lease-mode stacks. helm-smoke stays enabled against the default path.
- README / docs/deployment/helm.md / CHANGELOG updated to the ported model.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Codex round on the ported chart surfaced that eden.identityEnabled
gated an identity-consuming workload on workerId ALONE, while the
per-service identity Secret renders only when BOTH workerId and token
are set. A partial/manual values state (workerId set, token empty) would
therefore render a pod referencing a Secret that never rendered. Tighten
the gate to require both, matching the Secret's render condition, so the
two stay consistent.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
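The inconsistency described above can be shown with three small predicates. These are Python stand-ins for the chart's template conditions, not the templates themselves, and the function names are hypothetical.

```python
def identity_secret_renders(worker_id, token):
    """The per-service identity Secret renders only when BOTH values are set."""
    return bool(worker_id) and bool(token)


def identity_enabled_old(worker_id, token):
    """Pre-fix workload gate: keyed on workerId alone."""
    return bool(worker_id)


def identity_enabled(worker_id, token):
    """Post-fix workload gate: same condition as the Secret."""
    return bool(worker_id) and bool(token)


def dangling_secret_ref(gate, worker_id, token):
    """True when a pod would render while the Secret it references would not."""
    return gate(worker_id, token) and not identity_secret_renders(worker_id, token)
```

With a half-set state such as `worker_id="wkr_example"` and `token=""` (placeholder values), the old gate yields a dangling reference and the tightened gate does not.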
Codex round 2 converged (no smoke-blockers) but flagged:

1. web-ui credential path mismatch. web-ui resolves its per-experiment
   worker token through BearerCache at
   <credential-dir>/<experiment_id>/<worker_id>.token (distinct from the
   worker hosts' flat <dir>/<worker_id>.token), and the chart passed no
   --credential-dir, so it used the XDG default and never saw the
   provisioned token (surviving only via admin-reissue fallback). Fix:
   the identity initContainer now installs the web-ui token at the
   experiment-namespaced path (eden.identityTokenPath keys on the
   service), and the web-ui pod gets --credential-dir
   /var/lib/eden/credentials. Worker hosts keep the flat path.
2. values.yaml + schema described the identity gate as workerId-only;
   updated to "both workerId and token", and values.schema.json now
   constrains each serviceIdentity to both-empty-or-both-set (oneOf), so
   a half-set hand-written values file fails at lint/template time.
3. Extended values.schema.json to cover experiment.configRaw, postgres.*,
   forgejo.*, and top-level resources/nodeSelector/tolerations/affinity.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
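The two token layouts and the both-or-neither schema rule from this round can be sketched together. The paths follow the commit text; the `"web-ui"` service key, the function names, and the ids in the usage note are placeholders.

```python
from pathlib import PurePosixPath

CREDENTIAL_DIR = PurePosixPath("/var/lib/eden/credentials")


def identity_token_path(service, experiment_id, worker_id):
    """Where the identity initContainer installs a service's token."""
    if service == "web-ui":
        # web-ui resolves tokens at <credential-dir>/<experiment_id>/<worker_id>.token
        return CREDENTIAL_DIR / experiment_id / f"{worker_id}.token"
    # Worker hosts keep the flat <dir>/<worker_id>.token layout.
    return CREDENTIAL_DIR / f"{worker_id}.token"


def service_identity_valid(worker_id, token):
    """Schema rule: workerId and token are both empty or both set."""
    return bool(worker_id) == bool(token)
```

For example, `identity_token_path("web-ui", "exp_example", "wkr_example")` gives `/var/lib/eden/credentials/exp_example/wkr_example.token`, while any other service gets the flat path.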
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ealt force-pushed the impl/issue-171-helm-base-chart branch from 44e4c6f to 92ea7a9 June 9, 2026 14:58
ealt merged commit 07482b6 into main Jun 9, 2026
23 checks passed

Development

Successfully merging this pull request may close these issues.

Phase 13a — Helm base chart for the reference deployment

1 participant