Implement: S3/GCS blob backend (Phase 13d, #174)#304
Merged
Conversation
S3Backend (boto3) + GcsBackend (google-cloud-storage) implement the #166 ArtifactBackend Protocol as eden-storage optional extras; the task-store-server gains --blob-backend file|s3|gcs + per-backend flags via a new build_artifact_backend factory; the Helm chart gains the blob.* values block (default file = chart-managed keep-PVC; s3/gcs = operator bucket + exactly-one auth path enforced in values.schema.json); helm-lint CI renders all three modes + fail-closed negatives; operator runbook at docs/deployment/migrating-to-blob-backend.md. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ly write, replica cap - S3Backend.load maps only NoSuchKey to NotFound (NoSuchBucket and other 404-shaped deployment errors propagate); GcsBackend.load disambiguates bucket-level NotFound via bucket.exists() on the absent path. - S3Backend.store uses IfNoneMatch='*' conditional write (412 -> FileExistsError), replacing HEAD-then-PUT; boto3 floor bumped to 1.36; StreamingBody closed after read. - values.schema.json caps replicas.taskStoreServer at 1. - Regression tests for all of the above; runbook + CHANGELOG updated; round-0 review record committed. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…flict retry - GcsBackend.load catches Forbidden from bucket.exists() (object-only IAM roles lack storage.buckets.get) and falls back to NotFound; runbook recommends granting buckets.get for misconfig diagnostics. - S3Backend.store retries once on 409 ConditionalRequestConflict. - Regression tests for both; round-1 review record committed. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
ealt
added a commit
that referenced
this pull request
Jun 10, 2026
This was referenced Jun 10, 2026
Open
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
S3Backend(boto3; AWS S3 + any S3-compatible service, e.g. MinIO viaendpoint_url) andGcsBackend(google-cloud-storage) implementations of the Wire-level artifact upload endpoint (POST/GET /v0/experiments/<id>/artifacts) for distributed deployments #166ArtifactBackendProtocol, aseden-storageoptional extras so plain installs pull in neither SDK.--blob-backend file|s3|gcs+ per-backend flags via a newbuild_artifact_backendfactory (required-bucket validation, stray-flag rejection, the Wire-level artifact upload endpoint (POST/GET /v0/experiments/<id>/artifacts) for distributed deployments #166 blob-dir/artifacts-dir overlap check). Credentials never reach argv — SDK default chains only (IRSA / instance profile / Workload Identity / env /GOOGLE_APPLICATION_CREDENTIALS).blob.*values block: defaultfilenow provisions a keep-annotated PVC (upgrading k8s deposits from the non-durable in-memory fallback to durable-by-default, no operator values required);s3/gcsrequire an operator bucket + exactly one auth path, enforced invalues.schema.jsonat lint time (the 13a/13c no-fictional-defaults posture). Operator runbook:docs/deployment/migrating-to-blob-backend.md.Plan-supersession note (the load-bearing scope call). The 13d plan pre-dates #166. Under #166 the backend is server-side only, keyed by the server-minted opaque id, and the wire URI is always
eden://artifacts/<id>— so the plan'seden-blobpackage,s3:///gs://client-facing URI schemes, per-worker-host flags, evaluator-pod bucket credentials, andCompositeBackendmigration mode are structurally unnecessary, not deferred. Only the task-store-server pod holds bucket credentials; legacyfile://rows keep resolving through the unchanged--artifacts-dirpath regardless of backend (no row rewrite). Full narration in the CHANGELOG entry. Chart naming follows the merged plan'sblob.backend ∈ {file, s3, gcs}(matchesFileArtifactBackend+ the CLI enum; glossary no-synonyms rule) rather than the operator-brief's tentativeblobBackend.mode={localfs,…}spelling.No-overwrite + 404 narrowing (codex-driven). S3 store is an atomic
IfNoneMatch="*"conditional write (412 →FileExistsError; one retry on 409ConditionalRequestConflict; boto3 floor bumped to ≥1.36); GCS usesif_generation_match=0. S3loadmaps onlyNoSuchKeytoNotFound(bucket-level 404s propagate as deployment errors); GCS disambiguates bucket-levelNotFoundviabucket.exists()with aForbiddenfallback for least-privilege roles.Conflict awareness: all chart changes are additive (new top-level
blob:block, two new template files, arg insertions, one new schema property +allOf) so the in-flight 13c managed-Postgres PR (#173) rebases mechanically whichever lands second.What this does NOT cover
helm-smoke-blob-s3against an in-cluster MinIO, plan §3.10 / Decision 8) — helm-lint renders all three modes + fail-closed negatives, and the backends are unit-tested against duck-typed fakes raising the real SDK exception classes, but no real S3 wire runs in CI. Tracked in helm-smoke-blob-s3: end-to-end S3-wire smoke against an in-cluster MinIO #302.Subprocess.terminate()os.killpgraises EPERM on macOS for just-exited children (3/10 flake rate ontest_dispatch_collects_ideas, byte-identical code tomain, ubuntu CI unaffected). Root-caused + filed as Flaky on macOS: Subprocess.terminate() os.killpg raises PermissionError (EPERM) — test_dispatch_collects_ideas #303; not fixed here (outside 13d scope, lives ineden-service-common).Fresh-operator walkthrough
--blob-backend s3without bucket fails at startup with--blob-s3-bucket is required with --blob-backend s3.; stray--blob-s3-bucketunder file mode fails with--blob-s3-bucket require(s) --blob-backend s3 (got 'file').; configured s3 mode starts clean and announcesEDEN_TASK_STORE_LISTENING; default file mode without a dir starts with the loudNON-DURABLE in-memory backendwarning; file mode with--artifact-blob-dirstarts clean with no warning. Helm — all three modes render (helm templateverified for file / s3+IRSA / s3+static / gcs+WI / gcs+key);blob.backend=s3without bucket, missing auth, both-auth-paths, andblob.backend=localfs(invalid enum) all fail closed at template time. Passed cleanly; no issues filed.Test plan
uv run ruff check .— cleanuv run pyright— 0 errorsuv run pytest -q— 2334 passed, 254 skipped, 1 pre-existing tracked flake (Flaky on macOS: Subprocess.terminate() os.killpg raises PermissionError (EPERM) — test_dispatch_collects_ideas #303; passes on re-run)npx --yes markdownlint-cli2@0.14.0 …— 0 errorspython3 scripts/check-complexity.py/check-rename-discipline.py/spec-xref-check.py— cleanhelm lint reference/helm/eden -f reference/helm/eden/ci-values.yaml+ all-modehelm templaterenders + fail-closed negatives — cleanbash reference/compose/healthcheck/smoke.sh— PASS (default file posture unchanged)docs/plans/review/eden-phase-13d-blob-backend/impl/20260609T232640/): round 0 fix-then-ship (7 findings, all fixed), round 1 verified + 2 new (fixed), round 2 ship.Related issues
helm-smoke-blob-s3(filed this chunk)killpgEPERM flake (filed this chunk)🤖 Generated with Claude Code