
Implement: S3/GCS blob backend (Phase 13d, #174) #304

Merged
ealt merged 4 commits into main from impl/issue-174-blob-backend
Jun 10, 2026


@ealt ealt commented Jun 10, 2026

Owner

Summary

  • Ships Phase 13d (the fourth Phase-13 substrate chunk): S3Backend (boto3; AWS S3 plus any S3-compatible service, e.g. MinIO via endpoint_url) and GcsBackend (google-cloud-storage) implement the #166 ArtifactBackend Protocol (#166: Wire-level artifact upload endpoint (POST/GET /v0/experiments/<id>/artifacts) for distributed deployments). Both ship as eden-storage optional extras, so plain installs pull in neither SDK.
  • Task-store-server gains --blob-backend file|s3|gcs + per-backend flags via a new build_artifact_backend factory (required-bucket validation, stray-flag rejection, the #166 blob-dir/artifacts-dir overlap check). Credentials never reach argv: SDK default chains only (IRSA / instance profile / Workload Identity / env / GOOGLE_APPLICATION_CREDENTIALS).
  • The 13a Helm chart gains a blob.* values block: default file now provisions a keep-annotated PVC (upgrading k8s deployments from the non-durable in-memory fallback to durable-by-default, no operator values required); s3/gcs require an operator bucket + exactly one auth path, enforced in values.schema.json at lint time (the 13a/13c no-fictional-defaults posture). Operator runbook: docs/deployment/migrating-to-blob-backend.md.

Plan-supersession note (the load-bearing scope call). The 13d plan pre-dates #166. Under #166 the backend is server-side only, keyed by the server-minted opaque id, and the wire URI is always eden://artifacts/<id>. The plan's eden-blob package, s3:// and gs:// client-facing URI schemes, per-worker-host flags, evaluator-pod bucket credentials, and CompositeBackend migration mode are therefore structurally unnecessary, not deferred. Only the task-store-server pod holds bucket credentials; legacy file:// rows keep resolving through the unchanged --artifacts-dir path regardless of backend (no row rewrite). Full narration in the CHANGELOG entry. Chart naming follows the merged plan's blob.backend ∈ {file, s3, gcs} (matches FileArtifactBackend + the CLI enum; glossary no-synonyms rule) rather than the operator-brief's tentative blobBackend.mode={localfs,…} spelling.
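As a rough illustration of the routing described above, the sketch below sorts a stored URI into "resolve through the configured backend" or "resolve through the legacy artifacts dir". The function name classify_artifact_uri and the returned labels are hypothetical and are not taken from the eden codebase; only the eden://artifacts/<id> and file:// shapes come from this PR.

```python
from urllib.parse import urlsplit


def classify_artifact_uri(uri):
    """Return (route, value) for a stored artifact URI.

    Hypothetical helper: the route labels are illustrative only.
    """
    parts = urlsplit(uri)
    if parts.scheme == "eden" and parts.netloc == "artifacts":
        artifact_id = parts.path.lstrip("/")
        if not artifact_id:
            raise ValueError(f"missing artifact id in {uri!r}")
        # Wire URI: the opaque id is looked up in whichever backend is configured.
        return ("backend", artifact_id)
    if parts.scheme == "file":
        # Legacy row: resolved via the artifacts dir, independent of the backend.
        return ("artifacts-dir", parts.path)
    # Bucket schemes such as s3:// or gs:// are never client-facing here.
    raise ValueError(f"unsupported artifact URI: {uri!r}")
```

Under this reading, a bucket-scheme URI reaching the server is an error rather than something to resolve, which is consistent with only the task-store-server pod holding bucket credentials.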

No-overwrite + 404 narrowing (codex-driven). S3 store is an atomic IfNoneMatch="*" conditional write (412 → FileExistsError; one retry on 409 ConditionalRequestConflict; boto3 floor bumped to ≥1.36); GCS uses if_generation_match=0. S3 load maps only NoSuchKey to NotFound (bucket-level 404s propagate as deployment errors); GCS disambiguates bucket-level NotFound via bucket.exists() with a Forbidden fallback for least-privilege roles.
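A minimal sketch of the S3 half of this behaviour, assuming a boto3 S3 client is passed in. The names store_no_overwrite, load_artifact, and ArtifactNotFound are hypothetical stand-ins, not the actual S3Backend methods; the error codes are the standard S3 ones for the 412, 409, and missing-key cases.

```python
class ArtifactNotFound(Exception):
    """Hypothetical stand-in for the backend's NotFound error."""


def _error_code(exc):
    # botocore's ClientError carries the service error code in exc.response.
    return exc.response.get("Error", {}).get("Code")


def store_no_overwrite(client, bucket, key, data):
    """Write data to key only if the key does not exist yet."""
    for attempt in range(2):  # first try, plus one retry on a 409
        try:
            # IfNoneMatch="*" makes the PUT conditional on the key being absent.
            client.put_object(Bucket=bucket, Key=key, Body=data, IfNoneMatch="*")
            return
        except client.exceptions.ClientError as exc:
            code = _error_code(exc)
            if code == "PreconditionFailed":  # HTTP 412: key already exists
                raise FileExistsError(key) from exc
            if code == "ConditionalRequestConflict" and attempt == 0:
                continue  # HTTP 409: a concurrent write was in flight; retry once
            raise


def load_artifact(client, bucket, key):
    """Read key, mapping only a missing object to ArtifactNotFound."""
    try:
        response = client.get_object(Bucket=bucket, Key=key)
    except client.exceptions.ClientError as exc:
        if _error_code(exc) == "NoSuchKey":
            raise ArtifactNotFound(key) from exc
        raise  # NoSuchBucket and similar are deployment errors; let them surface
    body = response["Body"]
    try:
        return body.read()
    finally:
        body.close()  # release the streaming connection after reading
```

Compared with a HEAD-then-PUT sequence, the conditional write leaves no window between the existence check and the upload in which another writer could create the key.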

Conflict awareness: all chart changes are additive (new top-level blob: block, two new template files, arg insertions, one new schema property + allOf) so the in-flight 13c managed-Postgres PR (#173) rebases mechanically whichever lands second.

What this does NOT cover

Fresh-operator walkthrough

  • A fresh-operator walkthrough was performed against the changed surface.
  • Notes (CLI): --blob-backend s3 without a bucket fails at startup with "--blob-s3-bucket is required with --blob-backend s3."; a stray --blob-s3-bucket under file mode fails with "--blob-s3-bucket require(s) --blob-backend s3 (got 'file')."; configured s3 mode starts clean and announces EDEN_TASK_STORE_LISTENING; default file mode without a dir starts with the loud NON-DURABLE in-memory backend warning; file mode with --artifact-blob-dir starts clean with no warning.
  • Notes (Helm): all three modes render (helm template verified for file / s3+IRSA / s3+static / gcs+WI / gcs+key); blob.backend=s3 without bucket, missing auth, both-auth-paths, and blob.backend=localfs (invalid enum) all fail closed at template time.
  • Passed cleanly; no issues filed.
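The two CLI failure modes from the walkthrough can be sketched with argparse as follows. This is an assumption-laden illustration, not the real build_artifact_backend factory: build_parser and validate_blob_flags are hypothetical names, and only the --blob-backend and --blob-s3-bucket flags and the two error strings are taken from the notes above.

```python
import argparse


def build_parser():
    """Parser with just the two flags this sketch needs (hypothetical subset)."""
    parser = argparse.ArgumentParser(prog="task-store-server")
    parser.add_argument("--blob-backend", choices=["file", "s3", "gcs"], default="file")
    parser.add_argument("--blob-s3-bucket")
    return parser


def validate_blob_flags(args):
    """Return an error message, or None when the flag combination is valid."""
    if args.blob_backend == "s3" and not args.blob_s3_bucket:
        # Required-bucket validation.
        return "--blob-s3-bucket is required with --blob-backend s3."
    if args.blob_backend != "s3" and args.blob_s3_bucket:
        # Stray-flag rejection: an S3 flag under a non-S3 backend.
        return (
            "--blob-s3-bucket require(s) --blob-backend s3 "
            f"(got {args.blob_backend!r})."
        )
    return None
```

A caller could pass any non-None message to parser.error() so that the process exits at startup, before it begins listening.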

Test plan

  • uv run ruff check . — clean
  • uv run pyright — 0 errors
  • uv run pytest -q — 2334 passed, 254 skipped, 1 pre-existing tracked flake (Flaky on macOS: Subprocess.terminate() os.killpg raises PermissionError (EPERM) — test_dispatch_collects_ideas #303; passes on re-run)
  • npx --yes markdownlint-cli2@0.14.0 … — 0 errors
  • python3 scripts/check-complexity.py / check-rename-discipline.py / spec-xref-check.py — clean
  • helm lint reference/helm/eden -f reference/helm/eden/ci-values.yaml + all-mode helm template renders + fail-closed negatives — clean
  • bash reference/compose/healthcheck/smoke.sh — PASS (default file posture unchanged)
  • Codex review, 3 synchronous rounds (records committed under docs/plans/review/eden-phase-13d-blob-backend/impl/20260609T232640/): round 0 fix-then-ship (7 findings, all fixed), round 1 verified + 2 new (fixed), round 2 ship.

Related issues

🤖 Generated with Claude Code

ealt and others added 4 commits June 9, 2026 23:16
S3Backend (boto3) + GcsBackend (google-cloud-storage) implement the #166
ArtifactBackend Protocol as eden-storage optional extras; the
task-store-server gains --blob-backend file|s3|gcs + per-backend flags
via a new build_artifact_backend factory; the Helm chart gains the
blob.* values block (default file = chart-managed keep-PVC; s3/gcs =
operator bucket + exactly-one auth path enforced in values.schema.json);
helm-lint CI renders all three modes + fail-closed negatives; operator
runbook at docs/deployment/migrating-to-blob-backend.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ly write, replica cap

- S3Backend.load maps only NoSuchKey to NotFound (NoSuchBucket and other
  404-shaped deployment errors propagate); GcsBackend.load disambiguates
  bucket-level NotFound via bucket.exists() on the absent path.
- S3Backend.store uses IfNoneMatch='*' conditional write (412 ->
  FileExistsError), replacing HEAD-then-PUT; boto3 floor bumped to 1.36;
  StreamingBody closed after read.
- values.schema.json caps replicas.taskStoreServer at 1.
- Regression tests for all of the above; runbook + CHANGELOG updated;
  round-0 review record committed.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…flict retry

- GcsBackend.load catches Forbidden from bucket.exists() (object-only
  IAM roles lack storage.buckets.get) and falls back to NotFound;
  runbook recommends granting buckets.get for misconfig diagnostics.
- S3Backend.store retries once on 409 ConditionalRequestConflict.
- Regression tests for both; round-1 review record committed.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@ealt ealt enabled auto-merge (squash) June 10, 2026 06:59
@ealt ealt merged commit 593e9ff into main Jun 10, 2026
23 checks passed
ealt added a commit that referenced this pull request Jun 10, 2026
…ckend conflicts)

Tree is identical to the validated rebase commit a931967 — values.schema.json
allOf clauses merged with #304's blob.backend clauses, CHANGELOG keeps both
[Unreleased] entries.

Development

Successfully merging this pull request may close these issues.

Phase 13d — S3/GCS blob backend for the artifact substrate
