Wire-level artifact upload endpoint (POST/GET /v0/experiments/<id>/artifacts) for distributed deployments #166

@ealt

Background

The current artifact substrate model relies on a shared filesystem between workers, task-store-server, and web-ui (the ${EDEN_EXPERIMENT_DATA_ROOT}/artifacts/ bind-mount, surfaced as /var/lib/eden/artifacts/ inside containers). This works for single-machine demos but breaks down for distributed deployments:

Property | Today (single-machine) | Distributed reality
--- | --- | ---
Workers can write to substrate | Via shared bind-mount | No shared FS exists
Workers can read each other's artifacts | Yes (full read of bind-mount) | Security violation: workers shouldn't read others' submissions unless explicitly granted
Workers can overwrite each other's artifacts | Yes (full write) | Critical: malicious/buggy worker corrupts experiment record
Artifacts survive worker machine failure | Bind-mount durability | Worker-local files vanish on instance loss

Discovered during the 2026-05-22 manual demo session.

Proposal

Move artifacts from "shared filesystem" to wire-level deposit/retrieve:

Deposit (worker → server)

POST /v0/experiments/<id>/artifacts
  body: multipart/form-data
        name="file"; filename="<name>"; content-type=...
        binary bytes
  auth: worker bearer
  response: 201
    {
      "artifacts_uri": "eden://artifacts/<opaque-id>",
      "size_bytes": N,
      "content_type": "..."
    }

The returned artifacts_uri uses a new opaque scheme (eden://artifacts/<opaque-id>), NOT file://. Workers don't know where the bytes physically live; the server resolves the opaque id to actual storage on retrieval.
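A minimal client-side sketch of the deposit call, using only the Python standard library. It builds the multipart request without sending it and validates the 201 response shape described above. The function names, base_url, and token parameters are hypothetical and chosen for illustration; the real worker client may look quite different.

```python
import json
import uuid
import urllib.request


def build_deposit_request(base_url, experiment_id, token, filename, data,
                          content_type="application/octet-stream"):
    """Build (but do not send) a multipart/form-data deposit request.

    Assumption: filename contains no double quotes or line breaks; a real
    client would need to escape or reject such names.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v0/experiments/{experiment_id}/artifacts",
        data=head + data + tail,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",  # worker bearer
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


def parse_deposit_response(status, body):
    """Check the response against the shape proposed in this issue."""
    if status != 201:
        raise RuntimeError(f"deposit failed with HTTP {status}")
    payload = json.loads(body)
    if not payload["artifacts_uri"].startswith("eden://artifacts/"):
        raise ValueError("unexpected artifacts_uri scheme")
    return payload
```

The worker stores only the returned artifacts_uri; nothing in the request or response reveals a filesystem path.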

Retrieve (anyone with read permission → server)

GET /v0/experiments/<id>/artifacts/<opaque-id>
  auth: bearer with read permission on this artifact
  response: 200 with the bytes
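To illustrate how a client might turn a stored artifacts_uri into the retrieval path, here is a small sketch. The helper names are hypothetical, and the validation rules (a single non-empty path segment as the opaque id) are an assumption, since the issue does not define the id's syntax.

```python
from urllib.parse import quote, urlsplit


def opaque_id_from_uri(artifacts_uri):
    """Extract <opaque-id> from eden://artifacts/<opaque-id>."""
    parts = urlsplit(artifacts_uri)
    if parts.scheme != "eden" or parts.netloc != "artifacts":
        raise ValueError(f"not an eden artifact URI: {artifacts_uri!r}")
    opaque_id = parts.path.lstrip("/")
    # Assumption: the id is one non-empty segment with no slashes.
    if not opaque_id or "/" in opaque_id:
        raise ValueError(f"malformed opaque id in {artifacts_uri!r}")
    return opaque_id


def retrieve_path(experiment_id, artifacts_uri):
    """Build the GET path for the proposed retrieve endpoint."""
    opaque_id = opaque_id_from_uri(artifacts_uri)
    return (
        f"/v0/experiments/{quote(experiment_id, safe='')}"
        f"/artifacts/{quote(opaque_id, safe='')}"
    )
```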

Auth model

  • A worker can deposit; their worker_id is recorded as the artifact's created_by.
  • An artifact deposited as part of a submission is readable by:
    • The depositing worker
    • Anyone with admin bearer (deployment-scoped read)
    • The role that operates on the variant the artifact belongs to (e.g., evaluator can read executor's artifact for a variant they're evaluating)
  • This is operator-facing access control rather than a full role-based ACL. It could lean on the existing admins/orchestrators group machinery, and it needs scoping per #143 ("Web UI sign-ups are non-admin by default; admins explicitly promote others to admins group").
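The read rules above can be sketched as a single predicate. The Artifact and Caller types, their field names, and the idea of passing in the set of variants a caller currently operates on are all assumptions made for illustration; the issue does not specify how the server would represent them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Artifact:
    opaque_id: str
    created_by: str                    # worker_id recorded at deposit time
    variant_id: Optional[str] = None   # variant the submission belongs to


@dataclass(frozen=True)
class Caller:
    worker_id: Optional[str] = None
    is_admin: bool = False
    # Hypothetical: variants this caller's role currently operates on,
    # e.g. the variants an evaluator has been assigned to evaluate.
    operating_variant_ids: frozenset = frozenset()


def can_read(caller, artifact):
    """Return True if any of the three read rules grants access."""
    if caller.is_admin:
        return True  # deployment-scoped read
    if caller.worker_id is not None and caller.worker_id == artifact.created_by:
        return True  # the depositing worker
    # A role operating on the variant the artifact belongs to.
    return (
        artifact.variant_id is not None
        and artifact.variant_id in caller.operating_variant_ids
    )
```

Anything not matched by a rule is denied, which gives the cross-worker isolation the Background table calls for.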

Storage backend abstraction

Behind the wire endpoint, the server uses a Backend Protocol (already tracked under #102 for the checkpoint-substrate rewrite + Phase 13d plan for blob backends):

  • Local (Compose default): writes to ${EDEN_EXPERIMENT_DATA_ROOT}/artifacts/ with random filenames; the URI's opaque id maps to filename
  • S3 / GCS / Azure (production): writes to a blob store
  • In-process (test): an in-memory dict

Operators never see the physical layout; they only see opaque URIs.
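One possible shape for the Backend Protocol, with the in-process and local variants from the list above. The put/get method names and the ArtifactBackend name are assumptions; the actual interface tracked under #102 may differ, and blob-store backends are omitted here.

```python
import os
import secrets
from typing import Protocol


class ArtifactBackend(Protocol):
    """Hypothetical interface: store bytes, get back an opaque id."""

    def put(self, data: bytes) -> str: ...

    def get(self, opaque_id: str) -> bytes: ...


class InMemoryBackend:
    """In-process (test) backend: an in-memory dict."""

    def __init__(self):
        self._blobs = {}

    def put(self, data):
        opaque_id = secrets.token_hex(16)
        self._blobs[opaque_id] = bytes(data)
        return opaque_id

    def get(self, opaque_id):
        return self._blobs[opaque_id]  # KeyError if unknown


class LocalBackend:
    """Local backend: random filenames under a root directory."""

    def __init__(self, root):
        self._root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, opaque_id):
        # Reject ids that could escape the root directory.
        if (not opaque_id or opaque_id in (".", "..")
                or os.path.basename(opaque_id) != opaque_id):
            raise KeyError(opaque_id)
        return os.path.join(self._root, opaque_id)

    def put(self, data):
        opaque_id = secrets.token_hex(16)
        # "xb" refuses to overwrite an existing file.
        with open(self._path(opaque_id), "xb") as fh:
            fh.write(data)
        return opaque_id

    def get(self, opaque_id):
        try:
            with open(self._path(opaque_id), "rb") as fh:
                return fh.read()
        except FileNotFoundError:
            raise KeyError(opaque_id) from None
```

Because the id is generated server-side and files are opened in exclusive-create mode, one deposit cannot overwrite another.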

Compatibility with existing artifacts

Existing file:// URIs continue to work (read-side) via the current /artifacts?uri=... route. The new eden://artifacts/<id> URIs are surfaced via the new endpoints. Migration: any deployment past v0 wouldn't emit file:// URIs; the operator-facing URI scheme converges on eden://artifacts/....

Once deployments no longer emit file:// URIs, the web-ui's /artifacts?uri=... route gets retired, and GET /v0/experiments/<id>/artifacts/<opaque-id> becomes the unified read path for all artifacts.
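During the transition, a reader has to pick a route based on the URI scheme. The sketch below shows one way to do that dispatch. The read_route name is hypothetical, and percent-encoding the legacy uri query parameter is an assumption about how the existing route expects its input.

```python
from urllib.parse import quote, urlsplit


def read_route(experiment_id, artifacts_uri):
    """Choose the read path for an artifact URI during migration."""
    parts = urlsplit(artifacts_uri)
    if parts.scheme == "eden" and parts.netloc == "artifacts":
        # New opaque scheme: served by the wire-level endpoint.
        opaque_id = parts.path.lstrip("/")
        return f"/v0/experiments/{experiment_id}/artifacts/{opaque_id}"
    if parts.scheme == "file":
        # Legacy URI: still served by the web-ui route until it is retired.
        return "/artifacts?uri=" + quote(artifacts_uri, safe="")
    raise ValueError(f"unsupported artifact URI scheme: {artifacts_uri!r}")
```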

Spec implications

  • spec/v0/02-data-model.md: clarify that artifacts_uri is opaque from the operator's perspective; deployment-local resolution is the substrate's job.
  • spec/v0/07-wire-protocol.md: new POST/GET endpoints + new opaque URI scheme.
  • spec/v0/10-checkpoints.md §7: align with the eden:// scheme; closes the gap that #102 ("Future: artifact substrate rewrite for portable checkpoints (v1+checkpoints+artifacts)") was deferred against.
  • spec/v0/09-conformance.md: scenarios for deposit, retrieve, auth-gating, cross-worker isolation.

Dependencies

Out of scope

  • Resumable uploads / multipart-with-checksum (could be added later).
  • Per-artifact lifecycle (TTL, expiration) — deferred.
  • Cross-experiment artifact sharing — out of scope by design.

Estimated effort

Large. ~3-4 weeks: wire endpoints + Backend Protocol + auth wiring + migration of existing call sites + web-ui form changes (overlap with #120) + spec amendments + conformance scenarios.

Could chunk into: wire endpoints with file-backend only (1 PR) → Backend Protocol abstraction (1 PR) → S3 backend (1 PR, post Phase 13d) → web-ui upload form (#120).

Filing notes

Discovered during the 2026-05-22 manual demo session — operator asked "what about a worker on another machine?" The current model has no answer.

Metadata

Labels

  • cluster:durability (Substrate persistence, recovery, checkpoints)
  • enhancement (New feature or request)
  • priority:2-planned (Planned roadmap work; well-scoped, not urgent)
  • triage:needs-plan (Issue triage disposition)
