BRC Data Store: where to host data assets #1326

@NoopDog

TL;DR — Where should we host BRC Analytics' data assets (the "BRC Data Store")? Proposed: the Galaxy datacache (CVMFS), with TACC Corral (S3) and Galaxy histories as alternatives considered.

We need a home for the pangenome initiative's downloadable data, and a decision on
how people, the app, and tools access it. Related: vision #1279, epic #1322.

Scope

  • Size: the first species bundle (P. vivax, #1279) is ~350 GB (~339 GB of
    large binaries, ~5,800 small per-gene bundles, and a little git-tracked text).
    One bundle per species → multi-TB as it grows. Genome-browser track files live
    on UCSC, so that slice isn't ours to host.
  • Consumers: (1) people downloading files; (2) the app's per-gene browser
    (~5,800 results to search/filter — likely loaded into a DB, with the store as
    the source of truth); (3) Galaxy and other tools reading the data as input.
  • Writing: creation runs as a Galaxy workflow, so outputs land in Galaxy
    first and then need publishing out to the chosen store.
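
For consumer (2), a minimal sketch of loading a per-gene manifest into a database for search/filter, with the store remaining the source of truth. The manifest fields (`gene_id`, `bundle_path`, `size_bytes`), the table name, and the sample rows are all hypothetical; the real bundle layout is not defined in this issue.

```python
import sqlite3

# Hypothetical manifest rows: (gene_id, bundle_path, size_bytes).
# Names and values are illustrative only, not the real bundle layout.
MANIFEST = [
    ("gene0001", "genes/gene0001.tar.gz", 52_000),
    ("gene0002", "genes/gene0002.tar.gz", 48_500),
    ("gene0100", "genes/gene0100.tar.gz", 61_250),
]


def build_index(rows):
    """Load manifest rows into an in-memory SQLite table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE gene_bundle ("
        "gene_id TEXT PRIMARY KEY, "
        "bundle_path TEXT NOT NULL, "
        "size_bytes INTEGER NOT NULL)"
    )
    db.executemany("INSERT INTO gene_bundle VALUES (?, ?, ?)", rows)
    db.commit()
    return db


def search(db, prefix):
    """Return (gene_id, bundle_path) pairs whose gene_id starts with prefix."""
    cur = db.execute(
        "SELECT gene_id, bundle_path FROM gene_bundle "
        "WHERE substr(gene_id, 1, length(?)) = ? "
        "ORDER BY gene_id",
        (prefix, prefix),
    )
    return cur.fetchall()


db = build_index(MANIFEST)
hits = search(db, "gene00")
```

The database here is a disposable index: it can be rebuilt from the manifest at any time, so the store stays authoritative.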

Proposed: Galaxy datacache (CVMFS)

Lean toward the Galaxy datacache as the BRC Data Store. BRC already uses it
(the /brc area for genome indexes), creation is Galaxy-native so publishing there
is the natural path, it handles many small files well, and it's distributed across
public Stratum-1 replicas. Genome-browser track files still go to UCSC hubs
(already planned in #1279), independent of this choice.

To settle:

  • HTTPS. The datacache serves over HTTP only. Click-through download links
    work, but in-app JavaScript fetches (and clean links from our HTTPS site without
    mixed-content issues) need HTTPS — so we'd likely front it with an HTTPS
    proxy.
  • Write path. Workflow outputs are published via a CVMFS publish step
    (Stratum-0 → Stratum-1) — a batched publish, not a live write. Confirm who owns
    it.
  • Operational fit. The datacache's layout, lifecycle, and retention are tuned
    for reference data; confirm hosting pangenome assets alongside that is acceptable.
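
To illustrate the HTTPS point, here is a simplified check for the mixed-content case: an HTTPS page requesting a plain-HTTP resource. Both hostnames are placeholders, and the rule is deliberately reduced to scheme comparison; real browsers apply further rules (for example, treating localhost as trustworthy).

```python
from urllib.parse import urlsplit


def is_mixed_content(page_url: str, resource_url: str) -> bool:
    """True when an HTTPS page would request a plain-HTTP resource.

    Simplified: compares URL schemes only.
    """
    page = urlsplit(page_url)
    resource = urlsplit(resource_url)
    return page.scheme == "https" and resource.scheme == "http"


# Placeholder URLs, not the real site or datacache endpoints.
APP_PAGE = "https://app.example.org/organisms/pvivax"
HTTP_ASSET = "http://datacache.example.org/pangenome/genes/gene0001.json"
PROXIED_ASSET = "https://data.example.org/pangenome/genes/gene0001.json"

direct_fetch_blocked = is_mixed_content(APP_PAGE, HTTP_ASSET)
proxied_fetch_blocked = is_mixed_content(APP_PAGE, PROXIED_ASSET)
```

Under these assumptions the direct fetch is flagged and the proxied one is not, which is the argument for fronting the datacache with an HTTPS proxy.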

Other options considered

  • TACC Corral (S3). S3 v4 interface, by request, 25 TB minimum. We'd
    control layout/access and get native S3/HTTPS for tools; Galaxy can write to it
    directly. Would need confirming whether it serves public/presigned HTTPS download
    links. A fallback if the datacache doesn't fit operationally — though it may not
    support high-throughput or repeated read access patterns at the moment, and may
    not be a natural fit for Galaxy workflow access.
  • Galaxy histories / data libraries. Low-friction in-Galaxy sharing since
    creation is Galaxy-native, but tied to a Galaxy instance and permissioned through
    it — better for sharing within Galaxy than a public store for a separate site and
    external tools.
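
As a rough sizing check against Corral's 25 TB minimum, the sketch below assumes every species bundle is about the size of the P. vivax one (~350 GB) and uses decimal units (1 TB = 1000 GB). Both are assumptions; real bundles will vary.

```python
BUNDLE_GB = 350      # P. vivax estimate from the Scope section
CORRAL_MIN_TB = 25   # Corral minimum allocation
GB_PER_TB = 1000     # decimal units (assumption)


def total_tb(n_species, bundle_gb=BUNDLE_GB):
    """Total storage in TB for n_species bundles of equal size."""
    return n_species * bundle_gb / GB_PER_TB


def bundles_that_fit(allocation_tb, bundle_gb=BUNDLE_GB):
    """Whole bundles that fit in an allocation of allocation_tb."""
    return int(allocation_tb * GB_PER_TB // bundle_gb)


ten_species_tb = total_tb(10)                      # 3.5
fit_in_minimum = bundles_that_fit(CORRAL_MIN_TB)   # 71
```

So at this bundle size the minimum allocation would cover roughly 70 species, well beyond the near-term need.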

Open decisions

  1. Confirm the datacache is an operational fit (layout/lifecycle/retention) for
    pangenome assets.
  2. HTTPS: do we need it (in-app fetch / clean public links), and if so, do we
    stand up a proxy in front of the datacache?
  3. Decide who owns the CVMFS publish step for the Galaxy workflow outputs.
  4. If the datacache doesn't fit, fall back to Corral S3 — and then confirm its
    public-HTTPS capability, allocation owner, and size.
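
For decision 3, it may help to see what the owner of the publish step would run. The sketch below only builds the command list for one batched publish as a dry run; it executes nothing. `cvmfs_server transaction` and `cvmfs_server publish` are the standard CVMFS commands run on the Stratum-0 host, but the repository name, staging directory, and destination path are placeholders, and using `rsync` for the copy is an assumption.

```python
import shlex


def publish_plan(repo, staged_dir, dest_subdir):
    """Return the commands for one batched CVMFS publish (dry run only)."""
    target = f"/cvmfs/{repo}/{dest_subdir.strip('/')}/"
    return [
        # Open a writable transaction on the Stratum-0 repository.
        ["cvmfs_server", "transaction", repo],
        # Copy staged workflow outputs into the repository tree.
        ["rsync", "-a", staged_dir.rstrip("/") + "/", target],
        # Commit the transaction; Stratum-1 replicas pick it up afterwards.
        ["cvmfs_server", "publish", repo],
    ]


# Placeholder repository and paths, not the real datacache layout.
plan = publish_plan("repo.example.org", "/staging/pvivax", "pangenome/pvivax")
for cmd in plan:
    print(shlex.join(cmd))
```

If the copy fails partway, the open transaction would be discarded with `cvmfs_server abort` rather than published.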
