TL;DR — Where should we host BRC Analytics' data assets (the "BRC Data Store")? Proposed: the Galaxy datacache (CVMFS), with TACC Corral (S3) and Galaxy histories as alternatives considered.
We need a home for the pangenome initiative's downloadable data, and a plan for
how people, the app, and tools get at it. Related: vision #1279, epic #1322.
Consumers: (1) people downloading files; (2) the app's per-gene browser
(~5,800 results to search/filter — likely loaded into a DB, with the store as
the source of truth); (3) Galaxy and other tools reading the data as input.
Writing: creation runs as a Galaxy workflow, so outputs land in Galaxy
first and then need publishing out to the chosen store.
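A minimal sketch of consumer (2), assuming the per-gene results can be listed as simple records: the store stays the source of truth and the DB only holds a searchable index pointing back at it. The table name, column names, and sample paths below are hypothetical, not an agreed schema.

```python
import sqlite3

# Hypothetical record shape: (gene_id, species, path of the bundle in the store).
SAMPLE_RESULTS = [
    ("geneA", "species_1", "pangenome/species_1/geneA.tar.gz"),
    ("geneB", "species_1", "pangenome/species_1/geneB.tar.gz"),
    ("geneA", "species_2", "pangenome/species_2/geneA.tar.gz"),
]


def build_index(records):
    """Load per-gene records into an in-memory SQLite table for search/filter."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE gene_results ("
        "gene_id TEXT NOT NULL, "
        "species TEXT NOT NULL, "
        "store_path TEXT NOT NULL, "
        "PRIMARY KEY (gene_id, species))"
    )
    conn.executemany("INSERT INTO gene_results VALUES (?, ?, ?)", records)
    conn.commit()
    return conn


def find_by_gene(conn, gene_id):
    """Return (species, store_path) rows for one gene, ordered by species."""
    rows = conn.execute(
        "SELECT species, store_path FROM gene_results "
        "WHERE gene_id = ? ORDER BY species",
        (gene_id,),
    )
    return rows.fetchall()


conn = build_index(SAMPLE_RESULTS)
print(find_by_gene(conn, "geneA"))
```

Because the index only stores paths, it can be dropped and rebuilt from the store at any time, which is what keeps the store authoritative.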
Proposed: Galaxy datacache (CVMFS)
Lean toward the Galaxy datacache as the BRC Data Store. BRC already uses it
(the /brc area for genome indexes), creation is Galaxy-native so publishing there
is the natural path, it handles many small files well, and it's distributed across
public Stratum-1 replicas. Genome-browser track files still go to UCSC hubs
(already planned in #1279), independent of this choice.
To settle:
HTTPS. The datacache serves over HTTP only. Click-through download links
work, but in-app JavaScript fetches (and clean links from our HTTPS site without
mixed-content issues) need HTTPS — so we'd likely front it with an HTTPS proxy.
Write path. Workflow outputs are published via a CVMFS publish step
(Stratum-0 → Stratum-1) — a batched publish, not a live write. Confirm who owns
it.
Operational fit. The datacache's layout, lifecycle, and retention are tuned
for reference data; confirm hosting pangenome assets alongside that is acceptable.
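To make the write path concrete, here is a rough sketch of what one batched publish could look like on the Stratum-0 side. `cvmfs_server transaction` and `cvmfs_server publish` are the standard CVMFS commands for opening and committing a change set; everything else is an assumption for illustration: the repository name, the staging directory, the destination subdirectory, and the use of rsync to copy workflow outputs in. The function only builds the commands and does not run them.

```python
import shlex


def publish_commands(repo, staged_dir, dest_subdir):
    """Return the commands for one batched CVMFS publish (nothing is executed).

    repo, staged_dir and dest_subdir are hypothetical inputs; the real
    datacache repository and layout still need to be confirmed.
    """
    target = f"/cvmfs/{repo}/{dest_subdir.strip('/')}/"
    return [
        # Open a writable transaction on the Stratum-0 repository.
        ["cvmfs_server", "transaction", repo],
        # Copy the staged workflow outputs into the repository tree.
        ["rsync", "-a", f"{staged_dir.rstrip('/')}/", target],
        # Commit the change set; Stratum-1 replicas pick it up on their next sync.
        ["cvmfs_server", "publish", repo],
    ]


for cmd in publish_commands("data.example.org", "/staging/pangenome", "brc/pangenome"):
    print(shlex.join(cmd))
```

The point of the sketch is the shape of the step: outputs are staged first and then committed in one batch, so whoever owns this step also owns its schedule.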
Other options considered
TACC Corral (S3). S3 v4 interface, by request, 25 TB minimum. We'd
control layout/access and get native S3/HTTPS for tools; Galaxy can write to it
directly. We would need to confirm whether it serves public/presigned HTTPS download
links. A fallback if the datacache doesn't fit operationally — though it may not
support high-throughput or repeated read access patterns at the moment, and may
not be a natural fit for Galaxy workflow access.
Galaxy histories / data libraries. Low-friction in-Galaxy sharing since
creation is Galaxy-native, but tied to a Galaxy instance and permissioned through
it — better for sharing within Galaxy than a public store for a separate site and
external tools.
Open decisions
Confirm the datacache is an operational fit (layout/lifecycle/retention) for
pangenome assets.
HTTPS: do we need it (in-app fetch / clean public links), and if so, do we stand
up a proxy in front of the datacache?
Who owns the CVMFS publish step from the Galaxy workflow outputs.
If the datacache doesn't fit, fall back to Corral S3 — and then confirm its
public-HTTPS capability, allocation owner, and size.
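For the HTTPS decision, a small sketch of the two checks involved, assuming a proxy that mirrors datacache paths one-to-one. Both hostnames are placeholders, and the one-to-one path mapping is an assumption rather than a settled design.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical proxy hostname; the real one is still to be decided.
PROXY_HOST = "data-proxy.example.org"


def is_mixed_content(page_url, resource_url):
    """True when an HTTPS page would be fetching a plain-HTTP resource."""
    return (
        urlsplit(page_url).scheme == "https"
        and urlsplit(resource_url).scheme == "http"
    )


def to_https_proxy(url, proxy_host=PROXY_HOST):
    """Rewrite an http:// datacache URL to the same path on the HTTPS proxy."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        return url  # already safe to use from an HTTPS page
    if parts.scheme != "http":
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    return urlunsplit(("https", proxy_host, parts.path, parts.query, parts.fragment))


raw = "http://datacache.example.org/brc/pangenome/geneA.json"
print(is_mixed_content("https://site.example.org/genes", raw))
print(to_https_proxy(raw))
```

If the proxy keeps paths identical, links in the app can be generated from store paths alone and the HTTP origin never appears in the site.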
Scope
large binaries + ~5,800 small per-gene bundles + a little git-tracked text).
One bundle per species → multi-TB as it grows. Genome-browser track files live
on UCSC, so that slice isn't ours to host.