A self-contained, single-host deployment of HySDS —
the Hybrid Cloud Science Data System framework NASA missions use to
process science data at scale. Designed to get you from git clone to a
running cluster with registered jobs, populated dashboards, and a working
smoke-test submission in about 30 minutes.
If you've never touched HySDS before, start here.
- What is HySDS?
- Architecture at a glance
- Prerequisites
- Quick start
- The smoke test
- What ships pre-configured
- The HySDS UIs
- Accessing the UIs from your laptop
- Credentials
- Bringing your own PGE
- Configuration reference (.env)
- Repository layout
- Operations
- Troubleshooting
- Going further
HySDS is an open-source framework for scheduling, executing, and cataloging containerized science-data jobs. NASA missions (NISAR, SWOT, OCO, OPERA…) use it to run hundreds of thousands of jobs per day across mixed on-prem + cloud infrastructure.
| Problem | HySDS answer |
|---|---|
| Where do jobs go? | A mozart orchestrator pushes job specs onto Celery/RabbitMQ queues. |
| Who runs them? | verdi workers pull from queues and spawn containerized PGEs. |
| Where do products go? | The PGE writes to a configured object store; grq catalogs the metadata. |
| How do I see what's happening? | A metrics node ingests events; Figaro, Tosca, and OpenSearch Dashboards visualize them. |
A HySDS "PGE" (Product Generation Executable) is just a container with a declared command and input/output contract. HySDS knows nothing about your science — it knows how to schedule, execute, persist, and catalog containers reliably.
A production HySDS deployment spans many hosts. This quick-start collapses
everything onto one Linux host using rootless podman
and podman compose.
┌────────────────────────────────────────────────┐
│ your single host │
│ │
browser ────▶ │ ┌────────────────┐ ┌────────────────┐ │
│ │ hysds_ui │ │ dashboards │ │
│ │ Figaro + Tosca │ │ OpenSearch │ │
│ │ React SPA + │ │ + Job/Worker │ │
│ │ nginx proxy │ │ Metrics │ │
│ └───────┬────────┘ └───────┬────────┘ │
│ │ │ │
│ /mozart/api ─▶ mozart ◀──▶│ │
│ /grq/api ─▶ grq ◀──▶│ │
│ /*_es ─▶ opensearch─┘ │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ mozart │ │ grq │ │ metrics │ │
│ │ orchestr │ │ catalog │ │ host + │ │
│ │ + REST + │ │ + REST + │ │ job info │ │
│ │ RabbitMQ │ │ OS 2.x │ │ OS 2.x │ │
│ │ OS 2.x │ └──────────┘ └──────────┘ │
│ └────┬─────┘ │
│ │ celery task │
│ ▼ │
│ ┌──────────┐ spawn ┌──────────────┐ │
│ │ verdi │ ─────────▶ │ PGE │ │
│ │ worker │ podman.sock│ container │ │
│ └──────────┘ └──────────────┘ │
│ │
│ Persistent state: $HYSDS_HOME on host fs │
└────────────────────────────────────────────────┘
Every component runs on the shared podman network hysds_net and addresses
its peers by compose service name (e.g. mozart-rabbitmq,
grq-elasticsearch).
| Component | Role | Primary port |
|---|---|---|
| hysds_ui | React Figaro + Tosca SPA + nginx proxy for all backends | :8480 |
| mozart | Job orchestration, REST API, RabbitMQ broker, celery workers | :8888, :15672 |
| grq | Dataset catalog (geo-region query) + REST | :8878 |
| metrics | Host + per-job metrics aggregator with logstash indexer | :9500 |
| verdi | Worker — pulls jobs, spawns PGE containers via podman.sock | :8085 (WebDAV) |
| dashboards | OpenSearch Dashboards pre-loaded with Job/Worker metrics | :5601 |
1. POST /job/submit ─▶ mozart REST receives job_spec + queue + tags
2. orchestrator_jobs worker ─▶ validates spec, pushes celery task onto the queue
3. RabbitMQ ─▶ fans task out to a verdi worker
4. verdi ─▶ via podman.sock spawns the PGE container
5. PGE ─▶ runs; writes <dataset>.dataset.json + .met.json
6. verdi post-process ─▶ publishes dataset to the configured store,
indexes metadata in grq, metrics in metrics
7. orchestrator_datasets ─▶ evaluates user_rules to fire downstream jobs
Step 5 is the only piece you write. Everything else is the framework.
Runs on one Linux host (laptop with virtualization, on-prem server, or cloud VM). macOS and Windows users should install inside a Linux VM — rootless podman on non-Linux hosts is too fiddly for a getting-started exercise.
| Requirement | Check |
|---|---|
| Linux x86_64, ≥ 4 vCPU, ≥ 16 GB RAM | nproc, free -h |
| ~30 GB free disk under $HYSDS_HOME | df -h |
| podman 4.0+ | podman --version |
| podman compose or podman-compose | podman compose version |
| jq, curl, gzip, sha256sum | dnf install jq curl gzip |
| vm.max_map_count ≥ 262144 (OpenSearch) | sysctl vm.max_map_count |
| /etc/subuid, /etc/subgid for your user | grep $USER /etc/subuid |
| User-level podman socket enabled | systemctl --user status podman.socket |
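
The checks in the table can be strung together into a quick preflight pass. The sketch below is illustrative and not part of the repo: the function names `at_least`, `report`, and `preflight` are made up here, the thresholds come from the table, and only three of the requirements are covered.

```shell
#!/bin/sh
# Hypothetical preflight helper; thresholds mirror the requirements table.

# at_least VALUE MIN: succeed when VALUE is an integer >= MIN.
at_least() {
    [ "$1" -ge "$2" ] 2>/dev/null
}

# report LABEL VALUE MIN: print one ok/FAIL line for a requirement.
report() {
    if at_least "$2" "$3"; then
        echo "ok: $1 ($2 >= $3)"
    else
        echo "FAIL: $1 (have '$2', need >= $3)"
    fi
}

# preflight: check CPU count, free disk, and the OpenSearch sysctl on this host.
preflight() {
    report "vCPUs" "$(nproc)" 4
    # df -Pk prints available space in 1 KiB blocks in column 4; convert to GiB.
    report "free GiB under ${HYSDS_HOME:-$HOME}" \
        "$(df -Pk "${HYSDS_HOME:-$HOME}" | awk 'NR==2 {print int($4 / 1048576)}')" 30
    report "vm.max_map_count" "$(sysctl -n vm.max_map_count)" 262144
}
```

Source the file and run `preflight`; any `FAIL:` line points at a row of the table that still needs attention.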
scripts/bootstrap.sh does this end-to-end (host packages, sysctl, subuid,
podman socket, linger, podman-compose, plus a .env populated with the
running user's uid / home). Idempotent — safe to re-run on the same host or
as a different user.
./scripts/bootstrap.sh

If you'd rather drive the steps by hand:
sudo dnf install -y podman jq curl git python3-pip
sudo sysctl -w vm.max_map_count=262144
echo 'vm.max_map_count=262144' | sudo tee /etc/sysctl.d/99-opensearch.conf
sudo usermod --add-subuids 100000-165535 --add-subgids 100000-165535 $USER
systemctl --user enable --now podman.socket
sudo loginctl enable-linger $USER # keep user services alive after logout
pip install --user podman-compose # only if 'podman compose' is missing

If your install host can reach docker.io, this is all you need:
git clone <this repo url> hysds-quick-start
cd hysds-quick-start
./scripts/bootstrap.sh # one-time host + per-user prereqs; auto-fills .env
$EDITOR .env # set OPENSEARCH_PASSWORD (everything else is auto)
./install.sh all # ~10-20 min: image pulls + builds + compose up
./scripts/smoke-test.sh # ~30 sec end-to-end proof

At the end, install.sh prints the UI URLs with credentials. Open
http://localhost:8480/ (or forward it — see
Accessing the UIs from your laptop).
If the install host has no outbound internet, build the image bundle on a staging host that does:
# === on the staging host (has docker.io access) ===
cd hysds-quick-start
cp .env.example .env
./bundle.sh # produces hysds-<ver>-bundle.tar.gz + .sha256

Copy the tarball, its .sha256, and this repo to the install host. Then:
# === on the install host ===
cd hysds-quick-start
./scripts/bootstrap.sh # host + per-user prereqs; auto-fills .env
$EDITOR .env # rotate OPENSEARCH_PASSWORD, etc. as needed
./install.sh all # checksum-verifies, podman load, compose up
./scripts/smoke-test.sh

install.sh is idempotent: re-running it re-applies config, rebuilds any
stopped components, and re-syncs state without touching persistent data.
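
For the offline flow, install.sh verifies the bundle checksum for you. If you want to check the copy by hand before running it, a minimal sketch follows. It assumes the sidecar file is named `<tarball>.sha256` and holds standard `sha256sum` output; both are assumptions about what bundle.sh writes, and `verify_bundle` is a name invented for this example.

```shell
#!/bin/sh
# Sketch of a manual checksum check; not the actual install.sh logic.

# verify_bundle TARBALL: check TARBALL against the TARBALL.sha256 file next to it.
verify_bundle() {
    bundle_dir=$(dirname "$1")
    bundle_name=$(basename "$1")
    # sha256sum -c resolves file names relative to the current directory,
    # so run it from the directory that holds both files.
    (cd "$bundle_dir" && sha256sum -c "$bundle_name.sha256" >/dev/null 2>&1)
}
```

For example, `verify_bundle ./hysds-<ver>-bundle.tar.gz && echo "bundle intact"` prints the message only when the digest matches.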
./scripts/smoke-test.sh exercises the entire pipeline end-to-end:
1/4 Register hello-world specs in mozart
POST /container/add docker.io/hysds/hello-world:v1.0
POST /job_spec/add job-hello_world:v1.0
POST /hysds_io/add hysds-io-hello_world:v1.0 (mozart + grq)
2/4 Submit the job
POST /job/submit queue=hello_world_worker
◀── { "result": "<celery-task-id>" }
3/4 Poll grq for the published product (timeout 600s)
baseline hello_world_product count: N
─── verdi pulls task; spawns PGE container
─── PGE writes dataset.json + met.json + 5 MB fake_data.dat
─── verdi publishes to file:///data/datasets/...
─── verdi indexes metadata in grq
count=N+1
new product indexed in GRQ.
4/4 Verify dataset landed in grq
SUCCESS: N+1 hello_world_product dataset(s) indexed.
Smoke test passed.
When this prints Smoke test passed. you have a fully working HySDS cluster:
registered PGE, successful job execution, catalog entry on-disk at
$HYSDS_HOME/datasets/hello_world_product/v1.0/<id>/:
hello_world-product-<ts>-<hash>.dataset.json
hello_world-product-<ts>-<hash>.met.json
_publish.context.json
fake_data.dat
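
Step 3 of the smoke test is a poll-with-timeout loop. The sketch below shows the general pattern only and is not the script's real code; `wait_for_increase` is a hypothetical helper, and the command that actually counts products in grq is left as a parameter because the query itself is not shown here.

```shell
#!/bin/sh
# Generic poll loop in the spirit of smoke-test step 3.

# wait_for_increase BASELINE TIMEOUT_SECS INTERVAL_SECS CMD...
# Runs CMD (which must print an integer count) until the count exceeds BASELINE.
wait_for_increase() {
    baseline=$1; timeout=$2; interval=$3
    shift 3
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # Treat a failed or empty query as "no change yet".
        count=$("$@") || count=$baseline
        count=${count:-$baseline}
        if [ "$count" -gt "$baseline" ]; then
            echo "count=$count"
            return 0
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    echo "timed out after ${timeout}s (count still $baseline)" >&2
    return 1
}
```

With a counting command of your own (say a hypothetical `count_products` function), `wait_for_increase "$N" 600 5 count_products` mirrors the 600-second timeout shown in the transcript.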
After install.sh all finishes, the cluster already has:
- 14 registered job specs ready for Figaro's On-Demand form: hello_world plus all 13 upstream HySDS system jobs (lw-mozart-{purge,retry,revoke,reprioritize,notify-by-email}, lw-tosca-{aws_get,notify-by-email,purge,wget,wget-email,wget-glob,wget-product}, lightweight-echo).
- Two OpenSearch Dashboards (courtesy of upstream sdscli):
  - Job Metrics: throughput, duration distributions, failed jobs, container/version churn.
  - Worker Metrics: worker heartbeat timeline, tasks-per-worker, per-host CPU / memory / disk from instance_stats.
- OpenSearch index templates + aliases on mozart's cluster for job_status, task_status, worker_status, event_status. Polymorphic subtrees (job.params, job.context, event) are stored in _source but not indexed, so mappings don't conflict when different job types arrive.
- Workers for every queue mozart routes to: hello_world_worker, system-jobs-queue (verdi), and orchestrator_{jobs,datasets}, user_rules_{job,dataset,trigger}, process_events_tasks, on_demand_{job,dataset} (mozart).
- Logstash indexers running on mozart (job/task/worker/event status) and metrics (logstash-YYYY.MM.dd for the dashboards), using the OpenSearch output plugin over HTTPS with admin auth.
- Unified origin at :8480. The hysds_ui nginx proxies /mozart/, /grq/, /mozart_es/, /grq_es/, /dashboards/, /swaggerui/, and /verdi/ to the right backends. One tunnel / one port is enough for day-to-day use.
http://<host>:8480/
Combined React SPA. Top-bar nav:
- Tosca: faceted dataset browser backed by grq_* indices. Map view, metadata viewer, on-demand submissions against selected result sets.
- Figaro: job monitor backed by job_status-*. Resource filters (job/task/worker/event), status facets, tags, durations, and a View Job link that opens the on-disk job dir in verdi's WebDAV.
- On-Demand → Submit job: dropdown auto-populates with the 14 registered specs. The queue list is pulled live from RabbitMQ's admin API.
- Sources dropdown: Mozart REST Swagger, GRQ REST Swagger, OpenSearch Dashboards (Metrics), and RabbitMQ Admin.
http://<host>:8480/dashboards/ (login: admin / ${OPENSEARCH_PASSWORD})
Primary connection is metrics-elasticsearch (logstash-*). Default
landing view has Job Metrics and Worker Metrics pre-loaded. Add
grq-elasticsearch or mozart-elasticsearch as additional data sources
from Stack Management → Data Sources to query the catalog or job status
indices from the same UI.
If a dashboard looks empty, widen the time range (top-right) — the default "Last 15 minutes" is strict. "Today" is a safe bet on a fresh install.
http://<host>:15672/ (login: ${RABBITMQ_USER} / ${RABBITMQ_PASSWORD})
The broker admin UI. Check queue depths, consumer counts, message rates — this is where to look when jobs seem "stuck."
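
The same numbers are available from the command line through RabbitMQ's management HTTP API. A minimal sketch, assuming the queues live on the default vhost `/` (URL-encoded as `%2F`) and that curl and jq are installed; `queue_api_url` and `queue_summary` are names invented for this example.

```shell
#!/bin/sh
# Sketch: ask the RabbitMQ management API about one queue.

# queue_api_url HOST QUEUE: URL for a queue on the default vhost "/" (encoded as %2F).
queue_api_url() {
    echo "http://$1:15672/api/queues/%2F/$2"
}

# queue_summary HOST QUEUE: print message and consumer counts (needs curl + jq).
queue_summary() {
    curl -fsS -u "${RABBITMQ_USER:-guest}:${RABBITMQ_PASSWORD:-guest}" \
        "$(queue_api_url "$1" "$2")" |
        jq -r '"messages=\(.messages) consumers=\(.consumers)"'
}
```

`queue_summary localhost hello_world_worker` reporting `consumers=0` would suggest that no worker is attached to that queue.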
http://<host>:5555/
Per-worker view of task throughput and in-flight work. Complements RabbitMQ's queue-centric view.
| URL | What it is |
|---|---|
| http://<host>:8480/mozart/api/v0.1/ | Mozart REST Swagger (Try-it-out works) |
| http://<host>:8480/grq/api/v0.1/ | GRQ REST Swagger |
| http://<host>:8085/ | Verdi WebDAV: browse job dirs |
When the cluster is on a remote host, the simplest way to reach the UIs is an
SSH port-forward. scripts/tunnel.sh opens all of them at once:
# generic SSH host
./scripts/tunnel.sh ssh user@hysds-host.example.com
# Google Cloud Compute Engine instance
./scripts/tunnel.sh gcloud
# override target via env:
# GCP_INSTANCE=<name> GCP_ZONE=<zone> ./scripts/tunnel.sh gcloud

It prints the local URLs with credentials auto-substituted from your .env:
HySDS UI (Figaro + Tosca) http://localhost:8480/
OpenSearch Dashboards http://localhost:5601/dashboards/ (admin / ...)
Flower (celery monitor) http://localhost:5555/
RabbitMQ Admin http://localhost:15672/ (guest / guest)
Mozart REST (Swagger) http://localhost:8888/api/v0.1/
GRQ REST (Swagger) http://localhost:8878/api/v0.1/
Verdi WebDAV (job dirs) http://localhost:8085/
Metrics UI http://localhost:8380/
Everything also works through http://localhost:8480/... — the hysds_ui
reverse proxy gives you one tunnel / one origin for all UIs.
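
If you prefer to see what such a tunnel amounts to, the sketch below builds the equivalent `ssh -L` flags by hand. It is an illustration under assumptions, not a copy of scripts/tunnel.sh: the port list is taken from the URLs above, and `forward_args` and `open_tunnel` are hypothetical helper names.

```shell
#!/bin/sh
# Sketch of a multi-port SSH forward; scripts/tunnel.sh may do this differently.

# Ports taken from the URL list above.
HYSDS_PORTS="8480 5601 5555 15672 8888 8878 8085 8380"

# forward_args PORT...: print one "-L port:localhost:port" flag per port.
forward_args() {
    args=""
    for port in "$@"; do
        args="$args -L $port:localhost:$port"
    done
    echo "${args# }"
}

# open_tunnel USER@HOST: forward every port; -N keeps the session open without a remote command.
open_tunnel() {
    # Word splitting of the flag list and port list is intentional here.
    # shellcheck disable=SC2046,SC2086
    ssh -N $(forward_args $HYSDS_PORTS) "$1"
}
```

Since the hysds_ui proxy fronts every backend, `ssh -N -L 8480:localhost:8480 user@hysds-host.example.com` on its own is often enough.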
All credentials come from .env. Both install.sh and scripts/tunnel.sh
echo them next to the URLs they apply to.
| Interface | Username | Password |
|---|---|---|
| OpenSearch Dashboards + REST / index APIs | admin | ${OPENSEARCH_PASSWORD} |
| RabbitMQ Management | ${RABBITMQ_USER} (default guest) | ${RABBITMQ_PASSWORD} (default guest) |
| Figaro / Tosca / Mozart REST / GRQ REST / Flower / Verdi WebDAV | no auth | no auth |
- RabbitMQ: edit the .env vars, then run ./install.sh mozart. install.sh calls rabbitmqctl change_password live, so no state wipe is needed. The mozart/grq/verdi celery configs read the same env vars on the next container restart, so everything stays in sync.
- OpenSearch: edit OPENSEARCH_PASSWORD in .env. Then either change it through the Dashboards security UI, or delete $HYSDS_HOME/{mozart,grq,metrics}/elasticsearch/data and re-run ./install.sh all to pick up the new value on a fresh data dir.
Security note: the defaults are fine for a laptop demo. For anything shared or exposed beyond localhost, rotate both before opening the ports.
hello-world/ is a minimal working template. To run your own algorithm, copy
it and change four files:
- Dockerfile: FROM docker.io/hysds/pge-base:${HYSDS_VERSION}, then COPY your code in. Keeping this base gives you the layout HySDS expects (/home/ops → /root symlink, verdi/bin/activate, etc.).
- run_<pge>.sh: your entrypoint. The framework writes _context.json in the working directory with the input parameters; you produce one <dataset_id>/ per output with .dataset.json and .met.json inside.
- hysds-io.json: the input contract shown in Figaro's submit form. Must include job-version (mozart's on-demand endpoint requires it).
- job-spec.json: the job specification (command, container, required queues, disk/time limits). The container field is mandatory.
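
To make the entrypoint contract concrete, here is a bare skeleton of what a run_<pge>.sh could look like. It is a sketch under assumptions: the `my_product-<timestamp>` id scheme and the single `version` field are placeholders chosen for illustration, and hello-world/run_hello_world.sh remains the reference for what the framework actually expects.

```shell
#!/bin/sh
# Hypothetical run_<pge>.sh skeleton; file contents are placeholders.

# write_product DATASET_ID: create DATASET_ID/ holding the two JSON files,
# each named after the dataset id.
write_product() {
    mkdir -p "$1"
    # "version" is an assumed minimal field; check hello-world/ for the real contract.
    printf '{"version": "v1.0"}\n' > "$1/$1.dataset.json"
    printf '{}\n' > "$1/$1.met.json"
}

# main: runs in the job working directory, next to _context.json.
main() {
    [ -f _context.json ] || { echo "_context.json not found" >&2; return 1; }
    dataset_id="my_product-$(date -u +%Y%m%dT%H%M%SZ)"
    write_product "$dataset_id"
    echo "wrote $dataset_id/"
}

# main "$@"   # uncomment when using this as a real entrypoint
```

A real PGE would read its parameters out of _context.json and fill the metadata files with actual values before returning.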
Register the same way the smoke test registers hello-world — see
scripts/register-hello-world.sh for the three curl calls. If your PGE uses
a new queue, add a [program:<queue>] block to
config-templates/verdi/supervisord.conf and re-run ./install.sh verdi.
Submit from the CLI:
curl -X POST \
--data-urlencode "type=job-<pge>:<version>" \
--data-urlencode "queue=<queue>" \
--data-urlencode "priority=5" \
--data-urlencode 'tags=["my-test"]' \
http://localhost:8480/mozart/api/v0.1/job/submit

…or submit from Figaro's On-Demand → Submit job form once registered.
Copy .env.example to .env and edit. The non-trivial knobs:
| Variable | Default | Notes |
|---|---|---|
| HYSDS_VERSION | v6.1.2 | Tag for all hysds/* images. |
| OPENSEARCH_VERSION | 2.15.0 | OpenSearch 2.x replaces the deprecated hysds/elasticsearch. |
| OPENSEARCH_PASSWORD | HySDS!Bundle.0ffl1ne | Must satisfy the OS 2.12+ policy: ≥8 chars, mixed case, digit, symbol, not resembling the username. |
| RABBITMQ_USER / RABBITMQ_PASSWORD | guest / guest | Wired through to celery broker URLs in all components. Rotation is live. |
| HYSDS_HOME | $HOME/hysds_home† | Root for etc/, log/, data/ per component. ≥30 GB. |
| DATA_DIR | ${HYSDS_HOME}/verdi_data† | Verdi job scratch. Kept at the same path on host and inside verdi so PGE containers see identical mount paths. |
| HOST_UID / HOST_GID | actual uid/gid of the running user† | Must match the user that owns podman.sock. |
| XDG_RUNTIME_DIR | /run/user/<uid>† | Where rootless podman places its socket. |
| ES_HEAP_SIZE | 2048 | MB per OpenSearch node. Three nodes total (mozart, grq, metrics). |
| METRICS_FQDN | localhost | CORS allow-origin for metrics UI. |
| HYSDS_UI_PORT | 8480 | Where the combined Figaro+Tosca UI listens. |
| HYSDS_UI_REF | v1.3.1 | Git tag/branch of hysds/hysds_ui to build. |
| VENUE | HySDS | String shown in the UI's top banner. |
| *_PORT overrides | (commented out) | Uncomment if defaults clash with something else on the host. |

† .env.example ships these as /home/ops/hysds_home, HOST_UID=1000, XDG_RUNTIME_DIR=/run/user/1000. scripts/bootstrap.sh rewrites them on first run to match the actual user (uid, $HOME), so on a typical fresh install you don't touch them. If you skip bootstrap.sh and run on a shared host, edit by hand to match id -u / id -g / your home dir.
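
Because a weak OPENSEARCH_PASSWORD only surfaces when OpenSearch first boots, it can help to check the value locally before installing. The sketch below encodes just the rules listed in the table. `password_ok` is a hypothetical helper, the "contains admin" test is a crude stand-in for "not resembling the username", and OpenSearch's own validation still has the final say.

```shell
#!/bin/sh
# Local pre-check for OPENSEARCH_PASSWORD; not a substitute for OpenSearch's validation.

# password_ok PASSWORD: succeed when the rules from the table hold.
password_ok() {
    pw=$1
    [ "${#pw}" -ge 8 ] || return 1
    # Each case returns failure unless the required character class is present.
    case $pw in *[[:upper:]]*) ;; *) return 1 ;; esac
    case $pw in *[[:lower:]]*) ;; *) return 1 ;; esac
    case $pw in *[[:digit:]]*) ;; *) return 1 ;; esac
    case $pw in *[![:alnum:]]*) ;; *) return 1 ;; esac
    # Crude approximation: reject anything that contains "admin" in any letter case.
    case $(printf '%s' "$pw" | tr '[:upper:]' '[:lower:]') in
        *admin*) return 1 ;;
    esac
    return 0
}
```

For instance, `password_ok "$OPENSEARCH_PASSWORD" || echo "pick a stronger password"` can run right after editing .env.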
hysds-quick-start/
├── README.md
├── .env.example versions, paths, ports, passwords
├── bundle.sh staging-host: pull + build + save tarball
├── install.sh install-host: verify, load, seed, compose up
├── teardown.sh stop/remove containers (keeps data)
│
├── mozart/
│ ├── Dockerfile extends hysds/mozart with logstash-output-opensearch
│ └── compose.yml mozart + rabbitmq + redis + opensearch
├── grq/compose.yml grq + redis + opensearch
├── metrics/
│ ├── Dockerfile extends hysds/metrics with logstash-output-opensearch
│ └── compose.yml metrics + redis + opensearch
├── verdi/compose.yml worker (mounts host podman.sock)
├── dashboards/
│ ├── compose.yml stock opensearchproject/opensearch-dashboards
│ ├── opensearch_dashboards.yml disables multi-tenancy + basePath config
│ └── savedobjects/ pre-imported Job/Worker Metrics dashboards
├── hysds_ui/
│ ├── Dockerfile builds the React SPA and serves via nginx
│ ├── compose.yml
│ ├── default.conf.template nginx reverse-proxy + sub_filter rules
│ └── 15-compute-basic-auth.envsh base64-encodes admin creds at container start
│
├── hello-world/ minimal PGE used by smoke-test
│ ├── Dockerfile
│ ├── run_hello_world.sh
│ ├── hysds-io.json
│ └── job-spec.json
│
├── config-templates/ overlays applied on top of image defaults
│ ├── mozart/
│ │ ├── celeryconfig.py
│ │ ├── indexer.conf logstash OpenSearch pipeline
│ │ ├── settings.cfg
│ │ ├── supervisord.conf trimmed to only what this quick-start needs
│ │ └── opensearch/ upstream *_status.template mappings
│ ├── grq/{celeryconfig.py, grq2_settings.cfg, supervisord.conf}
│ ├── verdi/{celeryconfig.py, datasets.json, supervisord.conf, …}
│ └── metrics/{celeryconfig.py, indexer.conf, supervisord.conf}
│
└── scripts/
├── bootstrap.sh host + per-user prereqs; auto-fills .env
├── register-hello-world.sh register the smoke-test PGE
├── register-lightweight-jobs.sh build + register the 13 system jobs
├── smoke-test.sh end-to-end register → submit → verify
└── tunnel.sh SSH port-forwards + credential banner
# Follow logs from one component
(cd mozart && podman compose logs -f)
# Restart a single container
podman restart verdi
# Inspect supervisord-managed processes inside a component
podman exec mozart /root/mozart/bin/supervisorctl status
podman exec verdi /root/verdi/bin/supervisorctl status
# Bring one component down (keeps data)
./teardown.sh mozart
# Bring everything down (keeps data)
./teardown.sh all
# Wipe persistent data — IRREVERSIBLE
./teardown.sh all
rm -rf "$HYSDS_HOME"

| Symptom | Likely cause / fix |
|---|---|
| install.sh exits with .env HOST_UID=… but real uid=… | You're running as a different user than .env was set up for. Re-run ./scripts/bootstrap.sh to auto-fix, or hand-edit .env. |
| mkdir /run/user/.../libpod: permission denied | The user-level podman socket isn't enabled for your Linux account. Run ./scripts/bootstrap.sh (idempotent), or manually run systemctl --user enable --now podman.socket, then sudo loginctl enable-linger $USER. |
| OpenSearch exits: max virtual memory areas | Run sudo sysctl -w vm.max_map_count=262144 (persist via /etc/sysctl.d/). bootstrap.sh does this. |
| OpenSearch on first boot: weak password | OPENSEARCH_PASSWORD must be ≥8 chars with upper+lower+digit+symbol and not resemble admin. |
| podman compose: unknown subcommand | Install podman 4.0+, or pip install --user podman-compose. bootstrap.sh handles this. |
| install.sh errors with bundle.tar.gz not found | You're on an older install.sh that lacks the online-pull fallback; git pull and re-run. |
| Smoke test hangs in step 3 (count never increments) | podman exec verdi /root/verdi/bin/supervisorctl status — hello_world_worker should be RUNNING. Also check RabbitMQ Admin → queue → Consumers. |
| Smoke test step 4: dataset missing in grq, file exists on disk | Publish succeeded, indexing didn't. podman exec verdi tail -200 /root/verdi/log/hello_world_worker.log — look for OpenSearch auth/SSL errors. |
| On-Demand job form shows no queue options | Mozart can't reach RabbitMQ admin API. Verify RABBITMQ_USER / RABBITMQ_PASSWORD in .env match the broker. |
| Figaro tab is empty | Widen the time range (top-right). The default "Last 15 minutes" is strict. |
| Lightweight-job submission fails with "Name or service not known" on mozart-elasticsearch | Verdi is spawning PGEs off-network. Confirm network: hysds_net is in config-templates/verdi/celeryconfig.py:PODMAN_CFG.cmd_base, then ./install.sh verdi. |
| Lightweight-jobs not in Figaro's On-Demand dropdown | Their registration runs at the end of ./install.sh all. If you ran a partial install or aborted early, run ./scripts/register-lightweight-jobs.sh directly — it expects mozart and grq REST endpoints to be reachable. |
- Upstream HySDS docs: https://hysds.github.io/
- Source repos:
  - hysds/hysds: core engine (Celery, orchestrator, container runtime)
  - hysds/mozart: orchestrator REST + Figaro UI backend
  - hysds/grq2: dataset catalog REST + Tosca UI backend
  - hysds/hysds_ui: combined Figaro+Tosca React app (we build + ship this)
  - hysds/lightweight-jobs: the 13 system jobs (we register these)
  - hysds/hysds-dockerfiles: image build definitions we pull
  - hysds/hysds-framework: production multi-host installer (Puppet-based)
- Multi-host / Kubernetes: see hysds-k8s upstream.
When your own PGE runs locally here, the natural next step is a multi-host
deployment (mozart on one VM, grq on another, several verdi workers on
heterogeneous hardware) via the upstream hysds-framework installer. The
compose files and config templates in this repo use the same image tags and
config keys as production, so anything you tune here ports forward.