HySDS Quick Start

A self-contained, single-host deployment of HySDS, the Hybrid Cloud Science Data System framework that NASA missions use to process science data at scale. It is designed to get you from git clone to a running cluster with registered jobs, populated dashboards, and a working smoke-test submission in about 30 minutes.

If you've never touched HySDS before, start here.


What is HySDS?

HySDS is an open-source framework for scheduling, executing, and cataloging containerized science-data jobs. NASA missions (NISAR, SWOT, OCO, OPERA…) use it to run hundreds of thousands of jobs per day across mixed on-prem + cloud infrastructure.

| Problem | HySDS answer |
|---|---|
| Where do jobs go? | A mozart orchestrator pushes job specs onto Celery/RabbitMQ queues. |
| Who runs them? | verdi workers pull from queues and spawn containerized PGEs. |
| Where do products go? | The PGE writes to a configured object store; grq catalogs the metadata. |
| How do I see what's happening? | A metrics node ingests events; Figaro, Tosca, and OpenSearch Dashboards visualize them. |

A HySDS "PGE" (Product Generation Executable) is just a container with a declared command and input/output contract. HySDS knows nothing about your science — it knows how to schedule, execute, persist, and catalog containers reliably.


Architecture at a glance

A production HySDS deployment spans many hosts. This quick-start collapses everything onto one Linux host using rootless podman and podman compose.

                 ┌────────────────────────────────────────────────┐
                 │                your single host                │
                 │                                                │
   browser ────▶ │  ┌────────────────┐ ┌────────────────┐         │
                 │  │   hysds_ui     │ │  dashboards    │         │
                 │  │ Figaro + Tosca │ │  OpenSearch    │         │
                 │  │  React SPA +   │ │ + Job/Worker   │         │
                 │  │  nginx proxy   │ │  Metrics       │         │
                 │  └───────┬────────┘ └───────┬────────┘         │
                 │          │                  │                  │
                 │   /mozart/api ─▶ mozart ◀──▶│                  │
                 │   /grq/api    ─▶ grq    ◀──▶│                  │
                 │   /*_es       ─▶ opensearch─┘                  │
                 │                                                │
                 │  ┌──────────┐  ┌──────────┐  ┌──────────┐      │
                 │  │  mozart  │  │   grq    │  │ metrics  │      │
                 │  │ orchestr │  │ catalog  │  │ host +   │      │
                 │  │ + REST + │  │ + REST + │  │ job info │      │
                 │  │ RabbitMQ │  │ OS 2.x   │  │ OS 2.x   │      │
                 │  │ OS 2.x   │  └──────────┘  └──────────┘      │
                 │  └────┬─────┘                                  │
                 │       │ celery task                            │
                 │       ▼                                        │
                 │  ┌──────────┐   spawn    ┌──────────────┐      │
                 │  │  verdi   │ ─────────▶ │  PGE         │      │
                 │  │  worker  │ podman.sock│  container   │      │
                 │  └──────────┘            └──────────────┘      │
                 │                                                │
                 │  Persistent state: $HYSDS_HOME on host fs      │
                 └────────────────────────────────────────────────┘

Components

Every component runs on the shared podman network hysds_net and addresses its peers by compose service name (e.g. mozart-rabbitmq, grq-elasticsearch).

| Component | Role | Primary port |
|---|---|---|
| hysds_ui | React Figaro + Tosca SPA + nginx proxy for all backends | :8480 |
| mozart | Job orchestration, REST API, RabbitMQ broker, celery workers | :8888, :15672 |
| grq | Dataset catalog (geo-region query) + REST | :8878 |
| metrics | Host + per-job metrics aggregator with logstash indexer | :9500 |
| verdi | Worker: pulls jobs, spawns PGE containers via podman.sock | :8085 (WebDAV) |
| dashboards | OpenSearch Dashboards pre-loaded with Job/Worker metrics | :5601 |

Job lifecycle

  1. POST /job/submit          ─▶  mozart REST receives job_spec + queue + tags
  2. orchestrator_jobs worker  ─▶  validates spec, pushes celery task onto the queue
  3. RabbitMQ                  ─▶  fans task out to a verdi worker
  4. verdi                     ─▶  via podman.sock spawns the PGE container
  5. PGE                       ─▶  runs; writes <dataset>.dataset.json + .met.json
  6. verdi post-process        ─▶  publishes dataset to the configured store,
                                   indexes metadata in grq, metrics in metrics
  7. orchestrator_datasets     ─▶  evaluates user_rules to fire downstream jobs

Step 5 is the only piece you write. Everything else is the framework.


Prerequisites

Runs on one Linux host (laptop with virtualization, on-prem server, or cloud VM). macOS and Windows users should install inside a Linux VM — rootless podman on non-Linux hosts is too fiddly for a getting-started exercise.

| Requirement | Check |
|---|---|
| Linux x86_64, ≥ 4 vCPU, ≥ 16 GB RAM | nproc, free -h |
| ~30 GB free disk under $HYSDS_HOME | df -h |
| podman 4.0+ | podman --version |
| podman compose or podman-compose | podman compose version |
| jq, curl, gzip, sha256sum | dnf install jq curl gzip |
| vm.max_map_count ≥ 262144 (OpenSearch) | sysctl vm.max_map_count |
| /etc/subuid, /etc/subgid for your user | grep $USER /etc/subuid |
| User-level podman socket enabled | systemctl --user status podman.socket |
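
To see where a host stands before changing anything, the checks in the table can be wrapped in a small read-only script. The following is a minimal sketch, not part of this repo: the helper names (at_least, have, check, preflight) are made up for illustration, it covers only a subset of the requirements, and the RAM threshold is deliberately loosened because the kernel reports slightly less than the installed memory.

```shell
#!/bin/sh
# Read-only preflight sketch. It only reports; it installs and changes nothing.
# Thresholds come from the requirements table. Helper names are illustrative.

# at_least VALUE MIN: succeeds when VALUE is an integer >= MIN.
at_least() {
  [ "$1" -ge "$2" ] 2>/dev/null
}

# have CMD: succeeds when CMD is on PATH.
have() {
  command -v "$1" >/dev/null 2>&1
}

# check LABEL VALUE MIN: prints one result line for a numeric requirement.
check() {
  if at_least "$2" "$3"; then
    echo "ok    $1"
  else
    echo "FAIL  $1 (found: ${2:-none}, need >= $3)"
  fi
}

preflight() (
  check "vCPU count" "$(nproc 2>/dev/null)" 4
  # MemTotal is in kB and sits a little below installed RAM, so 15 GiB is
  # used here as a lenient stand-in for the 16 GB requirement (assumption).
  check "RAM (kB)" "$(awk '/^MemTotal:/ {print $2}' /proc/meminfo 2>/dev/null)" 15728640
  check "vm.max_map_count" "$(sysctl -n vm.max_map_count 2>/dev/null)" 262144

  for cmd in podman jq curl gzip sha256sum; do
    if have "$cmd"; then echo "ok    $cmd"; else echo "FAIL  $cmd not found"; fi
  done

  # Only matches entries keyed by user name; numeric-uid entries are not checked.
  if grep -q "^$(id -un):" /etc/subuid 2>/dev/null; then
    echo "ok    /etc/subuid entry"
  else
    echo "FAIL  no /etc/subuid entry for $(id -un)"
  fi
)
```

Source the file and call preflight to print one line per requirement. Free disk space and the user-level podman socket are not covered; check those by hand with the commands from the table.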

scripts/bootstrap.sh does this end-to-end (host packages, sysctl, subuid, podman socket, linger, podman-compose, plus a .env populated with the running user's uid / home). Idempotent — safe to re-run on the same host or as a different user.

./scripts/bootstrap.sh

If you'd rather drive the steps by hand:

sudo dnf install -y podman jq curl git python3-pip

sudo sysctl -w vm.max_map_count=262144
echo 'vm.max_map_count=262144' | sudo tee /etc/sysctl.d/99-opensearch.conf

sudo usermod --add-subuids 100000-165535 --add-subgids 100000-165535 $USER

systemctl --user enable --now podman.socket
sudo loginctl enable-linger $USER          # keep user services alive after logout

pip install --user podman-compose          # only if 'podman compose' is missing

Quick start

If your install host can reach docker.io, this is all you need:

git clone <this repo url> hysds-quick-start
cd hysds-quick-start

./scripts/bootstrap.sh      # one-time host + per-user prereqs; auto-fills .env
$EDITOR .env                # set OPENSEARCH_PASSWORD (everything else is auto)

./install.sh all            # ~10-20 min: image pulls + builds + compose up

./scripts/smoke-test.sh     # ~30 sec end-to-end proof

At the end, install.sh prints the UI URLs with credentials. Open http://localhost:8480/ (or forward it — see Accessing the UIs from your laptop).

Air-gapped install

If the install host has no outbound internet, build the image bundle on a staging host that does:

# === on the staging host (has docker.io access) ===
cd hysds-quick-start
cp .env.example .env
./bundle.sh                 # produces hysds-<ver>-bundle.tar.gz + .sha256

Copy the tarball, its .sha256, and this repo to the install host. Then:

# === on the install host ===
cd hysds-quick-start
./scripts/bootstrap.sh      # host + per-user prereqs; auto-fills .env
$EDITOR .env                # rotate OPENSEARCH_PASSWORD, etc. as needed
./install.sh all            # checksum-verifies, podman load, compose up
./scripts/smoke-test.sh

install.sh is idempotent — re-running it re-applies config, rebuilds any stopped components, and re-syncs state without touching persistent data.
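
install.sh verifies the bundle checksum itself, but it can be worth confirming the copy before starting the install. The following is a minimal sketch of a manual check. It assumes the .sha256 file sits next to the tarball as <tarball>.sha256 and begins with the hex digest, as sha256sum output does; the file that bundle.sh actually writes may be named or laid out differently. verify_bundle is a hypothetical helper, not something shipped in this repo.

```shell
#!/bin/sh
# verify_bundle TARBALL: succeeds only if the SHA-256 of TARBALL equals the
# first field of TARBALL.sha256 (assumed layout: "<hex digest>  <filename>").
verify_bundle() (
  tarball=$1
  if [ ! -f "$tarball" ] || [ ! -f "$tarball.sha256" ]; then
    echo "missing $tarball or $tarball.sha256" >&2
    exit 1
  fi
  expected=$(awk '{print $1; exit}' "$tarball.sha256")
  actual=$(sha256sum "$tarball" | awk '{print $1}')
  if [ -n "$expected" ] && [ "$expected" = "$actual" ]; then
    echo "checksum ok: $tarball"
  else
    echo "checksum MISMATCH: $tarball" >&2
    exit 1
  fi
)
```

Comparing digests directly, rather than running sha256sum -c, keeps the check independent of whatever path was recorded in the checksum file on the staging host.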


The smoke test

./scripts/smoke-test.sh exercises the entire pipeline end-to-end:

1/4  Register hello-world specs in mozart
       POST /container/add  docker.io/hysds/hello-world:v1.0
       POST /job_spec/add   job-hello_world:v1.0
       POST /hysds_io/add   hysds-io-hello_world:v1.0    (mozart + grq)

2/4  Submit the job
       POST /job/submit     queue=hello_world_worker
       ◀── { "result": "<celery-task-id>" }

3/4  Poll grq for the published product (timeout 600s)
       baseline hello_world_product count: N
       ─── verdi pulls task; spawns PGE container
       ─── PGE writes dataset.json + met.json + 5 MB fake_data.dat
       ─── verdi publishes to file:///data/datasets/...
       ─── verdi indexes metadata in grq
       count=N+1
       new product indexed in GRQ.

4/4  Verify dataset landed in grq
       SUCCESS: N+1 hello_world_product dataset(s) indexed.

Smoke test passed.

When this prints "Smoke test passed.", you have a fully working HySDS cluster: a registered PGE, a successful job execution, a catalog entry in grq, and the product on disk at $HYSDS_HOME/datasets/hello_world_product/v1.0/<id>/:

hello_world-product-<ts>-<hash>.dataset.json
hello_world-product-<ts>-<hash>.met.json
_publish.context.json
fake_data.dat
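
Step 3/4 of the smoke test is a poll-with-timeout loop. The pattern is useful for your own checks too, so here is a minimal sketch of it. This is not the code in scripts/smoke-test.sh; poll_until is a hypothetical helper, and the command it runs (for example, one that compares the current product count against the baseline) is left to the caller.

```shell
#!/bin/sh
# poll_until TIMEOUT_S INTERVAL_S CMD [ARGS...]
# Runs CMD repeatedly until it succeeds or TIMEOUT_S seconds of sleeping have
# accumulated. Time spent inside CMD itself is not counted.
poll_until() (
  timeout=$1
  interval=$2
  shift 2
  elapsed=0
  while :; do
    # Success ends the loop immediately.
    "$@" && exit 0
    # Give up once the sleep budget is used.
    [ "$elapsed" -ge "$timeout" ] && exit 1
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
)
```

For instance, poll_until 600 5 count_increased would mirror the 600-second timeout above, where count_increased stands for a function you write that succeeds once the count reaches N+1.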

What ships pre-configured

After install.sh all finishes, the cluster already has:

  • 14 registered job specs ready for Figaro's On-Demand form: hello_world plus all 13 upstream HySDS system jobs (lw-mozart-{purge,retry,revoke,reprioritize,notify-by-email}, lw-tosca-{aws_get,notify-by-email,purge,wget,wget-email,wget-glob,wget-product}, lightweight-echo).

  • Two OpenSearch Dashboards (courtesy of upstream sdscli):

    • Job Metrics — throughput, duration distributions, failed jobs, container/version churn.
    • Worker Metrics — worker heartbeat timeline, tasks-per-worker, per-host CPU / memory / disk from instance_stats.
  • OpenSearch index templates + aliases on mozart's cluster for job_status, task_status, worker_status, event_status. Polymorphic subtrees (job.params, job.context, event) are stored in _source but not indexed, so mappings don't conflict when different job types arrive.

  • Workers for every queue mozart routes to: hello_world_worker, system-jobs-queue (verdi), and orchestrator_{jobs,datasets}, user_rules_{job,dataset,trigger}, process_events_tasks, on_demand_{job,dataset} (mozart).

  • Logstash indexers running on mozart (job/task/worker/event status) and metrics (logstash-YYYY.MM.dd for the dashboards), using the OpenSearch output plugin over HTTPS with admin auth.

  • Unified origin at :8480 — the hysds_ui nginx proxies /mozart/, /grq/, /mozart_es/, /grq_es/, /dashboards/, /swaggerui/, and /verdi/ to the right backends. One tunnel / one port is enough for day-to-day use.


The HySDS UIs

HySDS UI — Figaro + Tosca

http://<host>:8480/

Combined React SPA. Top-bar nav:

  • Tosca — faceted dataset browser backed by grq_* indices. Map view, metadata viewer, on-demand submissions against selected result sets.
  • Figaro — job monitor backed by job_status-*. Resource filters (job / task / worker / event), status facets, tags, durations, and a View Job link that opens the on-disk job dir in verdi's WebDAV.
  • On-Demand → Submit job — dropdown auto-populates with the 14 registered specs. The queue list is pulled live from RabbitMQ's admin API.
  • Sources dropdown — Mozart REST Swagger, GRQ REST Swagger, OpenSearch Dashboards (Metrics), and RabbitMQ Admin.

OpenSearch Dashboards — metrics explorer

http://<host>:8480/dashboards/   login: admin / ${OPENSEARCH_PASSWORD}

Primary connection is metrics-elasticsearch (logstash-*). Default landing view has Job Metrics and Worker Metrics pre-loaded. Add grq-elasticsearch or mozart-elasticsearch as additional data sources from Stack Management → Data Sources to query the catalog or job status indices from the same UI.

If a dashboard looks empty, widen the time range (top-right): the default "Last 15 minutes" window is narrow. "Today" is a safe bet on a fresh install.

RabbitMQ Management — queue health

http://<host>:15672/   login: ${RABBITMQ_USER} / ${RABBITMQ_PASSWORD}

The broker admin UI. Check queue depths, consumer counts, message rates — this is where to look when jobs seem "stuck."

Flower — celery monitor

http://<host>:5555/

Per-worker view of task throughput and in-flight work. Complements RabbitMQ's queue-centric view.

Bonus surfaces

| URL | What it is |
|---|---|
| http://<host>:8480/mozart/api/v0.1/ | Mozart REST Swagger (Try-it-out works) |
| http://<host>:8480/grq/api/v0.1/ | GRQ REST Swagger |
| http://<host>:8085/ | Verdi WebDAV: browse job dirs |

Accessing the UIs from your laptop

When the cluster is on a remote host, the simplest way to reach the UIs is an SSH port-forward. scripts/tunnel.sh opens all of them at once:

# generic SSH host
./scripts/tunnel.sh ssh user@hysds-host.example.com

# Google Cloud Compute Engine instance
./scripts/tunnel.sh gcloud
# override target via env:
#   GCP_INSTANCE=<name> GCP_ZONE=<zone> ./scripts/tunnel.sh gcloud

It prints the local URLs with credentials auto-substituted from your .env:

  HySDS UI (Figaro + Tosca)     http://localhost:8480/
  OpenSearch Dashboards         http://localhost:5601/dashboards/   (admin / ...)
  Flower (celery monitor)       http://localhost:5555/
  RabbitMQ Admin                http://localhost:15672/             (guest / guest)
  Mozart REST (Swagger)         http://localhost:8888/api/v0.1/
  GRQ REST (Swagger)            http://localhost:8878/api/v0.1/
  Verdi WebDAV (job dirs)       http://localhost:8085/
  Metrics UI                    http://localhost:8380/

Everything also works through http://localhost:8480/... — the hysds_ui reverse proxy gives you one tunnel / one origin for all UIs.
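
If you prefer not to use scripts/tunnel.sh, the same effect can be had with plain ssh -L forwards. The following is a minimal sketch, not the script's actual implementation. It assumes each service is reachable on localhost of the remote host at the same port numbers listed above; forward_args and open_tunnel are hypothetical helper names.

```shell
#!/bin/sh
# forward_args PORT...: prints one ssh "-L" option per port, one per line,
# mapping each local port to the same port on the remote host's localhost.
forward_args() {
  for port in "$@"; do
    printf '%s\n' "-L$port:localhost:$port"
  done
}

# open_tunnel USER@HOST: holds the forwards open until interrupted (Ctrl-C).
# -N tells ssh to run no remote command. The unquoted $(...) is intentional:
# word splitting turns each printed line into a separate ssh argument.
open_tunnel() {
  ssh -N $(forward_args 8480 5601 5555 15672 8888 8878 8085 8380) "$1"
}
```

Since the hysds_ui proxy fronts every backend, open_tunnel could be reduced to the single port 8480 for day-to-day use.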


Credentials

All credentials come from .env. Both install.sh and scripts/tunnel.sh echo them next to the URLs they apply to.

| Interface | Username | Password |
|---|---|---|
| OpenSearch Dashboards + REST / index APIs | admin | ${OPENSEARCH_PASSWORD} |
| RabbitMQ Management | ${RABBITMQ_USER} (default guest) | ${RABBITMQ_PASSWORD} (default guest) |
| Figaro / Tosca / Mozart REST / GRQ REST / Flower / Verdi WebDAV | no auth | no auth |

Rotating

  • RabbitMQ: edit the .env vars, then run ./install.sh mozart. install.sh calls rabbitmqctl change_password live, so no state wipe is needed. The mozart/grq/verdi celery configs read the same env vars on the next container restart, so everything stays in sync.
  • OpenSearch: edit OPENSEARCH_PASSWORD in .env, then make the clusters match it in one of two ways: change the admin password through the Dashboards security UI, or delete $HYSDS_HOME/{mozart,grq,metrics}/elasticsearch/data and re-run ./install.sh all to pick up the new value on a fresh data dir. The second option discards everything indexed so far.

Security note: the defaults are fine for a laptop demo. For anything shared or exposed beyond localhost, rotate both before opening the ports.


Bringing your own PGE

hello-world/ is a minimal working template. To run your own algorithm, copy it and change four files:

  1. Dockerfile: start with FROM docker.io/hysds/pge-base:${HYSDS_VERSION}, then COPY your code in. Keeping this base gives you the layout HySDS expects (/home/ops/root symlink, verdi/bin/activate, etc.).
  2. run_<pge>.sh: your entrypoint. The framework writes _context.json in the working directory with the input parameters; you produce one <dataset_id>/ directory per output, with .dataset.json and .met.json inside.
  3. hysds-io.json: the input contract shown in Figaro's submit form. It must include job-version (mozart's on-demand endpoint requires it).
  4. job-spec.json: the job specification (command, container, required queues, disk/time limits). The container field is mandatory.
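
The output contract in item 2 can be sketched as a small shell function. This is an illustration only, not the contents of hello-world/run_hello_world.sh. The file naming follows the on-disk listing from the smoke test; the JSON fields written here ("version", "label", "generated_by") are assumptions, so check the hello-world template for what your HySDS version actually requires. write_product is a hypothetical helper name.

```shell
#!/bin/sh
# write_product WORKDIR DATASET_ID
# Creates WORKDIR/DATASET_ID/ holding DATASET_ID.dataset.json and
# DATASET_ID.met.json, then prints the product directory path.
write_product() (
  workdir=$1
  id=$2
  out="$workdir/$id"
  mkdir -p "$out" || exit 1

  # Field names below are assumed for illustration, not taken from this repo.
  printf '{"version": "v1.0", "label": "%s"}\n' "$id" > "$out/$id.dataset.json"
  printf '{"generated_by": "example-pge"}\n' > "$out/$id.met.json"

  # Your algorithm's real output files would be written into "$out" here.
  echo "$out"
)
```

A real entrypoint would first read its inputs from _context.json in the working directory, run the algorithm, and then call something like write_product . "my_product-$(date +%Y%m%dT%H%M%S)" so the product directory lands where verdi's post-processing looks for it.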

Register the same way the smoke test registers hello-world — see scripts/register-hello-world.sh for the three curl calls. If your PGE uses a new queue, add a [program:<queue>] block to config-templates/verdi/supervisord.conf and re-run ./install.sh verdi.

Submit from the CLI:

curl -X POST \
  --data-urlencode "type=job-<pge>:<version>" \
  --data-urlencode "queue=<queue>" \
  --data-urlencode "priority=5" \
  --data-urlencode 'tags=["my-test"]' \
  http://localhost:8480/mozart/api/v0.1/job/submit

…or submit from Figaro's On-Demand → Submit job form once registered.


Configuration reference (.env)

Copy .env.example to .env and edit. The non-trivial knobs:

| Variable | Default | Notes |
|---|---|---|
| HYSDS_VERSION | v6.1.2 | Tag for all hysds/* images. |
| OPENSEARCH_VERSION | 2.15.0 | OpenSearch 2.x replaces the deprecated hysds/elasticsearch. |
| OPENSEARCH_PASSWORD | HySDS!Bundle.0ffl1ne | Must satisfy the OS 2.12+ policy: ≥8 chars, mixed case, digit, symbol, not resembling the username. |
| RABBITMQ_USER / RABBITMQ_PASSWORD | guest / guest | Wired through to celery broker URLs in all components. Rotation is live. |
| HYSDS_HOME | $HOME/hysds_home | Root for etc/, log/, data/ per component. ≥30 GB. |
| DATA_DIR | ${HYSDS_HOME}/verdi_data | Verdi job scratch. Kept at the same path on host and inside verdi so PGE containers see identical mount paths. |
| HOST_UID / HOST_GID | actual uid/gid of the running user† | Must match the user that owns podman.sock. |
| XDG_RUNTIME_DIR | /run/user/<uid> | Where rootless podman places its socket. |
| ES_HEAP_SIZE | 2048 | MB per OpenSearch node. Three nodes total (mozart, grq, metrics). |
| METRICS_FQDN | localhost | CORS allow-origin for metrics UI. |
| HYSDS_UI_PORT | 8480 | Where the combined Figaro+Tosca UI listens. |
| HYSDS_UI_REF | v1.3.1 | Git tag/branch of hysds/hysds_ui to build. |
| VENUE | HySDS | String shown in the UI's top banner. |
| *_PORT overrides | (commented out) | Uncomment if defaults clash with something else on the host. |

† .env.example ships these as HYSDS_HOME=/home/ops/hysds_home, HOST_UID=1000, XDG_RUNTIME_DIR=/run/user/1000. scripts/bootstrap.sh rewrites them on first run to match the actual user (uid, $HOME), so on a typical fresh install you don't touch them. If you skip bootstrap.sh and run on a shared host, edit them by hand to match id -u / id -g / your home dir.
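
Because a weak OPENSEARCH_PASSWORD only surfaces when OpenSearch first boots, it can save a round trip to check the value before running install.sh. The following is a minimal sketch of the four mechanical rules stated for that variable (length, upper case, lower case, digit, symbol). It does not test the "not resembling the username" rule, and it is not the validator OpenSearch itself uses; password_ok is a hypothetical helper name.

```shell
#!/bin/sh
# password_ok PASSWORD: succeeds when PASSWORD has at least 8 characters and
# contains an upper-case letter, a lower-case letter, a digit, and a symbol.
password_ok() (
  pw=$1
  [ "${#pw}" -ge 8 ] || exit 1
  case $pw in *[[:upper:]]*) ;; *) exit 1 ;; esac
  case $pw in *[[:lower:]]*) ;; *) exit 1 ;; esac
  case $pw in *[[:digit:]]*) ;; *) exit 1 ;; esac
  case $pw in *[[:punct:]]*) ;; *) exit 1 ;; esac
  exit 0
)
```

A typical use is password_ok "$OPENSEARCH_PASSWORD" || echo "password too weak" after loading .env into the shell.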


Repository layout

hysds-quick-start/
├── README.md
├── .env.example                    versions, paths, ports, passwords
├── bundle.sh                       staging-host: pull + build + save tarball
├── install.sh                      install-host: verify, load, seed, compose up
├── teardown.sh                     stop/remove containers (keeps data)
│
├── mozart/
│   ├── Dockerfile                  extends hysds/mozart with logstash-output-opensearch
│   └── compose.yml                 mozart + rabbitmq + redis + opensearch
├── grq/compose.yml                 grq + redis + opensearch
├── metrics/
│   ├── Dockerfile                  extends hysds/metrics with logstash-output-opensearch
│   └── compose.yml                 metrics + redis + opensearch
├── verdi/compose.yml               worker (mounts host podman.sock)
├── dashboards/
│   ├── compose.yml                 stock opensearchproject/opensearch-dashboards
│   ├── opensearch_dashboards.yml   disables multi-tenancy + basePath config
│   └── savedobjects/               pre-imported Job/Worker Metrics dashboards
├── hysds_ui/
│   ├── Dockerfile                  builds the React SPA and serves via nginx
│   ├── compose.yml
│   ├── default.conf.template       nginx reverse-proxy + sub_filter rules
│   └── 15-compute-basic-auth.envsh base64-encodes admin creds at container start
│
├── hello-world/                    minimal PGE used by smoke-test
│   ├── Dockerfile
│   ├── run_hello_world.sh
│   ├── hysds-io.json
│   └── job-spec.json
│
├── config-templates/               overlays applied on top of image defaults
│   ├── mozart/
│   │   ├── celeryconfig.py
│   │   ├── indexer.conf            logstash OpenSearch pipeline
│   │   ├── settings.cfg
│   │   ├── supervisord.conf        trimmed to only what this quick-start needs
│   │   └── opensearch/             upstream *_status.template mappings
│   ├── grq/{celeryconfig.py, grq2_settings.cfg, supervisord.conf}
│   ├── verdi/{celeryconfig.py, datasets.json, supervisord.conf, …}
│   └── metrics/{celeryconfig.py, indexer.conf, supervisord.conf}
│
└── scripts/
    ├── bootstrap.sh                host + per-user prereqs; auto-fills .env
    ├── register-hello-world.sh     register the smoke-test PGE
    ├── register-lightweight-jobs.sh build + register the 13 system jobs
    ├── smoke-test.sh               end-to-end register → submit → verify
    └── tunnel.sh                   SSH port-forwards + credential banner

Operations

# Follow logs from one component
(cd mozart && podman compose logs -f)

# Restart a single container
podman restart verdi

# Inspect supervisord-managed processes inside a component
podman exec mozart /root/mozart/bin/supervisorctl status
podman exec verdi  /root/verdi/bin/supervisorctl  status

# Bring one component down (keeps data)
./teardown.sh mozart

# Bring everything down (keeps data)
./teardown.sh all

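# Wipe persistent data (IRREVERSIBLE)
# HYSDS_HOME must be set in this shell (it is defined in .env); the :? guard
# aborts instead of running rm with an empty path.
./teardown.sh all
rm -rf "${HYSDS_HOME:?HYSDS_HOME is not set}"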

Troubleshooting

| Symptom | Likely cause / fix |
|---|---|
| install.sh exits with .env HOST_UID=… but real uid=… | You're running as a different user than .env was set up for. Re-run ./scripts/bootstrap.sh to auto-fix, or hand-edit .env. |
| mkdir /run/user/.../libpod: permission denied | The user-level podman socket isn't enabled for your Linux account. Run ./scripts/bootstrap.sh (idempotent), or manually run systemctl --user enable --now podman.socket, then sudo loginctl enable-linger $USER. |
| OpenSearch exits: max virtual memory areas | sudo sysctl -w vm.max_map_count=262144 (persist via /etc/sysctl.d/). bootstrap.sh does this. |
| OpenSearch on first boot: weak password | OPENSEARCH_PASSWORD must be ≥8 chars with upper+lower+digit+symbol and not resemble admin. |
| podman compose: unknown subcommand | Install podman 4.0+, or pip install --user podman-compose. bootstrap.sh handles this. |
| install.sh errors with bundle.tar.gz not found | You're on an older install.sh that lacks the online-pull fallback. Run git pull and re-run it. |
| Smoke test hangs in step 3 (count never increments) | Run podman exec verdi /root/verdi/bin/supervisorctl status; hello_world_worker should be RUNNING. Also check RabbitMQ Admin → queue → Consumers. |
| Smoke test step 4: dataset missing in grq, file exists on disk | Publish succeeded, indexing didn't. Run podman exec verdi tail -200 /root/verdi/log/hello_world_worker.log and look for OpenSearch auth/SSL errors. |
| On-Demand job form shows no queue options | Mozart can't reach the RabbitMQ admin API. Verify that RABBITMQ_USER / RABBITMQ_PASSWORD in .env match the broker. |
| Figaro tab is empty | Widen the time range (top-right). The default "Last 15 minutes" window is narrow. |
| Lightweight-job submission fails with "Name or service not known" on mozart-elasticsearch | Verdi is spawning PGEs off-network. Confirm network: hysds_net is in config-templates/verdi/celeryconfig.py:PODMAN_CFG.cmd_base, then re-run ./install.sh verdi. |
| Lightweight-jobs not in Figaro's On-Demand dropdown | Their registration runs at the end of ./install.sh all. If you ran a partial install or aborted early, run ./scripts/register-lightweight-jobs.sh directly; it expects the mozart and grq REST endpoints to be reachable. |

Going further

  • Upstream HySDS docs: https://hysds.github.io/
  • Source repos:
    • hysds/hysds — core engine (Celery, orchestrator, container runtime)
    • hysds/mozart — orchestrator REST + Figaro UI backend
    • hysds/grq2 — dataset catalog REST + Tosca UI backend
    • hysds/hysds_ui — combined Figaro+Tosca React app (we build + ship this)
    • hysds/lightweight-jobs — the 13 system jobs (we register these)
    • hysds/hysds-dockerfiles — image build definitions we pull
    • hysds/hysds-framework — production multi-host installer (Puppet-based)
  • Multi-host / Kubernetes: see hysds-k8s upstream.

When your own PGE runs locally here, the natural next step is a multi-host deployment (mozart on one VM, grq on another, several verdi workers on heterogeneous hardware) via the upstream hysds-framework installer. The compose files and config templates in this repo use the same image tags and config keys as production, so anything you tune here ports forward.
