Ruit/joyang/dynamo rollout poc#1
Draft
RayenTian wants to merge 7 commits into
…us monitoring
Adds the rollout-only Dynamo path layer on top of dynamo-k8s-integration:
* nemo_rl/models/generation/dynamo/monitoring/ — new subpackage:
- prometheus.py: scrapes the DGD's worker /metrics endpoint(s) on a
background thread, writes raw_scrapes.jsonl + samples.jsonl + an
OpenMetrics-formatted dump for offline Prometheus TSDB replay.
- grafana.py + grafana_dashboard_template.json: builds a Grafana
dashboard clipped to the metrics actually captured (panels referring
to unseen PromQL series are dropped), with the export's time window
baked into the default range.
- __init__.py.
* examples/nemo_gym/run_dynamo_rollout_only.py — narrower entrypoint than
run_grpo_nemo_gym.py. Skips the train policy / logprob stacks so a
smoke can reserve all GPUs for Dynamo rollout serving, and hooks in the
Prometheus exporter via maybe_start_dynamo_prometheus_monitor.
* examples/nemo_gym/grpo_mini_swe_qwen3_30b_a3b_instruct_2507.yaml —
recipe wiring policy.generation.backend=dynamo + the Prometheus
exporter config; mini-swe-agent rollout config under env.nemo_gym.
* infra/nrl_k8s/examples/grpo_mini_swe_qwen3_30b_a3b_instruct_2507.rollout.gb300.infra.yaml
— RayCluster + DGD pair. RayCluster's gpu-workers are sized to 0
because run_dynamo_rollout_only.py doesn't need a train cluster.
Per-user values (`${user:}`) cover the KAI queue, RayCluster name,
DGD name, log/metrics dirs, and the HF_HOME injected into the worker
via dynamo.serving.overrides (DGD YAML itself is loaded as plain
YAML, so no OmegaConf interpolation there). The entrypoint sets
git safe.directory to '*' so PVC-rsync'd repos don't trip git's
"dubious ownership" check, then, after the run, co-locates the
before/after prom snapshots in the driver's exp_NNN dir.
* infra/nrl_k8s/examples_dgd/qwen3_30b_a3b_instruct_2507_gb300.yaml
— DGD manifest. 1x VllmDecodeWorker on GB300, served via the dynamo
frontend. Replicas/GPU/TP are designed to be bumped uniformly via the
infra YAML's dynamo.serving.overrides; the manifest itself is fully
user-agnostic.
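The panel pruning described for grafana.py can be illustrated with a minimal sketch. This is not the code in the PR: the function names, the token-based metric matching, and the assumed dashboard shape (a `panels` list whose entries carry `targets[].expr`) are all hypothetical, chosen only to show the idea of dropping panels that reference series which were never captured.

```python
import re

# Hypothetical sketch, not the PR's grafana.py. A PromQL expression is split
# into identifier-like tokens; a query counts as "covered" if at least one
# token is a metric name that was actually captured.
TOKEN = re.compile(r"[A-Za-z_:][A-Za-z0-9_:]*")


def expr_has_captured_metric(expr, captured):
    """Return True if any identifier in expr is a captured metric name."""
    return any(tok in captured for tok in TOKEN.findall(expr))


def prune_panels(dashboard, captured):
    """Return a copy of dashboard keeping only panels whose queries are covered."""
    kept = []
    for panel in dashboard.get("panels", []):
        exprs = [t.get("expr", "") for t in panel.get("targets", [])]
        # Panels without queries (for example row headers) are kept as-is.
        if not exprs or all(expr_has_captured_metric(e, captured) for e in exprs):
            kept.append(panel)
    return {**dashboard, "panels": kept}
```

Matching on tokens is deliberately crude: it avoids parsing PromQL but would keep a panel whose expression merely mentions one captured metric next to an uncaptured one.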
Companion edits to existing files (additive, no behavior change for
existing recipes):
* nemo_rl/models/generation/dynamo/config.py — add
`prometheus_metrics: NotRequired[dict[str, Any]]` to DynamoCfg.
* nemo_rl/utils/logger.py — add `group` + `job_type` NotRequired fields
to WandbConfig.
* tests/unit/models/generation/test_dynamo_prometheus.py — pure-mock
unit tests for the Prometheus monitor's collection + export shape.
Smoke-validated end-to-end on GB300 against jthomson04/dynamo-k8s-integration
tip: 1 SWE-bench Verified sample / step_limit=1 / 1 GPU / TP=1, ~81s
rollout. Before/after vllm prom snapshots confirm real traffic reached
vllm via the Dynamo frontend (vllm:prompt_tokens_total 0→1535,
vllm:generation_tokens_total 0→69).
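The before/after comparison above can be reproduced offline with a small counter diff over two Prometheus text-format snapshots. The sketch below is an assumption about how one might do it, not the script used for the smoke; `parse_prom_text` and `counter_deltas` are hypothetical names, and values are summed across label sets.

```python
def parse_prom_text(text):
    """Sum sample values per metric name from Prometheus text exposition."""
    totals = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            name = line[: line.index("{")]
            rest = line[line.rindex("}") + 1:]
        else:
            name, _, rest = line.partition(" ")
        fields = rest.split()
        if not fields:
            continue
        try:
            value = float(fields[0])  # an optional timestamp may follow
        except ValueError:
            continue
        totals[name] = totals.get(name, 0.0) + value
    return totals


def counter_deltas(before_text, after_text, names):
    """Return after-minus-before for each requested counter name."""
    before = parse_prom_text(before_text)
    after = parse_prom_text(after_text)
    return {n: after.get(n, 0.0) - before.get(n, 0.0) for n in names}
```

A non-zero delta on both token counters is the signal that requests actually reached the backend rather than being answered elsewhere.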
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Jonas Yang <joyang@nvidia.com>
Signed-off-by: ruit <ruit@nvidia.com>
Four independent nrl-k8s bugs surfaced while bringing the mini-swe
Qwen3-30B-A3B-Instruct-2507 Dynamo rollout up on the GB300 customer-cpu
fleet under AWS SSO. None affects steady-state training; together
they're what kept the smoke from succeeding.
1. `k8s.py`: kubernetes-client 36.0.0 changed the auth header key
between `load_kube_config()` (writes `authorization`) and
`auth_settings()` (reads `BearerToken`). `load_kubeconfig()` now
aliases the bearer token under both keys so SDK calls don't 401
immediately after a fresh `aws eks get-token`.
2. `cli.py`: `nrl-k8s job list` crashed with `TypeError: unsupported
format string passed to NoneType.__format__` when an exec-submitter
run had no submission_id. Fall back to "(driver)" in the formatter.
3. `dgd.py`: `_frontend_http_ready` now treats HTTP 403 from the API
server's services/proxy verb as ready. The smoke account's AWS SSO
role legitimately lacks the proxy RBAC verb, but the upstream
readiness gates (DGD status, Service endpoints) already confirm the
frontend is healthy. 403 here is "not allowed to peek", not "not
ready", and should not block the wait.
4. `orchestrate.py`: the `ready_timeout_s` default of
`bring_up_cluster` / `ensure_cluster` goes 900 → 1800. The
customer-cpu arm64 nodegroup has a postStart hook doing `apt-get
install singularity-container squashfuse`, which routinely takes
15-20 min on the public Ubuntu arm64 mirror. The old 15-minute
deadline was inside that window; the new 30-minute one isn't.
Signed-off-by: ruit <ruit@nvidia.com>
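The `nrl-k8s job list` crash is easy to reproduce in isolation: applying a non-empty format spec to `None` raises exactly that `TypeError`. The sketch below shows the failure and one way to apply the "(driver)" fallback; `format_job_row` and the column widths are hypothetical and not taken from `cli.py`.

```python
def format_job_row(name, submission_id):
    """Format one row of a job listing, tolerating a missing submission id."""
    # f"{None:<24}" raises TypeError because NoneType.__format__ rejects any
    # non-empty format spec, so substitute a placeholder before formatting.
    shown = submission_id if submission_id is not None else "(driver)"
    return f"{name:<20} {shown:<24}"
```

For example, `format_job_row("job-a", None)` yields a padded row containing `(driver)` instead of raising.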
… dashboard fidelity + entrypoint hardening
Iterates on the initial Dynamo Prometheus monitoring POC to make the
offline Grafana bundle (data.openmetrics + grafana-dashboard.json +
prometheus-offline.yml) actually self-sufficient and self-explanatory:
every panel in the bundled dashboard either renders real data or is
honestly pruned at export time, with no silent "looks empty" panels
from missing scrape coverage or upstream API drift. Also rolls in the
mini-swe Dynamo rollout entrypoint hardening that surfaced alongside
this work — guards against image / uv.lock drift that bit us on every
fresh head pod.
* Dashboard template: swap the legacy POC dashboard for the jonas
vllm-focused build that has all 8 rows / 48 panels we care about
(Deployment Config / Overview / Throughput {bytes,tokens} / Latency
/ KV Cache & Queue Dynamics / Workload Characterization / GPU
(DCGM)). Templated via `$pod_regex` so the same JSON works across
per-user DGDs. Strip the `__inputs` / `__requires` import-wizard
metadata so post-replay imports don't fall into Grafana's
interactive datasource-resolution path with already-rewritten UIDs.
* Prometheus monitor (`prometheus.py`):
- Add `extra_endpoints` so we can scrape cluster-wide DaemonSets
(e.g. `nvidia-dcgm-exporter.gpu-operator:9400`) alongside the
DGD's own services. Required to populate the GPU (DCGM) row,
which the existing service-template URL can't synthesize.
- Let each `service_names` entry carry an optional `:port`
suffix. The Dynamo frontend exposes config-style gauges
(`dynamo_frontend_model_context_length`,
`dynamo_frontend_model_max_num_batched_tokens`,
`dynamo_frontend_model_max_num_seqs`,
`dynamo_frontend_model_total_kv_blocks`) on the 8000 HTTP API
port, not the 9090 backend metrics port. With per-service
ports, the recipe can keep using the convenience template for
both `vllmdecodeworker` (9090) and `frontend:8000` instead of
hand-rolling a fully-qualified extra_endpoints URL that bakes
in the per-user DGD name.
- Extend `DEFAULT_METRIC_PREFIXES` with `vllm:`, `sglang:`,
`trtllm_`, and `DCGM_` so the include filter doesn't drop any
backend metric just because the prefix wasn't predeclared.
- Coerce dict/BaseModel at the entry point of
`maybe_start_dynamo_prometheus_monitor` so the same code path
keeps working after upstream PR NVIDIA-NeMo#2325 swapped `MasterConfig`
from a TypedDict to a Pydantic model.
* Recipe (`grpo_mini_swe_qwen3_30b_a3b_instruct_2507.yaml`):
- Remove the `metric_prefixes: ["dynamo_"]` override that was
silently filtering out every `vllm:*`, `sglang:*`, and
`trtllm_*` sample before they reached `samples.jsonl` (raw
scrapes still had them; the dashboard didn't).
- Add `extra_endpoints` pointing at the in-cluster DCGM
exporter.
- Add `frontend:8000` to `service_names` so the frontend
`dynamo_frontend_*` gauges actually get scraped.
- Set `include_histogram_buckets: true` so `*_bucket` samples
flow into `data.openmetrics` and `export_metric_names`. Without
this, the Latency panels (TTFT / ITL / E2E / Time in Queue)
and the Workload Characterization heatmaps all reference
`histogram_quantile(rate(*_bucket[2m]))` series that aren't in
the export set, and the dashboard exporter's
`_filter_grafana_panels` correctly prunes them — so they end
up missing from the offline bundle even though the underlying
vllm scrape did emit them.
* `examples/nemo_gym/run_dynamo_rollout_only.py`: adapt the
rollout-only driver to the new `MasterConfig` BaseModel API
(attribute access on policy/env/data/logger, `OmegaConf.to_container`
+ `MasterConfig(**config)` instantiation) so the upstream change
doesn't crash the smoke immediately at config load.
* `tests/unit/models/generation/test_dynamo_prometheus.py`: matching
test updates for the dashboard / config changes.
* Entrypoint hardening for the GB300 customer-cpu mini-swe rollout
infra YAML (`infra/nrl_k8s/examples/grpo_mini_swe_qwen3_30b_a3b_instruct_2507.rollout.gb300.infra.yaml`):
- `git config --global --add safe.directory '*'` so the
PVC-rsync'd checkout, owned by the host uid that wrote it,
doesn't trip "dubious ownership" inside a pod running as root.
- `uv sync --frozen` + `NRL_FORCE_REBUILD_VENVS=true` so the
container venv catches up to the PVC's `uv.lock` when the baked
image's pinned deps drift from the synced checkout (e.g. the
`tensordict` dependency that landed via PR NVIDIA-NeMo#2439 after the
`664d29c-49528955` image was baked).
- Explicit fail-fast preflight on `/mnt/rl-workspace` mount, the
recipe yaml's presence on the synced checkout, and `singularity`
being on PATH after postStart. Plain `set -eu` would surface
these as confusing later errors; the explicit checks turn them
into one-line diagnostics.
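The per-service `:port` suffix on `service_names` can be sketched as follows. Only the 9090 default and the `frontend:8000` example come from the commit message; the URL template, the function names, and the assumption that `extra_endpoints` entries are already full URLs are hypothetical.

```python
DEFAULT_METRICS_PORT = 9090  # backend metrics port named in the commit message

# Hypothetical template; the real service-template URL is not shown in the PR text.
URL_TEMPLATE = "http://{dgd}-{service}:{port}/metrics"


def split_service_entry(entry, default_port=DEFAULT_METRICS_PORT):
    """Split 'name' or 'name:port' into (name, port)."""
    name, sep, port = entry.partition(":")
    if not sep:
        return name, default_port
    return name, int(port)


def build_scrape_urls(dgd_name, service_names, extra_endpoints=()):
    """Expand service entries through the template, then append extra endpoints."""
    urls = []
    for entry in service_names:
        service, port = split_service_entry(entry)
        urls.append(URL_TEMPLATE.format(dgd=dgd_name, service=service, port=port))
    # Assumed to be fully qualified URLs that need no templating.
    urls.extend(extra_endpoints)
    return urls
```

Keeping the port inside the service entry means the per-user DGD name still only appears in one place, which is the convenience the commit message is protecting.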
Verified against exp_011 on the GB300 smoke: 3 endpoints scraping
cleanly (vllmdecodeworker, frontend, dcgm — 12 successful scrapes
each), `data.openmetrics` has DCGM 1874 / vllm 4104 (with 3396
bucket lines) / dynamo_frontend 2556, exported dashboard 48 / 48
panels with 0 pruned.
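Per-prefix line counts like the ones quoted above can be recomputed from `data.openmetrics` with a short script. This is a sketch under assumptions, not the command used for exp_011: the function name is hypothetical, and the prefix tuple simply combines `dynamo_` with the prefixes this commit adds to `DEFAULT_METRIC_PREFIXES`.

```python
from collections import Counter

# Assumed prefix set: dynamo_ plus the prefixes added in this commit.
PREFIXES = ("dynamo_", "vllm:", "sglang:", "trtllm_", "DCGM_")


def count_samples_by_prefix(openmetrics_text, prefixes=PREFIXES):
    """Count sample lines, and *_bucket lines, per metric-name prefix."""
    counts = Counter()
    buckets = Counter()
    for line in openmetrics_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip comments such as # TYPE and the trailing # EOF
        # The metric name ends at the first '{' or space.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        for prefix in prefixes:
            if name.startswith(prefix):
                counts[prefix] += 1
                if name.endswith("_bucket"):
                    buckets[prefix] += 1
                break
    return dict(counts), dict(buckets)
```

A zero bucket count for `vllm:` would point at `include_histogram_buckets` being off, which is the condition that previously emptied the Latency panels.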
Co-Authored-By: Joyang Yan <joyang@nvidia.com>
Signed-off-by: ruit <ruit@nvidia.com>
…is pipeline
Adds a new `examples/swe_bench/` tree where each subfolder is a complete,
checked-in spec for one SWE-bench rollout experiment, paired with an
`analysis.py` post-processor that turns the raw rollout output into a
human-readable summary + a per-LLM-call jsonl that's portable to other
tools. Anyone in the same namespace can reproduce a run by name with one
command instead of stitching together a recipe, an nrl-k8s infra, and a
DGD manifest spread across three different repo paths, and can
re-aggregate any past exp_NNN offline without re-running the rollout.
* `examples/swe_bench/run_test.sh` — driver wrapper that takes an
experiment name as a positional arg, resolves it to
`<exp>/recipe.yaml` + `<exp>/infra.gb300.yaml` (the infra YAML in turn
references its sibling `dgd.gb300.yaml`), and walks the standard
reinstall → check → run → tail → teardown cycle. PVC rsync is
deliberately *not* part of the script — silent auto-sync turned out
to be the failure mode behind a long debugging session where a misset
`RL_ROOT` shipped `examples/nemo_gym/` to PVC's top level and
produced a namespace-package that shadowed Gym's real `nemo_gym`
import at runtime. Operators sync the checkout themselves before
invoking the script.
* `examples/swe_bench/qwen3_30b_a3b_instruct_2507/` — first experiment.
Production rollout against the SWE-bench Verified arm64 subset with
mini_swe_agent step_limit=50, concurrency=16, and a 4 worker × 2 GPU
× TP=2 DGD on the GB300 customer-cpu fleet. The recipe references
`../../nemo_gym/grpo_workplace_assistant_nemotron_nano_v2_9b.yaml`
as its `defaults:` base, and the infra entrypoint runs the recipe
with only two per-user Hydra injects (`dgd_name`, `frontend_port`).
* `examples/swe_bench/README.md` — folder convention, prereqs (auth,
HF cache, SIF cache, dataset row count, KAI queue quota, smoke
cluster down), the one-line run command, and the verify-after-finish
checklist. Includes guidance for adding new experiments by copying
an existing folder and editing the three YAMLs in place.
* `examples/swe_bench/__init__.py` + `examples/swe_bench/analysis.py` —
post-rollout analyzer runnable both as a library function (called
from the driver's archive step via `from examples.swe_bench.analysis
import write_summary`) and standalone for offline re-aggregation
(`python examples/swe_bench/analysis.py /path/to/exp_NNN`). Reads:
- `exp_dir/trajectory_collection.jsonl` (NeMo-RL row-per-instance)
- `exp_dir/trajectories/results/verified/<model>/<instance>/*.traj.json`
and `report_*.json` (mini_swe_agent's per-instance native dump)
- `exp_dir/qwen*-{before,after}-*.prom` (optional; vllm snapshots
the entrypoint curls around the rollout)
Writes:
- `summary.txt` — five-bucket mutually-exclusive outcome partition
(resolved / no_report / patch_failed_apply / empty_patch /
wrong_patch — every instance lands in exactly one), independent
`exit_status` enum flags (Submitted / LimitsExceeded /
CollapseContinued / RetryError / etc., harvested from the data
so new enums show up automatically), `truncation_in_any_response`
count, per-repo breakdown, token + agent-step aggregates with
p50/p95 distributions, latency percentiles (min / p10 / p50 / p90
/ p95 / p99 / max / avg), slow-tail (>5 s) request count + excess
time, and a server-side time breakdown from the vllm prom diff
(Σ TTFT vs Σ e2e split into prefill / decode, with an
"implied replicas" multiplier so multi-worker DGD users can
reconstruct the aggregate from the single-pod snapshot).
- `tool_call_timings.jsonl` — one row per LLM API call with token
usage, measured `llm_time_ms` (vllm nvext.timing.total_time_ms),
derived `tool_time_s` (gap to the next call — agent-side bash +
parsing), and `est_prefill_ms` / `est_decode_ms` /
`est_itl_ms_used` columns derived from the run's global ITL
(vllm responses don't expose per-request TTFT, so this split is
necessarily an estimate — the `est_` prefix is explicit).
* `examples/nemo_gym/run_dynamo_rollout_only.py` —
`_dump_swe_archive_layout` decorates exp_NNN/ after the existing
`_log_trajectory_collection` call so the experiment bundle matches
the standalone mini_swe_agent archive shape (see e.g.
`script/dynamo/SWE/results/qwen3-30b-instruct-2507-vllm-v3-*`).
Three artifacts:
- `trajectories/` — real dir holding an absolute-path `results`
symlink to mini_swe_agent's native
`responses_api_agents/.../results/` tree, so per-instance
`.traj.json` / `report.json` / `patch.diff` / `run_instance.log`
/ `test_output.txt` are reachable from one place. Symlink (not
copy) keeps the bundle live and ~free of storage cost; the
`results/` middle layer matches the standalone archive convention.
- `manifests/` — copy of the recipe + sibling infra/DGD YAMLs so
the exp_NNN bundle stays self-describing even if the checked-in
files drift afterward. Also lets `analysis.py` autodetect the
model when re-aggregated offline.
- `summary.txt` + `tool_call_timings.jsonl` via the shared
`analysis.write_summary`, wrapped in try/except so an analysis
failure can't crash the driver after the trajectories/ and
manifests/ steps already succeeded.
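The three-artifact layout can be sketched roughly as below. This is an illustration rather than `_dump_swe_archive_layout` itself: the function name and arguments are hypothetical, and `write_summary` is passed in as a plain callable because its real signature is not shown in the PR text.

```python
import shutil
from pathlib import Path


def dump_archive_layout(exp_dir, results_dir, manifest_files, write_summary):
    """Create trajectories/results symlink, copy manifests, run guarded analysis."""
    exp_dir = Path(exp_dir)

    # 1. trajectories/ is a real directory holding an absolute-path symlink.
    trajectories = exp_dir / "trajectories"
    trajectories.mkdir(parents=True, exist_ok=True)
    link = trajectories / "results"
    if link.is_symlink():
        link.unlink()  # refresh a stale link from an earlier run
    link.symlink_to(Path(results_dir).resolve(), target_is_directory=True)

    # 2. manifests/ gets copies so the bundle survives later edits to the repo.
    manifests = exp_dir / "manifests"
    manifests.mkdir(exist_ok=True)
    for src in manifest_files:
        shutil.copy2(src, manifests / Path(src).name)

    # 3. Analysis is best-effort: a failure is reported but never propagated.
    try:
        write_summary(exp_dir)
    except Exception as exc:  # deliberately broad, matching the stated intent
        print(f"analysis failed, continuing: {exc}")
        return False
    return True
```

Running the analysis last, inside the guard, is what keeps an analyzer bug from discarding the symlink and manifest steps that already succeeded.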
Validated by re-aggregating jthomson04's 2026-05-19 production archive
(281 instances) — the partition counts (33 resolved + 1 no_report + 1
patch_failed_apply + 91 empty_patch + 155 wrong_patch = 281, four
exit_status enums, 10 truncation instances) match the per-file probe
on the archive — and by running offline against our own PVC exp_003
(16-instance smoke) where the implied-replicas multiplier comes out
to 4.23, close to the 4 workers of the actual production DGD.
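The "every instance lands in exactly one bucket" property checked above can be expressed as a first-match classifier plus a sum check. The bucket names come from the commit message; the per-instance field names (`report`, `resolved`, `patch`, `patch_applied`) and the order in which the rules are tried are assumptions for illustration, not the logic in `analysis.py`.

```python
from collections import Counter

BUCKETS = ("resolved", "no_report", "patch_failed_apply", "empty_patch", "wrong_patch")


def classify(instance):
    """Assign one instance to exactly one bucket (first matching rule wins)."""
    report = instance.get("report")
    if report is None:
        return "no_report"
    if report.get("resolved"):
        return "resolved"
    if not instance.get("patch", "").strip():
        return "empty_patch"
    if not report.get("patch_applied", True):
        return "patch_failed_apply"
    return "wrong_patch"


def partition(instances):
    """Count instances per bucket and verify the counts cover every instance."""
    counts = Counter({bucket: 0 for bucket in BUCKETS})
    counts.update(classify(inst) for inst in instances)
    # Mutual exclusivity follows from classify returning a single label;
    # this check guards against instances being dropped along the way.
    assert sum(counts.values()) == len(instances)
    return dict(counts)
```

Because each instance gets exactly one label, the five counts always sum to the instance total, which is the invariant behind the 281 check.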
Co-Authored-By: Joyang Yan <joyang@nvidia.com>
Signed-off-by: ruit <ruit@nvidia.com>
Signed-off-by: ruit <ruit@nvidia.com>
Signed-off-by: ruit <ruit@nvidia.com>
Signed-off-by: ruit <ruit@nvidia.com>