Ruit/joyang/dynamo rollout poc by RayenTian · Pull Request #1 · jthomson04/RL

RayenTian · 2026-05-26T04:58:45Z

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

...

…us monitoring

Adds the rollout-only Dynamo path layer on top of dynamo-k8s-integration:

* nemo_rl/models/generation/dynamo/monitoring/: new subpackage.
  - prometheus.py: scrapes the DGD's worker /metrics endpoint(s) on a background thread, writes raw_scrapes.jsonl + samples.jsonl + an OpenMetrics-formatted dump for offline Prometheus TSDB replay.
  - grafana.py + grafana_dashboard_template.json: builds a Grafana dashboard clipped to the metrics actually captured (panels referring to unseen PromQL series are dropped), with the export's time window baked into the default range.
  - __init__.py.
* examples/nemo_gym/run_dynamo_rollout_only.py: narrower entrypoint than run_grpo_nemo_gym.py. Skips the train policy / logprob stacks so a smoke can reserve all GPUs for Dynamo rollout serving, and hooks in the Prometheus exporter via maybe_start_dynamo_prometheus_monitor.
* examples/nemo_gym/grpo_mini_swe_qwen3_30b_a3b_instruct_2507.yaml: recipe wiring policy.generation.backend=dynamo + the Prometheus exporter config; mini-swe-agent rollout config under env.nemo_gym.
* infra/nrl_k8s/examples/grpo_mini_swe_qwen3_30b_a3b_instruct_2507.rollout.gb300.infra.yaml: RayCluster + DGD pair. RayCluster's gpu-workers are sized to 0 because run_dynamo_rollout_only.py doesn't need a train cluster. Per-user values (`${user:}`) cover the KAI queue, RayCluster name, DGD name, log/metrics dirs, and the HF_HOME injected into the worker via dynamo.serving.overrides (the DGD YAML itself is loaded as plain YAML, so no OmegaConf interpolation there). The entrypoint sets git safe.directory to '*' so PVC-rsync'd repos don't trip git's "dubious ownership" check, then, after the run, co-locates the before/after prom snapshots into the driver's exp_NNN dir.
* infra/nrl_k8s/examples_dgd/qwen3_30b_a3b_instruct_2507_gb300.yaml: DGD manifest. 1x VllmDecodeWorker on GB300, served via the dynamo frontend. Replicas/GPU/TP are designed to be bumped uniformly via the infra YAML's dynamo.serving.overrides; the manifest itself is fully user-agnostic.

Companion edits to existing files (additive, no behavior change for existing recipes):

* nemo_rl/models/generation/dynamo/config.py: add `prometheus_metrics: NotRequired[dict[str, Any]]` to DynamoCfg.
* nemo_rl/utils/logger.py: add `group` + `job_type` NotRequired fields to WandbConfig.
* tests/unit/models/generation/test_dynamo_prometheus.py: pure-mock unit tests for the Prometheus monitor's collection + export shape.

Smoke-validated end-to-end on GB300 against jthomson04/dynamo-k8s-integration tip: 1 SWE-bench Verified sample / step_limit=1 / 1 GPU / TP=1, ~81s rollout. Before/after vllm prom snapshots confirm real traffic reached vllm via the Dynamo frontend (vllm:prompt_tokens_total 0→1535, vllm:generation_tokens_total 0→69).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Jonas Yang <joyang@nvidia.com>
Signed-off-by: ruit <ruit@nvidia.com>
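
The before/after snapshot check described in this commit can be illustrated with a minimal sketch. It is not the code in prometheus.py: `parse_prom_text` and `counter_deltas` are hypothetical helper names, and the parser assumes Prometheus text-format lines without trailing timestamps. The sample values are the two counters quoted in the commit message.

```python
from typing import Dict


def parse_prom_text(text: str) -> Dict[str, float]:
    """Parse Prometheus text-format samples into {series: value}.

    Assumption: no trailing timestamps, so the value is the last
    space-separated token. Comment (#) and blank lines are skipped.
    """
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split from the right so label values containing spaces survive.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def counter_deltas(before: Dict[str, float], after: Dict[str, float]) -> Dict[str, float]:
    """Per-series increase; a series absent from `before` counts from 0."""
    return {name: value - before.get(name, 0.0) for name, value in after.items()}


before = parse_prom_text(
    "# TYPE vllm:prompt_tokens_total counter\n"
    "vllm:prompt_tokens_total 0\n"
    "vllm:generation_tokens_total 0\n"
)
after = parse_prom_text(
    "vllm:prompt_tokens_total 1535\n"
    "vllm:generation_tokens_total 69\n"
)
deltas = counter_deltas(before, after)
```

A non-zero delta on both counters is the signal the commit relies on to show that requests reached vllm through the Dynamo frontend.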

Three independent nrl-k8s bugs, plus one readiness timeout that was too short, surfaced while bringing the mini-swe Qwen3-30B-A3B-Instruct-2507 Dynamo rollout up on the GB300 customer-cpu fleet under AWS SSO. None affects steady-state training; together they're what kept the smoke from succeeding.

1. `k8s.py`: kubernetes-client 36.0.0 uses mismatched auth keys: `load_kube_config()` writes `authorization`, while `auth_settings()` reads `BearerToken`. `load_kubeconfig()` now aliases the bearer token under both keys so SDK calls don't 401 immediately after a fresh `aws eks get-token`.
2. `cli.py`: `nrl-k8s job list` crashed with `TypeError: unsupported format string passed to NoneType.__format__` when an exec-submitter run had no submission_id. Fall back to "(driver)" in the formatter.
3. `dgd.py`: `_frontend_http_ready` now treats HTTP 403 from the API server's services/proxy verb as ready. The smoke account's AWS SSO role legitimately lacks the proxy RBAC verb, but the upstream readiness gates (DGD status, Service endpoints) already confirm the frontend is healthy. 403 here means "not allowed to peek", not "not ready", and should not block the wait.
4. `orchestrate.py`: the `ready_timeout_s` default for `bring_up_cluster` / `ensure_cluster` goes 900 → 1800. The customer-cpu arm64 nodegroup has a postStart hook doing `apt-get install singularity-container squashfuse`, which routinely takes 15-20 min on the public Ubuntu arm64 mirror. The old 15-minute deadline was inside that window; the new 30-minute one isn't.

Signed-off-by: ruit <ruit@nvidia.com>
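
Fixes 2 and 3 can be sketched as follows. This is an illustration under stated assumptions, not the code in `cli.py` or `dgd.py`: `format_job_row` and `frontend_http_ready` are hypothetical names, the column widths are arbitrary, and treating 2xx as ready is an assumption (the commit message only states the 403 case).

```python
from typing import Optional


def format_job_row(name: str, submission_id: Optional[str]) -> str:
    """Render one row of a job listing.

    Passing None to a width/alignment format spec raises
    "TypeError: unsupported format string passed to NoneType.__format__",
    so substitute a placeholder before formatting.
    """
    shown_id = submission_id if submission_id is not None else "(driver)"
    return f"{name:<16}{shown_id:<24}"


def frontend_http_ready(status_code: int) -> bool:
    """Classify a readiness probe's HTTP status.

    Assumption: any 2xx means ready. 403 is also accepted because it
    means the caller lacks the proxy RBAC verb, not that the frontend
    is down; other gates are expected to have confirmed health already.
    """
    return 200 <= status_code < 300 or status_code == 403


row = format_job_row("smoke", None)
```

Usage: `format_job_row("smoke", None)` yields a row containing `(driver)` instead of raising, and `frontend_http_ready(403)` returns `True` while `frontend_http_ready(503)` returns `False`.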

… dashboard fidelity + entrypoint hardening

Iterates on the initial Dynamo Prometheus monitoring POC to make the offline Grafana bundle (data.openmetrics + grafana-dashboard.json + prometheus-offline.yml) actually self-sufficient and self-explanatory: every panel in the bundled dashboard either renders real data or is honestly pruned at export time, with no silent "looks empty" panels from missing scrape coverage or upstream API drift. Also rolls in the mini-swe Dynamo rollout entrypoint hardening that surfaced alongside this work, which guards against the image / uv.lock drift that bit us on every fresh head pod.

* Dashboard template: swap the legacy POC dashboard for the jonas vllm-focused build that has all 8 rows / 48 panels we care about (Deployment Config / Overview / Throughput {bytes,tokens} / Latency / KV Cache & Queue Dynamics / Workload Characterization / GPU (DCGM)). Templated via `$pod_regex` so the same JSON works across per-user DGDs. Strip the `__inputs` / `__requires` import-wizard metadata so post-replay imports don't fall into Grafana's interactive datasource-resolution path with already-rewritten UIDs.
* Prometheus monitor (`prometheus.py`):
  - Add `extra_endpoints` so we can scrape cluster-wide DaemonSets (e.g. `nvidia-dcgm-exporter.gpu-operator:9400`) alongside the DGD's own services. Required to populate the GPU (DCGM) row, which the existing service-template URL can't synthesize.
  - Let each `service_names` entry carry an optional `:port` suffix. The Dynamo frontend exposes config-style gauges (`dynamo_frontend_model_context_length`, `dynamo_frontend_model_max_num_batched_tokens`, `dynamo_frontend_model_max_num_seqs`, `dynamo_frontend_model_total_kv_blocks`) on the 8000 HTTP API port, not the 9090 backend metrics port. With per-service ports, the recipe can keep using the convenience template for both `vllmdecodeworker` (9090) and `frontend:8000` instead of hand-rolling a fully-qualified extra_endpoints URL that bakes in the per-user DGD name.
  - Extend `DEFAULT_METRIC_PREFIXES` with `vllm:`, `sglang:`, `trtllm_`, and `DCGM_` so the include filter doesn't drop any backend metric just because the prefix wasn't predeclared.
  - Coerce dict/BaseModel at the entry point of `maybe_start_dynamo_prometheus_monitor` so the same code path keeps working after upstream PR NVIDIA-NeMo#2325 swapped `MasterConfig` from a TypedDict to a Pydantic model.
* Recipe (`grpo_mini_swe_qwen3_30b_a3b_instruct_2507.yaml`):
  - Remove the `metric_prefixes: ["dynamo_"]` override that was silently filtering out every `vllm:*`, `sglang:*`, and `trtllm_*` sample before they reached `samples.jsonl` (raw scrapes still had them; the dashboard didn't).
  - Add `extra_endpoints` pointing at the in-cluster DCGM exporter.
  - Add `frontend:8000` to `service_names` so the frontend `dynamo_frontend_*` gauges actually get scraped.
  - Set `include_histogram_buckets: true` so `*_bucket` samples flow into `data.openmetrics` and `export_metric_names`. Without this, the Latency panels (TTFT / ITL / E2E / Time in Queue) and the Workload Characterization heatmaps all reference `histogram_quantile(rate(*_bucket[2m]))` series that aren't in the export set, and the dashboard exporter's `_filter_grafana_panels` correctly prunes them, so they end up missing from the offline bundle even though the underlying vllm scrape did emit them.
* `examples/nemo_gym/run_dynamo_rollout_only.py`: adapt the rollout-only driver to the new `MasterConfig` BaseModel API (attribute access on policy/env/data/logger, `OmegaConf.to_container` + `MasterConfig(**config)` instantiation) so the upstream rename doesn't crash the smoke immediately at config load.
* `tests/unit/models/generation/test_dynamo_prometheus.py`: matching test updates for the dashboard / config changes.
* Entrypoint hardening for the GB300 customer-cpu mini-swe rollout infra YAML (`infra/nrl_k8s/examples/grpo_mini_swe_qwen3_30b_a3b_instruct_2507.rollout.gb300.infra.yaml`):
  - `git config --global --add safe.directory '*'` so the PVC-rsync'd checkout, owned by the host uid that wrote it, doesn't trip "dubious ownership" inside a pod running as root.
  - `uv sync --frozen` + `NRL_FORCE_REBUILD_VENVS=true` so the container venv catches up to the PVC's `uv.lock` when the baked image's pinned deps drift from the synced checkout (e.g. the `tensordict` dependency that landed via PR NVIDIA-NeMo#2439 after the `664d29c-49528955` image was baked).
  - Explicit fail-fast preflight on the `/mnt/rl-workspace` mount, the recipe yaml's presence on the synced checkout, and `singularity` being on PATH after postStart. Plain `set -eu` would surface these as confusing later errors; the explicit checks turn them into one-line diagnostics.

Verified against exp_011 on the GB300 smoke: 3 endpoints scraping cleanly (vllmdecodeworker, frontend, dcgm; 12 successful scrapes each), `data.openmetrics` has DCGM 1874 / vllm 4104 (with 3396 bucket lines) / dynamo_frontend 2556, exported dashboard 48 / 48 panels with 0 pruned.

Co-Authored-By: Joyang Yan <joyang@nvidia.com>
Signed-off-by: ruit <ruit@nvidia.com>
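
The per-service `:port` suffix and the prefix include filter from this commit can be sketched roughly as below. The function names are hypothetical, and the assumption that `dynamo_` is already part of the default prefix set is inferred from the removed recipe override rather than stated in the commit; the 9090 default is the backend metrics port named above.

```python
from typing import Sequence, Tuple

# Assumed contents: "dynamo_" plus the four prefixes this commit adds.
EXTENDED_PREFIXES = ("dynamo_", "vllm:", "sglang:", "trtllm_", "DCGM_")


def split_service_port(entry: str, default_port: int = 9090) -> Tuple[str, int]:
    """Split a `service_names` entry into (service, port).

    "frontend:8000" -> ("frontend", 8000); an entry without a numeric
    ":port" suffix falls back to the default backend metrics port.
    """
    name, sep, port = entry.rpartition(":")
    if sep and name and port.isdigit():
        return name, int(port)
    return entry, default_port


def keep_metric(metric_name: str, prefixes: Sequence[str]) -> bool:
    """Include filter: keep a sample only if its name starts with a known prefix."""
    return metric_name.startswith(tuple(prefixes))


targets = [split_service_port(e) for e in ("vllmdecodeworker", "frontend:8000")]
```

With only `("dynamo_",)` as the prefix set, `keep_metric("vllm:prompt_tokens_total", ...)` is `False`, which is the silent filtering the recipe change removes; with `EXTENDED_PREFIXES` it is `True`.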

…is pipeline

Adds a new `examples/swe_bench/` tree where each subfolder is a complete, checked-in spec for one SWE-bench rollout experiment, paired with an `analysis.py` post-processor that turns the raw rollout output into a human-readable summary + a per-LLM-call jsonl that's portable to other tools. Anyone in the same namespace can reproduce a run by name with one command instead of stitching together a recipe, an nrl-k8s infra, and a DGD manifest spread across three different repo paths, and can re-aggregate any past exp_NNN offline without re-running the rollout.

* `examples/swe_bench/run_test.sh`: driver wrapper that takes an experiment name as a positional arg, resolves it to `<exp>/recipe.yaml` + `<exp>/infra.gb300.yaml` (the infra YAML in turn references its sibling `dgd.gb300.yaml`), and walks the standard reinstall → check → run → tail → teardown cycle. PVC rsync is deliberately *not* part of the script: silent auto-sync turned out to be the failure mode behind a long debugging session where a misset `RL_ROOT` shipped `examples/nemo_gym/` to PVC's top level and produced a namespace-package that shadowed Gym's real `nemo_gym` import at runtime. Operators sync the checkout themselves before invoking the script.
* `examples/swe_bench/qwen3_30b_a3b_instruct_2507/`: first experiment. Production rollout against the SWE-bench Verified arm64 subset with mini_swe_agent step_limit=50, concurrency=16, and a 4 worker × 2 GPU × TP=2 DGD on the GB300 customer-cpu fleet. The recipe references `../../nemo_gym/grpo_workplace_assistant_nemotron_nano_v2_9b.yaml` as its `defaults:` base, and the infra entrypoint runs the recipe with only two per-user Hydra injects (`dgd_name`, `frontend_port`).
* `examples/swe_bench/README.md`: folder convention, prereqs (auth, HF cache, SIF cache, dataset row count, KAI queue quota, smoke cluster down), the one-line run command, and the verify-after-finish checklist. Includes guidance for adding new experiments by copying an existing folder and editing the three YAMLs in place.
* `examples/swe_bench/__init__.py` + `examples/swe_bench/analysis.py`: post-rollout analyzer runnable both as a library function (called from the driver's archive step via `from examples.swe_bench.analysis import write_summary`) and standalone for offline re-aggregation (`python examples/swe_bench/analysis.py /path/to/exp_NNN`).
  Reads:
  - `exp_dir/trajectory_collection.jsonl` (NeMo-RL row-per-instance)
  - `exp_dir/trajectories/results/verified/<model>/<instance>/*.traj.json` and `report_*.json` (mini_swe_agent's per-instance native dump)
  - `exp_dir/qwen*-{before,after}-*.prom` (optional; vllm snapshots the entrypoint curls around the rollout)
  Writes:
  - `summary.txt`: five-bucket mutually-exclusive outcome partition (resolved / no_report / patch_failed_apply / empty_patch / wrong_patch; every instance lands in exactly one), independent `exit_status` enum flags (Submitted / LimitsExceeded / CollapseContinued / RetryError / etc., harvested from the data so new enums show up automatically), `truncation_in_any_response` count, per-repo breakdown, token + agent-step aggregates with p50/p95 distributions, latency percentiles (min / p10 / p50 / p90 / p95 / p99 / max / avg), slow-tail (>5 s) request count + excess time, and a server-side time breakdown from the vllm prom diff (Σ TTFT vs Σ e2e split into prefill / decode, with an "implied replicas" multiplier so multi-worker DGD users can reconstruct the aggregate from the single-pod snapshot).
  - `tool_call_timings.jsonl`: one row per LLM API call with token usage, measured `llm_time_ms` (vllm nvext.timing.total_time_ms), derived `tool_time_s` (gap to the next call: agent-side bash + parsing), and `est_prefill_ms` / `est_decode_ms` / `est_itl_ms_used` columns derived from the run's global ITL (vllm responses don't expose per-request TTFT, so this split is necessarily an estimate; the `est_` prefix is explicit).
* `examples/nemo_gym/run_dynamo_rollout_only.py`: `_dump_swe_archive_layout` decorates exp_NNN/ after the existing `_log_trajectory_collection` call so the experiment bundle matches the standalone mini_swe_agent archive shape (see e.g. `script/dynamo/SWE/results/qwen3-30b-instruct-2507-vllm-v3-*`). Three artifacts:
  - `trajectories/`: real dir holding an absolute-path `results` symlink to mini_swe_agent's native `responses_api_agents/.../results/` tree, so per-instance `.traj.json` / `report.json` / `patch.diff` / `run_instance.log` / `test_output.txt` are reachable from one place. Symlink (not copy) keeps the bundle live and ~free of storage cost; the `results/` middle layer matches the standalone archive convention.
  - `manifests/`: copy of the recipe + sibling infra/DGD YAMLs so the exp_NNN bundle stays self-describing even if the checked-in files drift afterward. Also lets `analysis.py` autodetect the model when re-aggregated offline.
  - `summary.txt` + `tool_call_timings.jsonl` via the shared `analysis.write_summary`, wrapped in try/except so an analysis failure can't crash the driver after the trajectories/ and manifests/ steps already succeeded.

Validated in two ways. First, by re-aggregating jthomson04's 2026-05-19 production archive (281 instances): the partition counts (33 resolved + 1 no_report + 1 patch_failed_apply + 91 empty_patch + 155 wrong_patch = 281, four exit_status enums, 10 truncation instances) match the per-file probe on the archive. Second, by running offline against our own PVC exp_003 (16-instance smoke), where the implied-replicas multiplier comes out to 4.23, matching the actual production 4-worker DGD.

Co-Authored-By: Joyang Yan <joyang@nvidia.com>
Signed-off-by: ruit <ruit@nvidia.com>
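
The mutually-exclusive outcome partition described for `summary.txt` can be sketched as a first-match classifier. This is a hypothetical illustration, not `analysis.py`: the record fields (`resolved`, `has_report`, `patch`, `patch_applied`) and the precedence order are assumptions, and the five records are synthetic. Only the bucket names come from the commit message.

```python
from collections import Counter
from typing import Any, Dict, List

BUCKETS = ("resolved", "no_report", "patch_failed_apply", "empty_patch", "wrong_patch")


def classify(instance: Dict[str, Any]) -> str:
    """Return exactly one bucket per instance by checking conditions in order.

    Because the first matching branch returns, the buckets cannot overlap,
    and the final fallback guarantees every instance is counted.
    """
    if instance.get("resolved"):
        return "resolved"
    if not instance.get("has_report"):
        return "no_report"
    if not (instance.get("patch") or "").strip():
        return "empty_patch"
    if not instance.get("patch_applied"):
        return "patch_failed_apply"
    return "wrong_patch"


# Synthetic records, one per bucket (field names are assumptions).
instances: List[Dict[str, Any]] = [
    {"resolved": True, "has_report": True, "patch": "diff", "patch_applied": True},
    {"resolved": False, "has_report": False},
    {"resolved": False, "has_report": True, "patch": "diff", "patch_applied": False},
    {"resolved": False, "has_report": True, "patch": "  "},
    {"resolved": False, "has_report": True, "patch": "diff", "patch_applied": True},
]
counts = Counter(classify(inst) for inst in instances)
```

The invariant worth checking after any re-aggregation is that the bucket counts sum to the instance total, as in the 33 + 1 + 1 + 91 + 155 = 281 check quoted above.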

Signed-off-by: ruit <ruit@nvidia.com>

RayenTian and others added 4 commits May 25, 2026 01:40

RayenTian force-pushed the ruit/joyang/dynamo_rollout_poc branch from b585267 to f9ab8e4 Compare May 26, 2026 09:12

RayenTian added 3 commits May 26, 2026 11:42

update recipe

f15ce39

Signed-off-by: ruit <ruit@nvidia.com>

add separate log dir

75a083e

Signed-off-by: ruit <ruit@nvidia.com>

fix tokenizer 500

54397b9

Signed-off-by: ruit <ruit@nvidia.com>

RayenTian wants to merge 7 commits into dynamo-k8s-integration from ruit/joyang/dynamo_rollout_poc
