
Ruit/joyang/dynamo rollout poc #1

Draft
RayenTian wants to merge 7 commits into dynamo-k8s-integration from ruit/joyang/dynamo_rollout_poc

Conversation

@RayenTian

Collaborator

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

RayenTian and others added 4 commits May 25, 2026 01:40
…us monitoring

Adds the rollout-only Dynamo path layer on top of dynamo-k8s-integration:

* nemo_rl/models/generation/dynamo/monitoring/ — new subpackage:
  - prometheus.py: scrapes the DGD's worker /metrics endpoint(s) on a
    background thread, writes raw_scrapes.jsonl + samples.jsonl + an
    OpenMetrics-formatted dump for offline Prometheus TSDB replay.
  - grafana.py + grafana_dashboard_template.json: builds a Grafana
    dashboard clipped to the metrics actually captured (panels referring
    to unseen PromQL series are dropped), with the export's time window
    baked into the default range.
  - __init__.py.

* examples/nemo_gym/run_dynamo_rollout_only.py — narrower entrypoint than
  run_grpo_nemo_gym.py. Skips the train policy / logprob stacks so a
  smoke can reserve all GPUs for Dynamo rollout serving, hooks in the
  Prometheus exporter via maybe_start_dynamo_prometheus_monitor.

* examples/nemo_gym/grpo_mini_swe_qwen3_30b_a3b_instruct_2507.yaml —
  recipe wiring policy.generation.backend=dynamo + the Prometheus
  exporter config; mini-swe-agent rollout config under env.nemo_gym.

* infra/nrl_k8s/examples/grpo_mini_swe_qwen3_30b_a3b_instruct_2507.rollout.gb300.infra.yaml
  — RayCluster + DGD pair. RayCluster's gpu-workers are sized to 0
  because run_dynamo_rollout_only.py doesn't need a train cluster.
  Per-user values (`${user:}`) cover the KAI queue, RayCluster name,
  DGD name, log/metrics dirs, and the HF_HOME injected into the worker
  via dynamo.serving.overrides (DGD YAML itself is loaded as plain
  YAML, so no OmegaConf interpolation there). The entrypoint sets
  git safe.directory to '*' so PVC-rsync'd repos don't trip git's
  "dubious ownership" check, then after the run co-locates the before/after
  prom snapshots into the driver's exp_NNN dir.

* infra/nrl_k8s/examples_dgd/qwen3_30b_a3b_instruct_2507_gb300.yaml
  — DGD manifest. 1x VllmDecodeWorker on GB300, served via the dynamo
  frontend. Replicas/GPU/TP are designed to be bumped uniformly via the
  infra YAML's dynamo.serving.overrides; the manifest itself is fully
  user-agnostic.
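
The dashboard clipping described for `grafana.py` can be pictured with a small sketch. This is not the code in this PR: `referenced_metrics` and `prune_panels` are hypothetical names, the PromQL handling is a regex heuristic rather than a real parser, and the only Grafana structure assumed is the usual `targets[].expr` field on each panel.

```python
import re

# Pieces of a PromQL expression that never contain metric names.
_SELECTOR = re.compile(r"\{[^}]*\}")  # label matchers, e.g. {pod=~"$pod_regex"}
_RANGE = re.compile(r"\[[^\]]*\]")    # range durations, e.g. [2m]
_GROUPING = re.compile(
    r"\b(?:by|without|on|ignoring|group_left|group_right)\s*\([^)]*\)"
)
# An identifier that is not directly followed by "(" (so not a function call).
_IDENT = re.compile(r"(?<![A-Za-z0-9_:.$])[A-Za-z_:][A-Za-z0-9_:]*\b(?!\s*\()")
_KEYWORDS = {
    "and", "or", "unless", "bool", "offset",
    "by", "without", "on", "ignoring", "group_left", "group_right",
}


def referenced_metrics(expr: str) -> set[str]:
    """Heuristically list the metric names a PromQL expression refers to."""
    for pattern in (_SELECTOR, _RANGE, _GROUPING):
        expr = pattern.sub(" ", expr)
    return {name for name in _IDENT.findall(expr) if name not in _KEYWORDS}


def prune_panels(panels: list[dict], seen_metrics: set[str]) -> list[dict]:
    """Keep a panel only if every metric it queries was actually captured."""
    kept = []
    for panel in panels:
        exprs = [t.get("expr", "") for t in panel.get("targets", [])]
        needed = set().union(*(referenced_metrics(e) for e in exprs))
        if needed <= seen_metrics:  # panels without targets (rows) always pass
            kept.append(panel)
    return kept


panels = [
    {"title": "TTFT p95", "targets": [{"expr":
        'histogram_quantile(0.95, sum by (le) (rate('
        'vllm:time_to_first_token_seconds_bucket{pod=~"$pod_regex"}[2m])))'}]},
    {"title": "GPU util", "targets": [{"expr": "avg(DCGM_FI_DEV_GPU_UTIL)"}]},
    {"title": "Latency", "type": "row"},
]
kept = prune_panels(panels, {"vllm:time_to_first_token_seconds_bucket"})
print([p["title"] for p in kept])
```

With only the vllm bucket series captured, the DCGM panel is dropped while the row header and the TTFT panel survive.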

Companion edits to existing files (additive, no behavior change for
existing recipes):

* nemo_rl/models/generation/dynamo/config.py — add
  `prometheus_metrics: NotRequired[dict[str, Any]]` to DynamoCfg.

* nemo_rl/utils/logger.py — add `group` + `job_type` NotRequired fields
  to WandbConfig.

* tests/unit/models/generation/test_dynamo_prometheus.py — pure-mock
  unit tests for the Prometheus monitor's collection + export shape.

Smoke-validated end-to-end on GB300 against jthomson04/dynamo-k8s-integration
tip: 1 SWE-bench Verified sample / step_limit=1 / 1 GPU / TP=1, ~81s
rollout. Before/after vllm prom snapshots confirm real traffic reached
vllm via the Dynamo frontend (vllm:prompt_tokens_total 0→1535,
vllm:generation_tokens_total 0→69).
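
The before/after comparison can be reproduced offline with a small sketch like the one below. It assumes the snapshots are plain Prometheus text exposition without trailing timestamps; `counter_totals` and `counter_deltas` are hypothetical helpers, and the `model_name` label value is made up for the example.

```python
def counter_totals(snapshot: str) -> dict[str, float]:
    """Sum every sample per metric name, ignoring labels and comment lines."""
    totals: dict[str, float] = {}
    for raw in snapshot.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Assumes "name{labels} value" with no timestamp after the value.
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]
        totals[name] = totals.get(name, 0.0) + float(value)
    return totals


def counter_deltas(before: str, after: str, names: list[str]) -> dict[str, float]:
    """Difference (after - before) for the requested counters."""
    b, a = counter_totals(before), counter_totals(after)
    return {n: a.get(n, 0.0) - b.get(n, 0.0) for n in names}


before = '''# TYPE vllm:prompt_tokens_total counter
vllm:prompt_tokens_total{model_name="m"} 0.0
vllm:generation_tokens_total{model_name="m"} 0.0
'''
after = '''# TYPE vllm:prompt_tokens_total counter
vllm:prompt_tokens_total{model_name="m"} 1535.0
vllm:generation_tokens_total{model_name="m"} 69.0
'''
deltas = counter_deltas(
    before, after, ["vllm:prompt_tokens_total", "vllm:generation_tokens_total"]
)
print(deltas)
```

A non-zero delta on both counters is the evidence that traffic actually reached the worker rather than being answered elsewhere.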

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Jonas Yang <joyang@nvidia.com>
Signed-off-by: ruit <ruit@nvidia.com>
Four independent nrl-k8s problems surfaced while bringing the
mini-swe Qwen3-30B-A3B-Instruct-2507 Dynamo rollout up on the
GB300 customer-cpu fleet under AWS SSO. None affects steady-state
training; together they're what kept the smoke from succeeding.

1. `k8s.py` — kubernetes-client 36.0.0 changed the auth header key
   between `load_kube_config()` (writes `authorization`) and
   `auth_settings()` (reads `BearerToken`). `load_kubeconfig()` now
   aliases the bearer token under both keys so SDK calls don't 401
   immediately after a fresh `aws eks get-token`.

2. `cli.py` — `nrl-k8s job list` crashed with
   `TypeError: unsupported format string passed to NoneType.__format__`
   when an exec-submitter run had no submission_id. Fall back to
   "(driver)" in the formatter.

3. `dgd.py` — `_frontend_http_ready` treats HTTP 403 from the API
   server's services/proxy verb as ready. The smoke account's AWS
   SSO role legitimately lacks the proxy RBAC verb, but the upstream
   readiness gates (DGD status, Service endpoints) already confirm
   the frontend is healthy. 403 here is "not allowed to peek", not
   "not ready", and should not block the wait.

4. `orchestrate.py` — `bring_up_cluster` / `ensure_cluster`
   `ready_timeout_s` default goes 900 → 1800. The customer-cpu arm64
   nodegroup has a postStart hook doing `apt-get install
   singularity-container squashfuse`, which routinely takes 15-20 min
   on the public Ubuntu arm64 mirror. The old 15-minute deadline was
   inside that window; the new 30-minute one isn't.
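
Items 2 and 3 are small enough to sketch. The helpers below are hypothetical illustrations, not the code in `cli.py` or `dgd.py`: `format_submission_id` shows the `None` fallback, and `proxy_status_means_ready` shows one way to treat 403 as non-blocking (counting any 2xx as ready is an assumption made for the sketch).

```python
from typing import Optional


def format_submission_id(submission_id: Optional[str], width: int = 24) -> str:
    """Left-align the id in a column, substituting "(driver)" for None.

    Formatting None directly with a width spec raises
    "TypeError: unsupported format string passed to NoneType.__format__".
    """
    label = submission_id if submission_id is not None else "(driver)"
    return f"{label:<{width}}"


def proxy_status_means_ready(status_code: int) -> bool:
    """Treat 2xx as ready and 403 as "not allowed to peek", not "not ready"."""
    return 200 <= status_code < 300 or status_code == 403


try:
    f"{None:<10}"
except TypeError as exc:
    print("unguarded formatter:", exc)

print(repr(format_submission_id(None, 10)))
print(proxy_status_means_ready(403), proxy_status_means_ready(503))
```

The 403 shortcut is only reasonable because the DGD status and Service endpoint checks have already passed by the time this probe runs.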

Signed-off-by: ruit <ruit@nvidia.com>
… dashboard fidelity + entrypoint hardening

Iterates on the initial Dynamo Prometheus monitoring POC to make the
offline Grafana bundle (data.openmetrics + grafana-dashboard.json +
prometheus-offline.yml) actually self-sufficient and self-explanatory:
every panel in the bundled dashboard either renders real data or is
honestly pruned at export time, with no silent "looks empty" panels
from missing scrape coverage or upstream API drift. Also rolls in the
mini-swe Dynamo rollout entrypoint hardening that surfaced alongside
this work — guards against image / uv.lock drift that bit us on every
fresh head pod.

* Dashboard template: swap the legacy POC dashboard for the jonas
  vllm-focused build that has all 8 rows / 48 panels we care about
  (Deployment Config / Overview / Throughput {bytes,tokens} / Latency
  / KV Cache & Queue Dynamics / Workload Characterization / GPU
  (DCGM)). Templated via `$pod_regex` so the same JSON works across
  per-user DGDs. Strip the `__inputs` / `__requires` import-wizard
  metadata so post-replay imports don't fall into Grafana's
  interactive datasource-resolution path with already-rewritten UIDs.

* Prometheus monitor (`prometheus.py`):
    - Add `extra_endpoints` so we can scrape cluster-wide DaemonSets
      (e.g. `nvidia-dcgm-exporter.gpu-operator:9400`) alongside the
      DGD's own services. Required to populate the GPU (DCGM) row,
      which the existing service-template URL can't synthesize.
    - Let each `service_names` entry carry an optional `:port`
      suffix. The Dynamo frontend exposes config-style gauges
      (`dynamo_frontend_model_context_length`,
      `dynamo_frontend_model_max_num_batched_tokens`,
      `dynamo_frontend_model_max_num_seqs`,
      `dynamo_frontend_model_total_kv_blocks`) on the 8000 HTTP API
      port, not the 9090 backend metrics port. With per-service
      ports, the recipe can keep using the convenience template for
      both `vllmdecodeworker` (9090) and `frontend:8000` instead of
      hand-rolling a fully-qualified extra_endpoints URL that bakes
      in the per-user DGD name.
    - Extend `DEFAULT_METRIC_PREFIXES` with `vllm:`, `sglang:`,
      `trtllm_`, and `DCGM_` so the include filter doesn't drop any
      backend metric just because the prefix wasn't predeclared.
    - Coerce dict/BaseModel at the entry point of
      `maybe_start_dynamo_prometheus_monitor` so the same code path
      keeps working after upstream PR NVIDIA-NeMo#2325 swapped `MasterConfig`
      from a TypedDict to a Pydantic model.

* Recipe (`grpo_mini_swe_qwen3_30b_a3b_instruct_2507.yaml`):
    - Remove the `metric_prefixes: ["dynamo_"]` override that was
      silently filtering out every `vllm:*`, `sglang:*`, and
      `trtllm_*` sample before they reached `samples.jsonl` (raw
      scrapes still had them; the dashboard didn't).
    - Add `extra_endpoints` pointing at the in-cluster DCGM
      exporter.
    - Add `frontend:8000` to `service_names` so the frontend
      `dynamo_frontend_*` gauges actually get scraped.
    - Set `include_histogram_buckets: true` so `*_bucket` samples
      flow into `data.openmetrics` and `export_metric_names`. Without
      this, the Latency panels (TTFT / ITL / E2E / Time in Queue)
      and the Workload Characterization heatmaps all reference
      `histogram_quantile(rate(*_bucket[2m]))` series that aren't in
      the export set, and the dashboard exporter's
      `_filter_grafana_panels` correctly prunes them — so they end
      up missing from the offline bundle even though the underlying
      vllm scrape did emit them.

* `examples/nemo_gym/run_dynamo_rollout_only.py`: adapt the
  rollout-only driver to the new `MasterConfig` BaseModel API
  (attribute access on policy/env/data/logger, `OmegaConf.to_container`
  + `MasterConfig(**config)` instantiation) so the upstream rename
  doesn't crash the smoke immediately at config load.

* `tests/unit/models/generation/test_dynamo_prometheus.py`: matching
  test updates for the dashboard / config changes.

* Entrypoint hardening for the GB300 customer-cpu mini-swe rollout
  infra YAML (`infra/nrl_k8s/examples/grpo_mini_swe_qwen3_30b_a3b_instruct_2507.rollout.gb300.infra.yaml`):
    - `git config --global --add safe.directory '*'` so the
      PVC-rsync'd checkout, owned by the host uid that wrote it,
      doesn't trip "dubious ownership" inside a pod running as root.
    - `uv sync --frozen` + `NRL_FORCE_REBUILD_VENVS=true` so the
      container venv catches up to the PVC's `uv.lock` when the baked
      image's pinned deps drift from the synced checkout (e.g. the
      `tensordict` dependency that landed via PR NVIDIA-NeMo#2439 after the
      `664d29c-49528955` image was baked).
    - Explicit fail-fast preflight on `/mnt/rl-workspace` mount, the
      recipe yaml's presence on the synced checkout, and `singularity`
      being on PATH after postStart. Plain `set -eu` would surface
      these as confusing later errors; the explicit checks turn them
      into one-line diagnostics.
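
The per-service `:port` suffix and the widened prefix filter from the `prometheus.py` bullet can be sketched as follows. Everything here is illustrative: `split_service_entry`, `metrics_url`, `keep_sample`, and the URL template are hypothetical, and `dynamo_` is assumed to be the original default prefix because that is what the removed recipe override listed.

```python
DEFAULT_PORT = 9090  # backend metrics port named in the commit message

# "dynamo_" is assumed to be the pre-existing default; the other four are
# the prefixes this commit adds.
DEFAULT_METRIC_PREFIXES = ("dynamo_", "vllm:", "sglang:", "trtllm_", "DCGM_")

# Hypothetical template; the real service-template URL is not shown in the PR.
URL_TEMPLATE = "http://{dgd}-{service}.{namespace}.svc:{port}/metrics"


def split_service_entry(entry: str, default_port: int = DEFAULT_PORT) -> tuple[str, int]:
    """Split "frontend:8000" into ("frontend", 8000); a bare name gets the default."""
    name, sep, port = entry.partition(":")
    return (name, int(port)) if sep else (name, default_port)


def metrics_url(dgd: str, namespace: str, entry: str) -> str:
    service, port = split_service_entry(entry)
    return URL_TEMPLATE.format(dgd=dgd, service=service, namespace=namespace, port=port)


def keep_sample(metric_name: str, prefixes: tuple[str, ...] = DEFAULT_METRIC_PREFIXES) -> bool:
    """Include filter: keep a sample only if its name starts with a known prefix."""
    return metric_name.startswith(prefixes)


for entry in ("vllmdecodeworker", "frontend:8000"):
    print(metrics_url("my-dgd", "my-ns", entry))
print(keep_sample("vllm:prompt_tokens_total"))
print(keep_sample("vllm:prompt_tokens_total", prefixes=("dynamo_",)))
```

The last line mirrors the recipe bug described above: with the filter narrowed to `dynamo_`, vllm samples are silently discarded.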

Verified against exp_011 on the GB300 smoke: 3 endpoints scraping
cleanly (vllmdecodeworker, frontend, dcgm — 12 successful scrapes
each), `data.openmetrics` has DCGM 1874 / vllm 4104 (with 3396
bucket lines) / dynamo_frontend 2556, exported dashboard 48 / 48
panels with 0 pruned.
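
Per-family line counts like the ones quoted above can be recomputed from a `data.openmetrics` dump with a sketch along these lines. `count_sample_lines` is a hypothetical helper and the sample text is made up; it only assumes one sample per non-comment line.

```python
from collections import Counter

FAMILIES = ("DCGM_", "vllm:", "dynamo_frontend_")


def count_sample_lines(text: str, families: tuple[str, ...] = FAMILIES):
    """Count sample lines per metric family, plus histogram bucket lines."""
    totals: Counter = Counter()
    buckets: Counter = Counter()
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name = line.split("{", 1)[0].split(" ", 1)[0]
        for prefix in families:
            if name.startswith(prefix):
                totals[prefix] += 1
                if name.endswith("_bucket"):
                    buckets[prefix] += 1
                break
    return totals, buckets


sample = '''# TYPE vllm:request_latency_seconds histogram
vllm:request_latency_seconds_bucket{le="1.0"} 3
vllm:prompt_tokens_total 1535
DCGM_FI_DEV_GPU_UTIL{gpu="0"} 87
dynamo_frontend_model_max_num_seqs 256
other_metric 1
'''
totals, buckets = count_sample_lines(sample)
print(dict(totals), dict(buckets))
```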

Co-Authored-By: Joyang Yan <joyang@nvidia.com>
Signed-off-by: ruit <ruit@nvidia.com>
…is pipeline

Adds a new `examples/swe_bench/` tree where each subfolder is a complete,
checked-in spec for one SWE-bench rollout experiment, paired with an
`analysis.py` post-processor that turns the raw rollout output into a
human-readable summary + a per-LLM-call jsonl that's portable to other
tools. Anyone in the same namespace can reproduce a run by name with one
command instead of stitching together a recipe, an nrl-k8s infra, and a
DGD manifest spread across three different repo paths, and can
re-aggregate any past exp_NNN offline without re-running the rollout.

* `examples/swe_bench/run_test.sh` — driver wrapper that takes an
  experiment name as a positional arg, resolves it to
  `<exp>/recipe.yaml` + `<exp>/infra.gb300.yaml` (the infra YAML in turn
  references its sibling `dgd.gb300.yaml`), and walks the standard
  reinstall → check → run → tail → teardown cycle. PVC rsync is
  deliberately *not* part of the script — silent auto-sync turned out
  to be the failure mode behind a long debugging session where a misset
  `RL_ROOT` shipped `examples/nemo_gym/` to PVC's top level and
  produced a namespace-package that shadowed Gym's real `nemo_gym`
  import at runtime. Operators sync the checkout themselves before
  invoking the script.

* `examples/swe_bench/qwen3_30b_a3b_instruct_2507/` — first experiment.
  Production rollout against the SWE-bench Verified arm64 subset with
  mini_swe_agent step_limit=50, concurrency=16, and a 4 worker × 2 GPU
  × TP=2 DGD on the GB300 customer-cpu fleet. The recipe references
  `../../nemo_gym/grpo_workplace_assistant_nemotron_nano_v2_9b.yaml`
  as its `defaults:` base, and the infra entrypoint runs the recipe
  with only two per-user Hydra injects (`dgd_name`, `frontend_port`).

* `examples/swe_bench/README.md` — folder convention, prereqs (auth,
  HF cache, SIF cache, dataset row count, KAI queue quota, smoke
  cluster down), the one-line run command, and the verify-after-finish
  checklist. Includes guidance for adding new experiments by copying
  an existing folder and editing the three YAMLs in place.

* `examples/swe_bench/__init__.py` + `examples/swe_bench/analysis.py` —
  post-rollout analyzer runnable both as a library function (called
  from the driver's archive step via `from examples.swe_bench.analysis
  import write_summary`) and standalone for offline re-aggregation
  (`python examples/swe_bench/analysis.py /path/to/exp_NNN`). Reads:
    - `exp_dir/trajectory_collection.jsonl` (NeMo-RL row-per-instance)
    - `exp_dir/trajectories/results/verified/<model>/<instance>/*.traj.json`
      and `report_*.json` (mini_swe_agent's per-instance native dump)
    - `exp_dir/qwen*-{before,after}-*.prom` (optional; vllm snapshots
      the entrypoint curls around the rollout)
  Writes:
    - `summary.txt` — five-bucket mutually-exclusive outcome partition
      (resolved / no_report / patch_failed_apply / empty_patch /
      wrong_patch — every instance lands in exactly one), independent
      `exit_status` enum flags (Submitted / LimitsExceeded /
      CollapseContinued / RetryError / etc., harvested from the data
      so new enums show up automatically), `truncation_in_any_response`
      count, per-repo breakdown, token + agent-step aggregates with
      p50/p95 distributions, latency percentiles (min / p10 / p50 / p90
      / p95 / p99 / max / avg), slow-tail (>5 s) request count + excess
      time, and a server-side time breakdown from the vllm prom diff
      (Σ TTFT vs Σ e2e split into prefill / decode, with an
      "implied replicas" multiplier so multi-worker DGD users can
      reconstruct the aggregate from the single-pod snapshot).
    - `tool_call_timings.jsonl` — one row per LLM API call with token
      usage, measured `llm_time_ms` (vllm nvext.timing.total_time_ms),
      derived `tool_time_s` (gap to the next call — agent-side bash +
      parsing), and `est_prefill_ms` / `est_decode_ms` /
      `est_itl_ms_used` columns derived from the run's global ITL
      (vllm responses don't expose per-request TTFT, so this split is
      necessarily an estimate — the `est_` prefix is explicit).

* `examples/nemo_gym/run_dynamo_rollout_only.py` —
  `_dump_swe_archive_layout` decorates exp_NNN/ after the existing
  `_log_trajectory_collection` call so the experiment bundle matches
  the standalone mini_swe_agent archive shape (see e.g.
  `script/dynamo/SWE/results/qwen3-30b-instruct-2507-vllm-v3-*`).
  Three artifacts:
    - `trajectories/` — real dir holding an absolute-path `results`
      symlink to mini_swe_agent's native
      `responses_api_agents/.../results/` tree, so per-instance
      `.traj.json` / `report.json` / `patch.diff` / `run_instance.log`
      / `test_output.txt` are reachable from one place. Symlink (not
      copy) keeps the bundle live and ~free of storage cost; the
      `results/` middle layer matches the standalone archive convention.
    - `manifests/` — copy of the recipe + sibling infra/DGD YAMLs so
      the exp_NNN bundle stays self-describing even if the checked-in
      files drift afterward. Also lets `analysis.py` autodetect the
      model when re-aggregated offline.
    - `summary.txt` + `tool_call_timings.jsonl` via the shared
      `analysis.write_summary`, wrapped in try/except so an analysis
      failure can't crash the driver after the trajectories/ and
      manifests/ steps already succeeded.
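
The three archive steps can be sketched in a few lines. This is an illustration under assumptions, not `_dump_swe_archive_layout` itself: the function name, its parameters, and the `summarize` callback are hypothetical, and symlink support on the filesystem is assumed.

```python
import shutil
from pathlib import Path
from typing import Callable, Iterable


def dump_archive_layout(
    exp_dir: Path,
    native_results: Path,
    manifest_files: Iterable[Path],
    summarize: Callable[[Path], None],
) -> None:
    # 1. trajectories/ is a real directory; results is an absolute symlink.
    traj_dir = exp_dir / "trajectories"
    traj_dir.mkdir(parents=True, exist_ok=True)
    link = traj_dir / "results"
    if not link.is_symlink() and not link.exists():
        link.symlink_to(native_results.resolve(), target_is_directory=True)

    # 2. manifests/ holds copies, so later edits to the sources don't matter.
    manifests = exp_dir / "manifests"
    manifests.mkdir(exist_ok=True)
    for src in manifest_files:
        shutil.copy2(src, manifests / Path(src).name)

    # 3. Analysis is best-effort: a failure must not undo steps 1 and 2.
    try:
        summarize(exp_dir)
    except Exception as exc:
        print(f"analysis failed, archive kept: {exc!r}")
```

Running the analysis last, inside its own `try`, means a parsing bug costs only the summary files and never the trajectory or manifest bundle.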

Validated by re-aggregating jthomson04's 2026-05-19 production archive
(281 instances) — the partition counts (33 resolved + 1 no_report + 1
patch_failed_apply + 91 empty_patch + 155 wrong_patch = 281, four
exit_status enums, 10 truncation instances) match the per-file probe
on the archive — and by running offline against our own PVC exp_003
(16-instance smoke) where the implied-replicas multiplier comes out
to 4.23, close to the actual production 4-worker DGD.
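
A mutually exclusive partition of this kind comes down to an ordered chain of checks. The sketch below is hypothetical: the field names (`resolved`, `has_report`, `patch`, `patch_applied`) and the precedence order are assumptions, not what `analysis.py` does. Only the five bucket names and the 281-instance counts come from the text above.

```python
from collections import Counter

BUCKETS = ("resolved", "no_report", "patch_failed_apply", "empty_patch", "wrong_patch")


def classify(instance: dict) -> str:
    """Return exactly one bucket; the first matching rule wins (assumed order)."""
    if instance.get("resolved"):
        return "resolved"
    if not instance.get("has_report"):
        return "no_report"
    if not instance.get("patch", "").strip():
        return "empty_patch"
    if not instance.get("patch_applied"):
        return "patch_failed_apply"
    return "wrong_patch"


instances = [
    {"resolved": True, "has_report": True, "patch": "diff", "patch_applied": True},
    {"resolved": False, "has_report": False},
    {"resolved": False, "has_report": True, "patch": "  "},
    {"resolved": False, "has_report": True, "patch": "diff", "patch_applied": False},
    {"resolved": False, "has_report": True, "patch": "diff", "patch_applied": True},
]
counts = Counter(classify(i) for i in instances)
assert sum(counts.values()) == len(instances)  # every instance in one bucket

# Sanity check on the partition counts quoted in the commit message.
reported = {"resolved": 33, "no_report": 1, "patch_failed_apply": 1,
            "empty_patch": 91, "wrong_patch": 155}
assert sum(reported.values()) == 281
print(dict(counts))
```

Because `classify` returns on the first match, the buckets cannot overlap, and summing the counts back to the instance total is a cheap check that none were dropped.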

Co-Authored-By: Joyang Yan <joyang@nvidia.com>
Signed-off-by: ruit <ruit@nvidia.com>
@RayenTian RayenTian force-pushed the ruit/joyang/dynamo_rollout_poc branch from b585267 to f9ab8e4 Compare May 26, 2026 09:12
RayenTian added 3 commits May 26, 2026 11:42
Signed-off-by: ruit <ruit@nvidia.com>
Signed-off-by: ruit <ruit@nvidia.com>
Signed-off-by: ruit <ruit@nvidia.com>


1 participant