diff --git a/examples/swe_bench/REPRO_swe2.md b/examples/swe_bench/REPRO_swe2.md
new file mode 100644
index 0000000000..8914567a5c
--- /dev/null
+++ b/examples/swe_bench/REPRO_swe2.md
@@ -0,0 +1,398 @@
+# Reproducing the baseline SWE2 Async-GRPO run
+
+Step-by-step guide to reproduce baseline's successful SWE2 GRPO run
+(wandb `nvidia/binhu-nemo-rl/dc3m70us`) using:
+
+- **Cluster:** `cw-dfw-cs`
+- **Branch:** `ruit/SWE_bench` (repo `github.com/NVIDIA-NeMo/RL`)
+- **Launcher:** `${REPO_ROOT}/examples/swe_bench/run_grpo_repro_baseline_swe2.sh`
+- **Config:** `${REPO_ROOT}/examples/swe_bench/grpo_qwen3_30b_async_swe.yaml` (passed to the launcher via `--config`)
+
+The goal of this run is to confirm that the earlier *zero-reward* failure was
+caused by the **container / vLLM** (a broken hermes tool parser that prevented
+the agent from emitting real tool calls), **not** by the model or the config.
+A correct repro resolves ~8% of SWE-bench instances starting from step 1.
+
+> Run this on **`cw-dfw-cs`**. **Do not run from anyone else's checkout** — clone
+> the repo into your own workspace (§2.1). `REPO_ROOT` below means *your* clone;
+> the launcher auto-detects it from its own location. The model / data / container
+> paths are absolute and world-readable on the `cw-dfw-cs` Lustre.
+
+---
+
+## 1. What this run is
+
+| Item | Value |
+|------|-------|
+| Algorithm | Async GRPO (non-colocated generation) |
+| Model | Qwen3-30B-A3B-Thinking-2507 (MoE, 30B total / 3B active) |
+| Init checkpoint | SWE1 `step_230_hf` (the exact checkpoint dc3m70us trained from) |
+| Train data | R2E-Gym subset (`swe_all_datasets_train_w_agent_ref_r2e_gym_subset.jsonl`) |
+| Eval data | same JSONL (val == train path here) |
+| Env | `swe_agents` (OpenHands agent inside an apptainer/singularity sandbox) |
+| Entry point | `${REPO_ROOT}/examples/nemo_gym/run_grpo_nemo_gym.py` |
+| Scheduler | SLURM (`sbatch` + `ray.sub`) |
+
+---
+
+## 2. Prerequisites
+
+### 2.1 Get the code (clone into your own workspace)
+
+On `cw-dfw-cs`, clone the repo into a directory you own and check out the
+`ruit/SWE_bench` branch. Do **not** run from someone else's checkout.
+
+```bash
+cd /lustre/
+git clone https://github.com/NVIDIA-NeMo/RL.git
+cd RL
+git checkout ruit/SWE_bench
+git submodule update --init --recursive # needed for the Gym mount (3rdparty/Gym-workspace/Gym)
+
+export REPO_ROOT="$PWD" # = your clone; the launcher also auto-detects this
+```
+
+> The launcher runs the code **in place** from your clone (`SNAPSHOT_DIR ==
+> REPO_ROOT`). Whatever is in the working tree at submit time is what runs, so
+> avoid stray local edits.
+
+### 2.2 Container
+
+Uses the **SWE training container** (`ruit-swe_bench`, with mcore + apptainer baked
+in), NOT the default NeMo-RL image. Its vLLM has the working hermes tool parser, so
+the agent emits real `function_call` items (a broken parser was the original
+zero-reward failure mode):
+
+```
+/lustre/fsw/portfolios/coreai/users/ruit/enroot-images/docker_images:ruit-swe_bench-6de99f772-x86_64-060326-mcore-apptainer.squashfs
+```
+
+It is wired in via the `CONTAINER` env var (overridable). The job mounts:
+
+```
+/lustre -> /lustre
+${REPO_ROOT} -> (same path; $PWD)
+${REPO_ROOT}/3rdparty/Gym-workspace/Gym -> /opt/nemo-rl/3rdparty/Gym-workspace/Gym
+```
+
+The last mount overlays the in-repo Gym source over the container's Gym so
+your Gym checkout is what runs.
+
+### 2.3 Required files on Lustre
+
+Confirm these absolute paths exist before submitting:
+
+| Path | Purpose |
+|------|---------|
+| `/lustre/fsw/portfolios/coreai/users/bihu/repos/nemo-rl-async-swe/results/qwen3-30b-thinking-swe1-async-age1-pps64-gpp8-gbs512-lr1e-06/step_230_hf` | init checkpoint |
+| `/lustre/fsw/portfolios/llmservice/projects/llmservice_modelalignment_ppo/users/sdevare/repos/nano/dataset/rl/swe_all_datasets_train_w_agent_ref_r2e_gym_subset.jsonl` | train + val data |
+| `${REPO_ROOT}/ray.sub` | SLURM launcher consumed by `sbatch` |
+| `/lustre/fsw/portfolios/coreai/users/ruit/enroot-images/docker_images:ruit-swe_bench-6de99f772-x86_64-060326-mcore-apptainer.squashfs` | training container |
+
+Per-instance SWE-bench `.sif` sandbox images (resolved by `container_formatter`
+in the YAML, first match wins):
+
+```
+/lustre/fsw/portfolios/llmservice/users/igitman/images/swe-bench/swebench_sweb.eval.x86_64.{instance_id}.sif
+/lustre/fsw/portfolios/llmservice/users/sdevare/swe_sweapro/images_train/sweap.{instance_id}.sif
+/lustre/fsw/portfolios/llmservice/users/igitman/images/swe-bench/namanjain12_{instance_id}.sif
+/lustre/fsw/portfolios/llmservice/users/igitman/images/swe-bench/swebench_sweb.eval.x86_64{instance_id}.sif
+```
+
+### 2.4 Tokens / credentials
+
+Credentials were **stripped from this shared copy** of the launcher — it no
+longer sources any env script. Before submitting, export these yourself:
+
+- `HF_HOME` — HuggingFace cache root (passed through to the job; also used to
+  derive `HF_DATASETS_CACHE`)
+- `HF_TOKEN` — required if the model/tokenizer is gated
+- `WANDB_API_KEY` — required for wandb logging (`logger.wandb_enabled=True`)
+- `GITHUB_TOKEN` — only if your data/repo access needs it
+
+### 2.5 Caches (created automatically, listed for reference)
+
+The launcher seeds vLLM/inductor/triton caches from a persistent dir under your
+own `$HOME` (override with the `PERSISTENT_CACHE` env var):
+
+```
+Persistent (default ${HOME}/.cache/qwen3_30b_thinking_swe_repro_baseline):
+  .../vllm_compile_cache
+  .../inductor_cache
+  .../triton_cache
+
+Node-local (/tmp, recreated each run):
+  /tmp/nemo_rl_vllm_cache
+  /tmp/nemo_rl_inductor_cache
+  /tmp/nemo_rl_triton_cache
+  /tmp/uv_cache
+```
+
+---
+
+## 3. Key configuration (what gets reproduced)
+
+The launcher overrides the YAML on the command line. The values that define
+the run:
+
+**Cluster / parallelism**
+- `NUM_NODES=16` actor nodes in total, `8` of them generation nodes (async, non-colocated), 8 GPUs/node
+- `TP=4`, `EP=8`, `CP=4`, `PP=2`, `vLLM_TP=2`
+- `make_sequence_length_divisible_by = 32` (auto: `CP*2*TP = 4*2*4`)
+
+**GRPO / sampling**
+- `num_prompts_per_step=8`, `num_generations_per_prompt=8`, `train_global_batch_size=64`
+- `normalize_rewards=True`, `overlong_filtering=True`
+- Async: `max_trajectory_age_steps=1`, `in_flight_weight_updates=True`,
+  `recompute_kv_cache_after_weight_updates=False`, `force_on_policy_ratio=True`
+- `advantage_clip=[-100, 100]`, `truncated_importance_sampling_ratio=5`
+
+**Loss**
+- `reference_policy_kl_penalty=0` (no KL), `ratio_clip=[0.2, 0.28]`
+- `token_level_loss=True`, `use_importance_sampling_correction=True`,
+  `sequence_level_importance_ratios=False`
+
+**Optimizer / model**
+- `lr=1e-06` (constant), `weight_decay=0`
+- `max_total_sequence_length=131072`, sequence packing on
+- MoE: router frozen, `moe_aux_loss_coeff=0`, `alltoall` dispatcher, deepep off
+
+**SWE agent**
+- `agent_max_turns=200`, `swebench_agent_timeout=1800`
+
+**Logging**
+- wandb project `swe-benchmark`, full Gym responses logged
+  (`should_log_nemo_gym_responses=true`) so you can verify `function_call`
+  items actually appear.
+
+---
+
+## 4. Command flavor (why it differs from the default)
+
+The training command is **baseline-style**, which is what makes the container work:
+
+- `uv run --frozen --extra mcore` (frozen lockfile)
+- `NRL_IGNORE_VERSION_MISMATCH=1` — tolerate the container's vLLM version
+- `NEMO_GYM_SKIP_VENV_IF_PRESENT=1` — reuse the container's Gym venv, don't rebuild
+- `NRL_FORCE_REBUILD_VENVS=false`, `RAY_ENABLE_UV_RUN_RUNTIME_ENV=0`
+- vLLM caches seeded from a persistent Lustre cache, then synced back
+
+The `SETUP_COMMAND` (run once per node before training) installs
+apptainer/singularity (for the SWE sandbox), clears + seeds the inductor/triton
+caches from Lustre, and runs `uv sync --frozen --extra mcore`.
+
+---
+
+## 5. Step-by-step
+
+```bash
+# 1. Clone into your own workspace on cw-dfw-cs and check out the branch
+cd /lustre/
+git clone https://github.com/NVIDIA-NeMo/RL.git
+cd RL
+git checkout ruit/SWE_bench
+git submodule update --init --recursive # Gym mount (3rdparty/Gym-workspace/Gym)
+export REPO_ROOT="$PWD"
+
+# 2. Export your credentials (stripped from this copy — see §2.4)
+export HF_HOME=/your/hf/home
+export HF_TOKEN=... # if the model/tokenizer is gated
+export WANDB_API_KEY=... # for wandb logging
+export GITHUB_TOKEN=... # only if your data/repo access needs it
+
+# 3. (Optional) sanity-check the shared assets exist (readable on cw-dfw-cs)
+ls "/lustre/fsw/portfolios/coreai/users/ruit/enroot-images/docker_images:ruit-swe_bench-6de99f772-x86_64-060326-mcore-apptainer.squashfs"
+ls -d "/lustre/fsw/portfolios/coreai/users/bihu/repos/nemo-rl-async-swe/results/qwen3-30b-thinking-swe1-async-age1-pps64-gpp8-gbs512-lr1e-06/step_230_hf"
+ls "/lustre/fsw/portfolios/llmservice/projects/llmservice_modelalignment_ppo/users/sdevare/repos/nano/dataset/rl/swe_all_datasets_train_w_agent_ref_r2e_gym_subset.jsonl"
+
+# 4. Submit from your clone. Defaults reproduce dc3m70us; no other args needed.
+bash "${REPO_ROOT}/examples/swe_bench/run_grpo_repro_baseline_swe2.sh"
+```
+
+The script prints a summary, submits via `sbatch`, and writes the job id to
+`${REPO_ROOT}/latest_repro_baseline_job_id.txt`.
+
+### Overridable knobs (env vars)
+
+| Var | Default | Effect |
+|-----|---------|--------|
+| `MODEL_PATH` (also `$1`) | `/lustre/fsw/portfolios/coreai/users/bihu/repos/nemo-rl-async-swe/results/qwen3-30b-thinking-swe1-async-age1-pps64-gpp8-gbs512-lr1e-06/step_230_hf` | init checkpoint |
+| `CONTAINER` | `/lustre/fsw/portfolios/coreai/users/ruit/enroot-images/docker_images:ruit-swe_bench-6de99f772-x86_64-060326-mcore-apptainer.squashfs` | training image |
+| `NUM_NODES` | 16 | actor nodes |
+| `NUM_GEN_NODES` | 8 | generation nodes (async only) |
+| `SKIP_TRAINING` | `0` | `1` = generation-only benchmark: no-op training pinned to 1 node (see §9) |
+| `EXP_SUFFIX` | `repro-baseline-swe2-async-age1-pps8-gpp8-gbs64-lr1e-06-tp4` | run + checkpoint dir name (`notrain-` is inserted when `SKIP_TRAINING=1`) |
+| `BASE_LOG_DIR` | `${REPO_ROOT}/logs/slurm` | SLURM/Ray logs |
+
+Example — different init checkpoint, smaller cluster:
+
+```bash
+NUM_NODES=8 NUM_GEN_NODES=4 \
+  bash ${REPO_ROOT}/examples/swe_bench/run_grpo_repro_baseline_swe2.sh \
+  /path/to/other/step_X_hf
+```
+
+---
+
+## 6. Monitoring
+
+```bash
+JOB_ID=$(cat ${REPO_ROOT}/latest_repro_baseline_job_id.txt)
+
+squeue -j "$JOB_ID" # queue state
+ls ${REPO_ROOT}/logs/slurm/${JOB_ID}-logs/ # Ray + SLURM logs
+tail -f ${REPO_ROOT}/logs/slurm/slurm-${JOB_ID}.out # driver output
+```
+
+- **wandb:** project `swe-benchmark`, run name = `EXP_SUFFIX`.
+- **Checkpoints:**
+  `${REPO_ROOT}/results/${EXP_SUFFIX}/`
+  (save every 5 steps, keep top 2 by `train:total_reward/mean`).
+
+### What "success" looks like
+- `train:total_reward/mean` is **non-zero from step ~1** (the failure mode was
+  identically zero reward).
+- Logged Gym responses contain real `function_call` items (proves the hermes
+  tool parser is working in this container).
+- Resolved rate climbs toward ~8%.
+
+---
+
+## 7. Troubleshooting
+
+| Symptom | Likely cause | Fix |
+|---------|-------------|-----|
+| Reward is identically 0 | wrong container — hermes tool parser broken, no tool calls | confirm `CONTAINER` is the `ruit-swe_bench` squashfs, not the default image |
+| `version mismatch` abort | strict version check | ensure `NRL_IGNORE_VERSION_MISMATCH=1` is in the command (it is, by default) |
+| Gym venv rebuild / slowness | venv rebuilt instead of reused | confirm `NEMO_GYM_SKIP_VENV_IF_PRESENT=1` and the Gym mount are present |
+| Agent can't start sandbox | apptainer/singularity missing or `.sif` images missing | check `SETUP_COMMAND` apptainer install succeeded; verify `container_formatter` paths in the YAML |
+| Token / auth errors | credentials not exported (this copy ships none) | `export HF_HOME`/`HF_TOKEN`/`WANDB_API_KEY`/`GITHUB_TOKEN` before submitting (see §2.4) |
+| OOM / parallelism mismatch | changed `TP`/`EP`/`CP`/`PP` without re-deriving `make_sequence_length_divisible_by` | keep the default parallelism, or recompute `MIN_PAD = CP*2 * TP` |
+
+---
+
+## 8. Reference — exact pinned values
+
+```
+Code: NeMo-RL @ branch ruit/SWE_bench (run in place from your clone)
+Compute: cw-dfw-cs (SLURM)
+Repo: github.com/NVIDIA-NeMo/RL @ branch ruit/SWE_bench
+REPO_ROOT: your clone (export REPO_ROOT=; launcher also auto-detects it)
+Container: /lustre/fsw/portfolios/coreai/users/ruit/enroot-images/docker_images:ruit-swe_bench-6de99f772-x86_64-060326-mcore-apptainer.squashfs
+Init model: /lustre/fsw/portfolios/coreai/users/bihu/repos/nemo-rl-async-swe/results/qwen3-30b-thinking-swe1-async-age1-pps64-gpp8-gbs512-lr1e-06/step_230_hf
+Train data: /lustre/fsw/portfolios/llmservice/projects/llmservice_modelalignment_ppo/users/sdevare/repos/nano/dataset/rl/swe_all_datasets_train_w_agent_ref_r2e_gym_subset.jsonl
+Config: ${REPO_ROOT}/examples/swe_bench/grpo_qwen3_30b_async_swe.yaml
+Launcher: ${REPO_ROOT}/examples/swe_bench/run_grpo_repro_baseline_swe2.sh
+Mode: async-age1, colocated=False
+Resources: 16 actor nodes total (8 train + 8 gen), 8 GPU/node
+Parallelism: TP=4, EP=8, CP=4, PP=2, vLLM_TP=2, pad=32
+Training: PPS=8, GPP=8, GBS=64, LR=1e-06
+Loss: KL=0, clip=[0.2,0.28], token-level, IS correction on, TIS=5
+Agent: max_turns=200, timeout=1800s
+wandb: project=swe-benchmark
+Baseline: nvidia/binhu-nemo-rl/dc3m70us (~8% resolved from step 1)
+```
+
+---
+
+## 9. Generation-only benchmark (skip training)
+
+For **benchmarking generation throughput / scaling** without paying for real
+training, the launcher has a no-op-training mode, gated by the
+`grpo.gen_benchmark_skip_training` flag (added on `ruit/SWE_bench`). Set
+`SKIP_TRAINING=1`:
+
+```bash
+SKIP_TRAINING=1 bash "${REPO_ROOT}/examples/swe_bench/run_grpo_repro_baseline_swe2.sh"
+```
+
+### What it does
+- **`policy.train()` becomes a no-op** — no forward/backward, no optimizer step. The
+  weights stay frozen at the init checkpoint and are **still refit to vLLM every
+  step**, so the async generation / weight-sync cadence stays realistic.
+- **No optimizer is built** (`init_optimizer=False`) — saves memory and startup time.
+- A tiny **keep-alive matmul daemon** runs on each training worker so the cluster's
+  idle-GPU reaper doesn't kill the (otherwise idle) training node.
+- **Checkpoint saving is disabled** (`checkpointing.enabled=false`) — there is no
+  optimizer/training state to save.
+
+### What the launcher changes automatically when `SKIP_TRAINING=1`
+- Training parallelism → **`TP=8, EP=8, CP=1, PP=1`** (model-parallel = 8, fits one
+  node; `train_DP=1`), so training is pinned to a **single node**.
+- `NUM_ACTOR_NODES = NUM_GEN_NODES + 1` → total nodes = `gen + 1` (default `8 + 1 = 9`;
+  8 generation nodes = 32 vLLM replicas at `vLLM_TP=2`).
+- Appends `++grpo.gen_benchmark_skip_training=true checkpointing.enabled=false`.
+- `EXP_SUFFIX` gets a `notrain-` tag.
+
+Everything else (model, data, `PPS=8/GPP=8/GBS=64`, agent settings, container) is
+unchanged, so the per-replica generation workload (`samples/replica = GBS / replicas
+= 64 / 32 = 2`) matches the full run.
+
+### How to verify the scaling is sound (wandb)
+Compare runs at different generation sizes (vary `NUM_GEN_NODES`) within one wandb
+group. The **per-replica** `generation_metrics/*` timelines should stay **flat**
+(invariant) as you add replicas — not grow with scale:
+
+| metric | expectation across scale |
+|--------|--------------------------|
+| `generation_metrics/*inflight_batch_sizes` | flat, low (≈1–3 per replica) |
+| `generation_metrics/*num_pending_samples` | ≈ 0 (no queue backlog) |
+| `generation_metrics/*kv_cache_usage_perc` | flat (≈8–10%) |
+| `generation_metrics/*generation_tokens` | flat per replica per window |
+| worker-trace count | equals the replica count (`gen_gpus / vLLM_TP`) |
+
+> Note: SWE rollouts are **agent / tool-execution-bound** (each sample is a multi-turn
+> OpenHands rollout in an apptainer sandbox), so per-replica inflight/KV stay low and
+> total throughput scales sub-linearly with GPUs — that is expected, not a regression.
+> Weights are frozen, so reward hovers around the init checkpoint's baseline (noisy on
+> small per-step sample counts); this mode is for **throughput/scaling**, not learning.
+
+---
+
+## 10. Generation-scaling sweep launcher (`run_grpo_swe2_scale_gen.sh`)
+
+For sweeping the number of vLLM generation replicas, use the second launcher:
+`${REPO_ROOT}/examples/swe_bench/run_grpo_swe2_scale_gen.sh`. It takes a **single
+knob — `NUM_VLLM_REPLICAS` (R)** — and auto-derives nodes / `num_prompts_per_step` /
+`train_global_batch_size` so the **per-replica generation workload stays constant**
+(`samples/replica/step = 2`) across scales. Same model / data / config / container
+as the baseline run.
+
+```bash
+# preview the derived config without submitting
+NUM_VLLM_REPLICAS=32 DRY_RUN=1 bash "${REPO_ROOT}/examples/swe_bench/run_grpo_swe2_scale_gen.sh"
+
+# a sweep, all in one wandb group for comparison
+for R in 16 32 64; do
+  NUM_VLLM_REPLICAS=$R WANDB_GROUP=swe-gen-scale-sweep \
+    bash "${REPO_ROOT}/examples/swe_bench/run_grpo_swe2_scale_gen.sh"
+done
+```
+
+Derivation (with `GPP=8`, `vLLM_TP=2` → 4 replicas/node):
+
+| mode | `R` constraint | GEN nodes | TRAIN nodes | total | PPS | GBS | train parallelism |
+|------|----------------|-----------|-------------|-------|-----|-----|-------------------|
+| **linear** (default) | multiple of **16** | `R/4` | `R/4` (1:1) | `R/2` | `R/4` | `2R` | TP=4,EP=8,CP=4,PP=2 |
+| **skip-train** (`SKIP_TRAINING=1`) | multiple of **4** | `R/4` | **1** | `R/4 + 1` | `R/4` | `2R` | TP=8,EP=8,CP=1,PP=1 |
+
+`R=32` (linear) reproduces the baseline shape exactly (16 nodes = 8 train + 8 gen,
+PPS=8, GBS=64). The `R%16` requirement in linear mode comes from training scaling
+linearly at TP×CP×PP=32 (train world `2R` must be divisible by 32); `SKIP_TRAINING=1`
+pins training to one node (model-parallel 8) so `R` need only be a multiple of 4 —
+enabling small scales like R=4 (2 nodes) / R=8 (3 nodes). See §9 for the no-op-train
+semantics, and §9's wandb table for what to verify across the sweep.
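The derivation table above can be checked by hand with a few lines of shell arithmetic. The sketch below is not part of the launcher: `derive_scale` is a hypothetical helper that only re-applies the table's formulas, assuming `GPP=8`, `vLLM_TP=2` and 8 GPUs per node (4 replicas per node), so treat it as a sanity check rather than the launcher's actual logic.

```shell
#!/usr/bin/env bash
# Hypothetical helper (NOT in the repo): re-derives the table's columns from R.
# Assumptions: GPP=8, vLLM_TP=2, 8 GPUs/node, hence 4 vLLM replicas per node.
derive_scale() {
  local r=$1 skip=${2:-0}
  local replicas_per_node=4
  local multiple=16                      # linear mode: R must be a multiple of 16
  if [ "$skip" = "1" ]; then
    multiple=4                           # skip-train mode: multiple of 4 is enough
  fi
  if [ $(( r % multiple )) -ne 0 ]; then
    echo "R=$r must be a multiple of $multiple" >&2
    return 1
  fi
  local gen=$(( r / replicas_per_node ))
  local train=$gen                       # linear mode: train nodes 1:1 with gen nodes
  if [ "$skip" = "1" ]; then
    train=1                              # skip-train: training pinned to one node
  fi
  local pps=$(( r / 4 ))                 # num_prompts_per_step
  local gbs=$(( 2 * r ))                 # train_global_batch_size = PPS * GPP
  echo "gen=$gen train=$train total=$(( gen + train )) pps=$pps gbs=$gbs per_replica=$(( gbs / r ))"
}

derive_scale 32     # -> gen=8 train=8 total=16 pps=8 gbs=64 per_replica=2
derive_scale 8 1    # -> gen=2 train=1 total=3 pps=2 gbs=16 per_replica=2
```

In both modes `per_replica` stays at 2, which is the invariant the sweep relies on. To see what the launcher itself derives, use `DRY_RUN=1`.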
+
+### Knobs (env vars)
+
+| Var | Default | Effect |
+|-----|---------|--------|
+| `NUM_VLLM_REPLICAS` | *(required)* | number of vLLM replicas (R) |
+| `SKIP_TRAINING` | `0` | `1` = no-op training on 1 node (R%4); else linear-train (R%16) |
+| `TRAIN_NODES` | derived | override training node count |
+| `WANDB_GROUP` | `swe-gen-scale-linear` | wandb group (use one per sweep) |
+| `MAX_NUM_STEPS` | *(unset)* | cap training steps (handy for a quick smoke) |
+| `SBATCH_TIME` | `4:0:0` | SLURM walltime |
+| `DRY_RUN` | `0` | `1` = print the derived config and exit (no `sbatch`) |
+
+Job id is written to `${REPO_ROOT}/latest_scale_gen_job_id.txt`.
diff --git a/examples/swe_bench/REPRO_swe2_sglang.md b/examples/swe_bench/REPRO_swe2_sglang.md
new file mode 100644
index 0000000000..4e922d9797
--- /dev/null
+++ b/examples/swe_bench/REPRO_swe2_sglang.md
@@ -0,0 +1,232 @@
+# Reproducing SWE2 Async-GRPO on **SGLang** (Qwen3-30B-A3B-Thinking)
+
+Self-contained guide to run the multi-turn SWE-bench agentic GRPO recipe with the
+**SGLang** generation backend, at parity with the vLLM baseline (rollout completeness +
+throughput) and at **training-grade per-token logprob parity** with vLLM.
+
+This is the SGLang counterpart of [`REPRO_swe2.md`](./REPRO_swe2.md) (the vLLM baseline).
+Everything needed lives in **one clone** — you do **not** need to fetch any other PR.
+
+---
+
+## 0. TL;DR
+
+```bash
+# One clone has everything (RL + splice-fixed Gym + grafted SGLang backend).
+git clone --recurse-submodules -b swe2-qwen-sglang git@github.com:Kh4L/NemoRL.git
+cd NemoRL
+export REPO_ROOT="$PWD"
+
+# Credentials (not shipped) — see §3.
+export HF_HOME=/your/hf/home HF_TOKEN=... WANDB_API_KEY=...
+
+# 2-node smoke / parity (generation-only, no training):
+SKIP_TRAINING=1 NUM_VLLM_REPLICAS=4 BACKEND=sglang \
+  bash examples/swe_bench/run_grpo_swe2_scale_gen.sh
+
+# Full convergence run on SGLang (16 nodes = 8 train + 8 gen, reproduces the baseline shape):
+NUM_VLLM_REPLICAS=32 BACKEND=sglang \
+  bash examples/swe_bench/run_grpo_swe2_scale_gen.sh
+```
+
+Expected: multi-turn rollouts complete (**8/8, contiguity failures 0**), **~193 gen tok/s**
+with full CUDA graph (≈ vLLM), and SGLang↔vLLM per-token logprobs agree to within the model's
+own bf16/MoE numerical noise floor (see §6).
+
+---
+
+## 1. What this is
+
+| Item | Value |
+|------|-------|
+| Algorithm | Async GRPO (non-colocated generation) |
+| Model | Qwen3-30B-A3B-Thinking-2507 (MoE, 30B total / ~3B active) |
+| Generation backend | **SGLang** (vLLM is the baseline; this recipe swaps it for SGLang) |
+| Init checkpoint | SWE1 `step_230_hf` (same as the vLLM baseline `dc3m70us`) |
+| Env | `swe_agents` (OpenHands agent inside an apptainer/singularity sandbox) |
+| Launcher | `examples/swe_bench/run_grpo_swe2_scale_gen.sh` (`BACKEND=sglang`) |
+| Scheduler | SLURM (`sbatch` + `ray.sub`), enroot+pyxis container runtime |
+
+The goal: make **SGLang** a usable generation backend for this multi-turn SWE-bench recipe,
+proven equivalent to vLLM at (a) rollout completeness, (b) throughput, and (c) per-token
+logprobs (which feed GRPO importance ratios).
+
+---
+
+## 2. Provenance — what's grafted vs. what's new
+
+This branch set is a **graft**, assembled so a single clone is runnable end-to-end:
+
+- **Base:** NeMo-RL `main` (already carries the *basic* SGLang backend).
+- **Grafted from [NVIDIA-NeMo/RL#2447](https://github.com/NVIDIA-NeMo/RL/pull/2447)
+  (`zhw/mxfp8_support`):** the *enhanced* SGLang backend our 30B-MoE recipe needs —
+  Megatron→SGLang weight-refit (`megatron_sglang_weight_iterator.py`), non-colocated
+  weight update, router, fault-tolerance, and the heavier `sglang_worker` / `sglang_generation`.
+  #2447 is an open, evolving PR; this branch **pins a known-good graft of it** so you don't
+  have to track #2447 yourself.
+- **New in this branch set (the genuinely novel work):**
+  1. **★ Gym-proxy token-splicing contiguity fix** (the load-bearing piece, in the Gym fork
+     `responses_api_models/vllm_model/app.py`). Multi-turn SGLang rollouts broke a hard
+     prefix-stability assert in `nemo_gym.py` on ~every tool-using turn (48/48 failures).
+     Fix: build each turn's prompt as `prompt_{K-1} + gen_{K-1}(verbatim) + delta_K`,
+     splicing the prior assistant's **exact sampled token IDs** instead of re-tokenizing
+     (`_build_sglang_prompt_ids`, `_update_sglang_session_seq`, `_sglang_followup_fragment_ids`).
+     Also: SGLang native `/generate` with `return_logprob=True`, and `skip_special_tokens=False`
+     so `</think>` (id 151668) survives the multi-turn re-feed.
+  2. **SWE2 SGLang launcher path** — `BACKEND=sglang` in `run_grpo_swe2_scale_gen.sh`.
+  3. **CUDA-graph perf** — full CUDA graph ON (piecewise off; it crashes on torch-2.10/sglang)
+     → **51 → ~193 tok/s, ≈ vLLM**.
+  4. **Refit OOM / NCCL-deadlock mitigations** — `mem_fraction_static=0.55`,
+     `NRL_REFIT_BUFFER_MEMORY_RATIO=0.018`, `pause_generation_mode=retract`.
+- **Parity instrumentation** (sentinel-gated, harmless when off): in-proxy + in-worker
+  teacher-force hooks used to *prove* logprob parity (§6). Not needed for training.
+
+Pinned SHAs: **RL `c88030f`**, **Gym `50586ec`** (auto-resolved as the submodule).
+
+---
+
+## 3. Prerequisites
+
+### 3.1 Cluster / runtime
+A SLURM cluster with **enroot + pyxis** (so `ray.sub` runs `srun --container-image` natively).
+Validated on CW-DFW (`cw-dfw-cs-001`, H100). 2 nodes for the smoke/parity run; 16 nodes for
+the full convergence run.
+
+### 3.2 Container (SWE training image — has the working hermes tool parser)
+```
+/lustre/fsw/portfolios/coreai/users/ruit/enroot-images/docker_images:ruit-swe_bench-6de99f772-x86_64-060326-mcore-apptainer.squashfs
+```
+Wired via `CONTAINER` (overridable). It bakes mcore + apptainer; the launcher overlays your
+clone's `nemo_rl/` and `3rdparty/Gym-workspace/Gym` over the baked copies, so **your checkout
+is what runs**.
+
+### 3.3 Shared assets (absolute, world-readable on CW-DFW Lustre)
+| Path | Purpose |
+|------|---------|
+| `…/bihu/repos/nemo-rl-async-swe/results/qwen3-30b-thinking-swe1-async-age1-…/step_230_hf` | init checkpoint |
+| `…/sdevare/repos/nano/dataset/rl/swe_all_datasets_train_w_agent_ref_r2e_gym_subset.jsonl` | train + val data |
+| `…/spanev/swe2-repro/qwen3_swe_chat_template.jinja` | SGLang chat template (Qwen3 thinking) |
+| per-instance `swebench_sweb.eval.x86_64.{instance_id}.sif` | SWE-bench sandbox images (resolved by `container_formatter` in the YAML) |
+
+The exact default paths are in the launcher (`MODEL_PATH`, `TRAIN_DATA_PATH`, `SGLANG_CHAT_TEMPLATE`);
+override via env if your copies live elsewhere.
+
+### 3.4 Credentials (export yourself — not shipped)
+`HF_HOME`, `HF_TOKEN` (gated model), `WANDB_API_KEY` (or `WANDB_MODE=offline`),
+optionally `GITHUB_TOKEN` / `GITLAB_TOKEN`.
+
+---
+
+## 4. How to run
+
+The single launcher is **`examples/swe_bench/run_grpo_swe2_scale_gen.sh`**. One knob,
+`NUM_VLLM_REPLICAS` (R), derives nodes / batch sizes so per-replica work is constant.
+
+| Mode | Command | Footprint |
+|------|---------|-----------|
+| **Smoke / parity** (gen-only, no train) | `SKIP_TRAINING=1 NUM_VLLM_REPLICAS=4 BACKEND=sglang bash …/run_grpo_swe2_scale_gen.sh` | 2 nodes (1 gen + 1 train no-op) |
+| **Full convergence** (reproduces baseline shape) | `NUM_VLLM_REPLICAS=32 BACKEND=sglang bash …/run_grpo_swe2_scale_gen.sh` | 16 nodes (8 train + 8 gen) |
+| **Preview only** | add `DRY_RUN=1` | none (prints derived config) |
+
+Job id is written to `${REPO_ROOT}/latest_scale_gen_job_id.txt`. Logs under
+`${REPO_ROOT}/logs/swe_bench_scale/`. wandb project `swe-benchmark`.
+
+### SGLang-specific env toggles
+| Var | Default | Effect |
+|-----|---------|--------|
+| `BACKEND` | `vllm` | set `sglang` to use the SGLang path |
+| `SGLANG_DISABLE_CUDA_GRAPH` | `false` | `true` disables full CUDA graph (slower; ~51 tok/s) |
+| `SGLANG_CHAT_TEMPLATE` | `…/qwen3_swe_chat_template.jinja` | Qwen3-thinking chat template |
+| `TEMPERATURE` | `1.0` | sampling temperature (recipe trains at 1.0) |
+| `UV_CACHE_DIR` | `/tmp/uv_cache` | set to **`/root/.cache/uv`** to reuse the container's prebuilt SGLang wheels and skip a ~40-min build |
+| `SBATCH_ACCOUNT` / `SBATCH_PARTITION` | `nemotron_agents_dev` / `backfill` | SLURM account / partition |
+
+### What `BACKEND=sglang` injects (the generation overrides)
+bf16; `dp=ep=pp=1` with TP via `num_gpus_per_engine=8`; `mem_fraction_static=0.55`;
+`disable_piecewise_cuda_graph=true` + `disable_cuda_graph=${SGLANG_DISABLE_CUDA_GRAPH}`;
+`tool_call_parser=hermes` + `reasoning_parser=qwen3-thinking`; the chat template;
+`pause_generation_mode=retract`; router disabled; and the Gym proxy switched to the SGLang
+engine path (`…vllm_model.engine=sglang`).
+
+---
+
+## 5. Expected results (port parity)
+
+| Run | CUDA graph | gen tok/s | rollout | contiguity_fail |
+|---|---|---|---|---|
+| SGLang | OFF | ~51 | 30:29 | **0** (8/8) |
+| **SGLang** | **ON** (default) | **~193** | **13:34** | **0** (8/8) |
+| vLLM baseline | default | — | 10–16 min | 0 |
+
+Success markers (same as the vLLM baseline): non-zero `train:total_reward/mean` from step ~1,
+logged Gym responses contain real `function_call` items, resolved rate climbs toward ~8%.
+
+---
+
+## 6. (Optional) Reproduce the SGLang↔vLLM logprob parity
+
+The parity hooks are **committed and sentinel-gated** (no effect unless triggered), so a clean
+clone can regenerate the parity numbers. They teacher-force both engines on identical token IDs
+and compare per-token logprobs. Shared dir + scripts live under
+`/lustre/…/spanev/swe2-repro/parity/` (capture `rollouts.jsonl` + `compare_forced.py`).
+
+```bash
+P=/lustre/.../swe2-repro/parity # capture + scripts + sentinels
+# Input: 52 recs / 27,493 tokens, derived deterministically from the 864-record capture
+# (filter gen>0 & prompt+gen<=24k, shortest-first) -> rollouts_filtered16.jsonl
+
+# SGLang side: in-proxy /generate teacher-force
+touch $P/SGLANG_TF_TRIGGER
+SKIP_TRAINING=1 NUM_VLLM_REPLICAS=4 BACKEND=sglang MAX_NUM_STEPS=1 \
+  bash examples/swe_bench/run_grpo_swe2_scale_gen.sh # -> forced_sglang.jsonl, SGLANG_TF_DONE
+
+# vLLM side: in-worker engine teacher-force (fires post-refit on weight update)
+touch $P/VLLM_ENGINE_TF_TRIGGER
+SKIP_TRAINING=1 NUM_VLLM_REPLICAS=4 BACKEND=vllm MAX_NUM_STEPS=1 \
+  bash examples/swe_bench/run_grpo_swe2_scale_gen.sh # -> forced_vllm.jsonl, VLLM_ENGINE_TF_DONE
+
+python3 $P/compare_forced.py --sglang $P/forced_sglang.jsonl --vllm $P/forced_vllm.jsonl
+```
+
+**Result (teacher-forced, all 27,493 tokens, real post-refit weights):**
+
+| metric | value | read |
+|---|---|---|
+| median \|Δ logprob\| | **1.38e-3** | training-grade at the typical token |
+| p95 / p99 / max | 0.140 / 0.245 / 1.01 | real tail at high-entropy tokens |
+| top-K KL median | 1.75e-4 | negligible |
+| confident-token bucket [0,0.3) nats median | **3.6e-7** | engines essentially identical where confident |
+| within-engine baseline (SGLang sampled-vs-forced) median | **1.24e-3** | the model's own bf16/MoE noise floor |
+
+**Verdict:** cross-engine median (1.38e-3) ≈ within-engine noise floor (1.24e-3) — **vLLM differs
+from SGLang no more than SGLang differs from itself.** Swapping vLLM→SGLang is safe at the
+logprob level. (Re-validated from a fresh clone on 2026-06-25; numbers match to within run-to-run
+bf16/MoE noise.)
+
+Notes: vLLM's recipe server is chat-only (no `/v1/completions`), so its teacher-force runs
+**in-process inside the worker** (`vllm_worker_async.py`), fired **post-refit** (the engine boots
+with dummy weights and gets the real checkpoint via refit). SGLang's `/generate` supports
+`logprob_start_len`, so its side runs in the Gym proxy.
+
+---
+
+## 7. Gotchas (already handled, listed so you don't re-hit them)
+
+- **Multi-turn contiguity** — solved by the token-splice fix; do not "fix" it by re-tokenizing.
+- **CUDA graph** — full graph works and is ~2× faster; only *piecewise* crashes (kept off).
+- **Refit OOM / NCCL hang** — needs `mem_fraction_static=0.55` + refit bucket cap +
+  `pause_generation_mode=retract` (all baked into the launcher).
+- **Slow first run** — set `UV_CACHE_DIR=/root/.cache/uv` to reuse baked SGLang wheels.
+- **Recipe-managed engines are unreachable from the host** (pyxis container netns) — interact
+  only via the Gym proxy or in-worker hooks (which is what the parity instrumentation does).
+
+---
+
+## 8. Where things live
+| Thing | Location |
+|---|---|
+| This recipe (RL + launcher + parity hooks) | `Kh4L/NemoRL@swe2-qwen-sglang` (`c88030f`) |
+| Splice-fixed + parity-hooked Gym | `Kh4L/NemoGym@swe2-sglang-graft` (`50586ec`, submodule) |
+| vLLM baseline guide | [`REPRO_swe2.md`](./REPRO_swe2.md) |
+| Parity capture + scripts + results | DFW `/lustre/.../spanev/swe2-repro/parity/` |
+| Upstream home of the enhanced SGLang backend | [NVIDIA-NeMo/RL#2447](https://github.com/NVIDIA-NeMo/RL/pull/2447) |
diff --git a/examples/swe_bench/grpo_qwen3_30b_async_swe.yaml b/examples/swe_bench/grpo_qwen3_30b_async_swe.yaml
new file mode 100644
index 0000000000..3859b3083e
--- /dev/null
+++ b/examples/swe_bench/grpo_qwen3_30b_async_swe.yaml
@@ -0,0 +1,424 @@
+# ============================================================================
+# Async GRPO SWE RL Training: Qwen3-30B-A3B-Thinking-2507
+#
+# Model: Qwen3-30B-A3B-Thinking-2507 (MoE, 30B total / 3B active, thinking)
+# Train data: R2E-Gym (r2e-gym subset, 4518 samples)
+# Eval data: SWE-bench Verified
+# Mode: Async GRPO with non-colocated generation
+# Entry: examples/nemo_gym/run_grpo_nemo_gym.py
+# Env: swe_agents (OpenHands agent, singularity sandbox)
+#
+# Based on: baseline/nemo-rl-qwen-swe/grpo_qwen3_30b_thinking_swe.yaml
+# Gym: main branch (nemo-rl-async-swe repo)
+# ============================================================================
+
+checkpointing:
+  enabled: true
+  checkpoint_dir: "results/grpo-qwen3-30b-thinking-swe-rl"
+  metric_name: "train:total_reward/mean"
+  higher_is_better: true
+  keep_top_k: 100
+  save_period: 5
+  checkpoint_must_save_by: "00:03:35:00"
+  model_save_format: "safetensors"
+  save_consolidated: false
+  save_optimizer: true
+
+grpo:
+  num_prompts_per_step: 16
+  num_generations_per_prompt: 16
+  num_val_generations_per_prompt: 1
+  max_rollout_turns: 1
+  max_num_epochs: 100
+  max_num_steps: 1000000
+  normalize_rewards: true
+  use_leave_one_out_baseline: true
+  advantage_clip_low: -100
+  advantage_clip_high: 100
+  val_period: 10
+  val_at_start: false
+  val_at_end: false
+  overlong_filtering: true
+  max_val_samples: null
+  val_batch_size: 256
+  seed: 42
+  invalid_tool_call_strategy: ""
+
+  use_dynamic_sampling: false
+  dynamic_sampling_max_gen_batches: 10
+  batch_multiplier: 1
+
+  penalize_invalid_tool_call: true
+  invalid_tool_call_advantage: -5.0
+  penalize_malformed_thinking: true
+  malformed_thinking_advantage: -5.0
+
+  reward_shaping:
+    enabled: false
+    overlong_buffer_length: 128
+    overlong_buffer_penalty: 1
+    max_response_length: ${policy.max_total_sequence_length}
+    stop_properly_penalty_coef: null
+  reward_scaling:
+    enabled: false
+    source_min: 0.0
+    source_max: 1.0
+    target_min: 0.0
+    target_max: 1.0
+
+  async_grpo:
+    enabled: true
+    max_trajectory_age_steps: 1
+    in_flight_weight_updates: true
+    recompute_kv_cache_after_weight_updates: false
+
+  seq_logprob_error_threshold: 2
+
+loss_fn:
+  reference_policy_kl_penalty: 0.0
+  reference_policy_kl_type: "k3"
+  kl_input_clamp_value: null
+  kl_output_clamp_value: null
+  ratio_clip_min: 0.2
+  ratio_clip_max: 0.28
+  ratio_clip_c: null
+  use_on_policy_kl_approximation: true
+  use_importance_sampling_correction: true
+  truncated_importance_sampling_ratio: 5.0
+  truncated_importance_sampling_ratio_min: null
+  truncated_importance_sampling_type: tis
+  sequence_level_importance_ratios: false
+  token_level_loss: true
+  force_on_policy_ratio: true
+  use_kl_in_reward: false
+
+policy:
+  model_name: "/lustre/fsw/portfolios/llmservice/users/igitman/hf_models/Qwen3-30B-A3B-Thinking-2507"
+  tokenizer:
+    name: ${policy.model_name}
+    chat_template_kwargs:
+      enable_thinking: true
+  hf_config_overrides: {}
+  train_global_batch_size: 256
+  train_micro_batch_size: 1
+  generation_batch_size: 64
+  logprob_batch_size: 1
+  max_total_sequence_length: 131072
+  precision: "bfloat16"
+  logprob_chunk_size: 2048
+  offload_optimizer_for_logprob: false
+
+  dtensor_cfg:
+    _v2: true
+    enabled: false
+    cpu_offload: False
+ sequence_parallel: false + activation_checkpointing: false + tensor_parallel_size: 1 + context_parallel_size: 1 + custom_parallel_plan: null + + megatron_cfg: + enabled: true + gradient_accumulation_fusion: false + empty_unused_memory_level: 1 + activation_checkpointing: true + tensor_model_parallel_size: 2 + expert_tensor_parallel_size: 1 + expert_model_parallel_size: 8 + pipeline_model_parallel_size: 2 + num_layers_in_first_pipeline_stage: null + num_layers_in_last_pipeline_stage: null + context_parallel_size: 4 + pipeline_dtype: ${policy.precision} + sequence_parallel: true + freeze_moe_router: true + moe_router_dtype: "fp32" + moe_router_load_balancing_type: "none" + moe_router_bias_update_rate: 1.0e-3 + moe_permute_fusion: true + moe_enable_deepep: false + moe_token_dispatcher_type: "alltoall" + moe_aux_loss_coeff: 0.0 + moe_router_enable_expert_bias: true + moe_shared_expert_overlap: false + apply_rope_fusion: True + bias_activation_fusion: False + defer_fp32_logits: True + moe_per_layer_logging: True + + optimizer: + optimizer: "adam" + lr: 1.0e-6 + min_lr: 1.0e-6 + weight_decay: 0.0 + bf16: true + fp16: false + params_dtype: "float32" + adam_beta1: 0.9 + adam_beta2: 0.999 + adam_eps: 1e-8 + sgd_momentum: 0.9 + use_distributed_optimizer: true + use_precision_aware_optimizer: true + clip_grad: ${policy.max_grad_norm} + optimizer_cpu_offload: false + optimizer_offload_fraction: 0.0 + + scheduler: + start_weight_decay: ${policy.megatron_cfg.optimizer.weight_decay} + end_weight_decay: ${policy.megatron_cfg.optimizer.weight_decay} + weight_decay_incr_style: "constant" + lr_decay_style: "constant" + lr_decay_iters: 1000000 + lr_warmup_iters: 0 + lr_warmup_init: 0 + + distributed_data_parallel_config: + grad_reduce_in_fp32: false + overlap_grad_reduce: false + overlap_param_gather: false + use_custom_fsdp: false + data_parallel_sharding_strategy: "optim_grads_params" + + mtp_loss_scaling_factor: 0.0 + mtp_use_repeated_layer: false + mtp_num_layers: 0 + 
mtp_detach_heads: false + + fp8_cfg: + enabled: false + fp8: "e4m3" + fp8_recipe: "blockwise" + fp8_param: false + + env_vars: null + + dynamic_batching: + enabled: False + train_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.train_micro_batch_size}} + logprob_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.logprob_batch_size}} + sequence_length_round: 64 + + sequence_packing: + enabled: True + train_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.train_micro_batch_size}} + logprob_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.logprob_batch_size}} + algorithm: "modified_first_fit_decreasing" + sequence_length_round: 64 + + make_sequence_length_divisible_by: 8 + max_grad_norm: 1.0 + + optimizer: null + scheduler: null + + generation: + port_range_low: 11001 + port_range_high: 15000 + backend: "vllm" + max_new_tokens: ${policy.max_total_sequence_length} + temperature: 1.0 + top_p: 1.0 + top_k: null + stop_token_ids: null + stop_strings: null + vllm_cfg: + enable_prefix_caching: true + async_engine: true + precision: ${policy.precision} + kv_cache_dtype: "auto" + tensor_parallel_size: 2 + pipeline_parallel_size: 1 + expert_parallel_size: 1 + gpu_memory_utilization: 0.8 + max_model_len: ${policy.max_total_sequence_length} + enforce_eager: False + enforce_monotonicity: false + use_deep_gemm: False + num_last_layers_in_bf16: 0 + num_first_layers_in_bf16: 0 + enable_vllm_metrics_logger: true + vllm_metrics_logger_interval: 0.5 + expose_http_server: true + skip_tokenizer_init: false + enable_thinking: true + http_server_serving_chat_kwargs: + enable_auto_tools: true + tool_parser: hermes + reasoning_parser: deepseek_r1 + chat_template: | + {%- if tools %} + {{- '<|im_start|>system\n' }} + {%- if messages[0].role == 'system' %} + {{- messages[0].content + '\n\n' }} + {%- endif %} + {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures 
within <tools></tools> XML tags:\n<tools>" }} + {%- for tool in tools %} + {{- "\n" }} + {{- tool | tojson }} + {%- endfor %} + {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }} + {%- else %} + {%- if messages[0].role == 'system' %} + {{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }} + {%- endif %} + {%- endif %} + {%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %} + {%- for message in messages[::-1] %} + {%- set index = (messages|length - 1) - loop.index0 %} + {%- if ns.multi_step_tool and message.role == "user" and message.content is string and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %} + {%- set ns.multi_step_tool = false %} + {%- set ns.last_query_index = index %} + {%- endif %} + {%- endfor %} + {%- for message in messages %} + {%- if message.content is string %} + {%- set content = message.content %} + {%- else %} + {%- set content = '' %} + {%- endif %} + {%- if (message.role == "user") or (message.role == "system" and not loop.first) %} + {{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }} + {%- elif message.role == "assistant" %} + {%- set reasoning_content = '' %} + {%- if message.reasoning_content is string %} + {%- set reasoning_content = message.reasoning_content %} + {%- else %} + {%- if '</think>' in content %} + {%- set reasoning_content = content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %} + {%- set content = content.split('</think>')[-1].lstrip('\n') %} + {%- endif %} + {%- endif %} + {%- if reasoning_content %} + {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }} + {%- else %} + {{- '<|im_start|>' + message.role + '\n' + content }} + {%- endif %} + {%- if message.tool_calls %} + {%- for tool_call in message.tool_calls %} + {%- if (loop.first and content) or (not loop.first) %} + {{- '\n' }} + {%- endif 
%} + {%- if tool_call.function %} + {%- set tool_call = tool_call.function %} + {%- endif %} + {{- '<tool_call>\n{"name": "' }} + {{- tool_call.name }} + {{- '", "arguments": ' }} + {%- if tool_call.arguments is string %} + {{- tool_call.arguments }} + {%- else %} + {{- tool_call.arguments | tojson }} + {%- endif %} + {{- '}\n</tool_call>' }} + {%- endfor %} + {%- endif %} + {{- '<|im_end|>\n' }} + {%- elif message.role == "tool" %} + {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %} + {{- '<|im_start|>user' }} + {%- endif %} + {{- '\n<tool_response>\n' }} + {{- content }} + {{- '\n</tool_response>' }} + {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %} + {{- '<|im_end|>\n' }} + {%- endif %} + {%- endif %} + {%- endfor %} + {%- if add_generation_prompt %} + {{- '<|im_start|>assistant\n<think>\n' }} + {%- endif %} + default_chat_template_kwargs: + enable_thinking: true + truncate_history_thinking: false + + vllm_kwargs: + mamba_ssm_cache_dtype: "float32" + compilation_config: + cudagraph_capture_sizes: [1,2,4,8,16,32,64] + + colocated: + enabled: false + resources: + gpus_per_node: 8 + num_nodes: 4 + +data: + max_input_seq_length: null + shuffle: false + num_workers: 1 + use_multiple_dataloader: false + train: + data_path: "/lustre/fsw/portfolios/llmservice/projects/llmservice_modelalignment_ppo/users/sdevare/repos/nano/dataset/rl/swe_all_datasets_train_w_agent_ref_r2e_gym_subset.jsonl" + validation: + data_path: "/lustre/fsw/portfolios/llmservice/projects/llmservice_modelalignment_ppo/users/sdevare/repos/nano/dataset/rl/swe_all_datasets_train_w_agent_ref_r2e_gym_subset.jsonl" + default: + dataset_name: NemoGymDataset + env_name: "nemo_gym" + prompt_file: null + system_prompt_file: null + processor: "nemo_gym_data_processor" + +env: + should_use_nemo_gym: true + should_log_nemo_gym_responses: false + nemo_gym: + skip_venv_if_present: true + port_range_low: 15001 + port_range_high: 20000 + config_paths: + - responses_api_models/vllm_model/configs/vllm_model_for_training.yaml + - 
responses_api_agents/swe_agents/configs/swebench_openhands_training.yaml + swe_agents_train: + responses_api_agents: + swe_agents: + agent_max_turns: 100 + concurrency: 768 + swebench_agent_timeout: 3600 + run_with_mixed_prompts: true + dataset_path: ${data.train.data_path} + container_formatter: + - "/lustre/fsw/portfolios/llmservice/users/igitman/images/swe-bench/swebench_sweb.eval.x86_64.{instance_id}.sif" + - "/lustre/fsw/portfolios/llmservice/users/sdevare/swe_sweapro/images_train/sweap.{instance_id}.sif" + - "/lustre/fsw/portfolios/llmservice/users/igitman/images/swe-bench/namanjain12_{instance_id}.sif" + - "/lustre/fsw/portfolios/llmservice/users/sdevare/swe_sweapro/images_train/sweap.{instance_id}.sif" + - "/lustre/fsw/portfolios/llmservice/users/igitman/images/swe-bench/swebench_sweb.eval.x86_64{instance_id}.sif" + swe_agents_val: + responses_api_agents: + swe_agents: + agent_max_turns: 200 + concurrency: 768 + swebench_agent_timeout: 3600 + dataset_path: ${data.validation.data_path} + container_formatter: + - "/lustre/fsw/portfolios/llmservice/users/igitman/images/swe-bench/swebench_sweb.eval.x86_64.{instance_id}.sif" + - "/lustre/fsw/portfolios/llmservice/users/sdevare/swe_sweapro/images_train/sweap.{instance_id}.sif" + - "/lustre/fsw/portfolios/llmservice/users/igitman/images/swe-bench/namanjain12_{instance_id}.sif" + - "/lustre/fsw/portfolios/llmservice/users/sdevare/swe_sweapro/images_train/sweap.{instance_id}.sif" + - "/lustre/fsw/portfolios/llmservice/users/igitman/images/swe-bench/swebench_sweb.eval.x86_64{instance_id}.sif" + use_absolute_ip: true + +logger: + log_dir: "logs" + num_val_samples_to_print: 0 + wandb_enabled: true + tensorboard_enabled: false + mlflow_enabled: false + monitor_gpus: true + swanlab_enabled: false + wandb: + project: "ruit-nemo-rl" + name: "qwen3-30b-thinking-swe-rl" + tensorboard: {} + mlflow: + experiment_name: "qwen3-30b-thinking-swe-rl" + run_name: "qwen3-30b-thinking-swe-rl" + gpu_monitoring: + collection_interval: 
10 + flush_interval: 10 + +cluster: + gpus_per_node: 8 + num_nodes: 16 diff --git a/examples/swe_bench/run_grpo_repro_baseline_swe2.sh b/examples/swe_bench/run_grpo_repro_baseline_swe2.sh new file mode 100644 index 0000000000..52e99a0217 --- /dev/null +++ b/examples/swe_bench/run_grpo_repro_baseline_swe2.sh @@ -0,0 +1,384 @@ +#!/bin/bash +# ============================================================================ +# REPRO of baseline's SUCCESSFUL SWE2 run (wandb nvidia/binhu-nemo-rl/dc3m70us). +# +# Goal: reproduce the working setup that resolves ~8% from step 1, to confirm +# the zero-reward issue was the container/vLLM (broken hermes tool parser), not +# the model or config. +# +# Matched to dc3m70us: +# Code: NeMo-RL @ commit a760f1c (current dir) +# Container: ruit-swe_bench (mcore + apptainer; vLLM where the hermes tool +# parser works -> agent makes real tool calls) +# COMMAND: baseline-style (uv run --frozen, NO --extra mcore, +# NRL_IGNORE_VERSION_MISMATCH=1, NEMO_GYM_SKIP_VENV_IF_PRESENT=1) +# Parallel: TP=2, EP=8, CP=4, PP=2 (dc3m70us used TP=2, NOT TP=4) +# Model: SWE1 step_230_hf +# +# Kept as ruit's: account, env source, cache, wandb project (swe-benchmark). +# +# Usage: bash examples/swe_bench/run_grpo_repro_baseline_swe2.sh +# ============================================================================ + +set -e + +# ============================ Paths ============================ +# Auto-detected from this script's location (examples/swe_bench/), so the run +# works from any clone of the repo. Override by exporting REPO_ROOT. +REPO_ROOT="${REPO_ROOT:-$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." 
&& pwd)}" +CONFIG_FILE="${REPO_ROOT}/examples/swe_bench/grpo_qwen3_30b_async_swe.yaml" +CHECKPOINT_ROOT="${REPO_ROOT}/results" +TRAIN_DATA_PATH="/lustre/fsw/portfolios/llmservice/projects/llmservice_modelalignment_ppo/users/sdevare/repos/nano/dataset/rl/swe_all_datasets_train_w_agent_ref_r2e_gym_subset.jsonl" +VAL_DATA_PATH="${TRAIN_DATA_PATH}" +# SWE1 step_230 HF checkpoint (exactly what dc3m70us trained from). +DEFAULT_MODEL_PATH="/lustre/fsw/portfolios/coreai/users/bihu/repos/nemo-rl-async-swe/results/qwen3-30b-thinking-swe1-async-age1-pps64-gpp8-gbs512-lr1e-06/step_230_hf" +MODEL_PATH="${1:-${MODEL_PATH:-${DEFAULT_MODEL_PATH}}}" + +# ================ Container and mount config ================ +# SWE training container (mcore + apptainer baked in, working hermes tool parser +# so the agent emits real tool calls). +export CONTAINER=${CONTAINER:-/lustre/fsw/portfolios/coreai/users/ruit/enroot-images/docker_images:ruit-swe_bench-6de99f772-x86_64-060326-mcore-apptainer.squashfs} +GYM_CODE="${REPO_ROOT}/3rdparty/Gym-workspace/Gym" +export MOUNTS="/lustre:/lustre,$PWD:$PWD,${GYM_CODE}:/opt/nemo-rl/3rdparty/Gym-workspace/Gym" + +# ======================= Cluster / resources ======================= +# SKIP_TRAINING=1 -> generation-only benchmark: training is a no-op (no optimizer, +# weights frozen, refit every step + keep-alive matmul), pinned to ONE node. +SKIP_TRAINING="${SKIP_TRAINING:-0}" +NUM_ACTOR_NODES=${NUM_NODES:-16} +NUM_GENERATION_NODES=${NUM_GEN_NODES:-8} # only used in async (non-colocated) mode +if [ "${SKIP_TRAINING}" = "1" ]; then + # no real training -> 1 training node suffices (train_nodes = total - gen). + NUM_ACTOR_NODES=$(( NUM_GENERATION_NODES + 1 )) +fi +NUM_GPU=8 +export GPUS_PER_NODE=${NUM_GPU} +export CPUS_PER_WORKER=114 + +# ============================ Parallelism ============================ +# SKIP_TRAINING -> training must fit 1 node, so model_parallel = TP*CP*PP <= 8. 
+if [ "${SKIP_TRAINING}" = "1" ]; then + TP=8; EP=8; CP=1; PP=1 # model_parallel=8 (fits 1 node), train_DP=1 +else + TP=4; EP=8; CP=4; PP=2 # dc3m70us-style real training +fi +VLLM_TP=2 +MIN_PAD=1 +if [ ${CP} -gt 1 ]; then MIN_PAD=$((MIN_PAD * CP * 2)); fi +if [ ${TP} -gt 1 ]; then MIN_PAD=$((MIN_PAD * TP)); fi +MAKE_SEQ_DIVISIBLE_BY=${MIN_PAD} + +# ===================== Sequence length & packing ===================== +SEQLEN=131072 +SEQUENCE_PACKING=True + +# ================= Sync/Async mode & async GRPO settings ================= +ASYNC_GRPO_ENABLED=True +MAX_TRAJECTORY_AGE_STEPS=1 +FORCE_ON_POLICY_RATIO=True +INFLIGHT_WEIGHT_UPDATE=True +RECOMPUTE_KV_CACHE_AFTER_WEIGHT_UPDATES=False +SEQ_LOGPROB_ERROR_THRESHOLD=null +if [ "${ASYNC_GRPO_ENABLED}" = "True" ]; then + COLOCATED_ENABLED=False + VLLM_GPU_UTIL=0.8 + OVERLAP_GRAD_REDUCE=False + ADVANTAGE_CLIP_LOW=-100 + ADVANTAGE_CLIP_HIGH=100 + TIS_THRESHOLD=5 +else + COLOCATED_ENABLED=True + VLLM_GPU_UTIL=0.5 + OVERLAP_GRAD_REDUCE=True +fi + +# ========================= GRPO / sampling ========================= +PPS=8 +GPP=8 +GBS=64 +NORMALIZE_REWARDS=True +OVERLONG_FILTERING=True + +# ========================== Loss function ========================== +KL=0 +CLIP_MIN=0.2 +CLIP_MAX=0.28 +USE_ON_POLICY_KL_APPROXIMATION=True +IMPORTANCE_SAMPLING_CORRECTION=True +SEQ_LEVEL_IS=False +TOKEN_LEVEL_LOSS=True + +# ============================ Optimizer ============================ +LR="1e-06" + +# =============================== MoE =============================== +MOE_FREEZE_ROUTER=True +MOE_PERMUTE_FUSION=True +MOE_ENABLE_DEEPEP=False +MOE_TOKEN_DISPATCHER_TYPE="alltoall" +MOE_AUX_LOSS_COEFF=0 +MOE_ROUTER_LOAD_BALANCING_TYPE="none" +MOE_ROUTER_BIAS_UPDATE_RATE="1e-3" + +# ======================= Generation / vLLM ======================= +TEMPERATURE=1.0 + +# =================== Checkpointing & validation =================== +SAVE_PERIOD=5 +VAL_PERIOD=1000 +KEEP_TOP_K=2 + +# ============================ SWE agent 
============================ +AGENT_MAX_TURNS=200 +AGENT_TIMEOUT=1800 + +# ============================== Logging ============================== +WANDB_PROJ="swe-benchmark" +# Log full trajectories to wandb so we can verify function_call items appear. +LOG_GYM_RESPONSES=true + +# ========================= SLURM submission ========================= +SBATCH_ACCOUNT="nemotron_sw_post" +SBATCH_PARTITION="batch" +SBATCH_TIME="4:0:0" + +# ========================= Experiment naming ========================= +if [ "${ASYNC_GRPO_ENABLED}" = "True" ]; then + SYNC_MODE="async-age${MAX_TRAJECTORY_AGE_STEPS}" +else + SYNC_MODE="sync" +fi +MODE_TAG="" +if [ "${SKIP_TRAINING}" = "1" ]; then MODE_TAG="notrain-"; fi +EXP_SUFFIX="${EXP_SUFFIX:-repro-baseline-swe2-${MODE_TAG}${SYNC_MODE}-pps${PPS}-gpp${GPP}-gbs${GBS}-lr${LR}-tp${TP}}" +WANDB_NAME="${EXP_SUFFIX}" +CHECKPOINT_DIR="${CHECKPOINT_ROOT}/${EXP_SUFFIX}" +SNAPSHOT_DIR="${REPO_ROOT}" + +mkdir -p "${CHECKPOINT_DIR}" + +# ============= Unified SLURM/Ray log location ============= +export BASE_LOG_DIR="${BASE_LOG_DIR:-${SNAPSHOT_DIR}/logs/slurm}" +mkdir -p "${BASE_LOG_DIR}" + +# ========================= Environment variables ========================= +# NOTE: credentials have been removed from this shared copy. Before running, +# export the following yourself (e.g. from your own env script): +# HF_HOME=... HF_TOKEN=... (HuggingFace cache + token) +export HF_DATASETS_CACHE="${HF_DATASETS_CACHE:-${HF_HOME}/datasets}" +export UV_CACHE_DIR=/tmp/uv_cache +export UV_LOCK_TIMEOUT=3600 +export RAY_DEDUP_LOGS=1 +export SSL_CERT_FILE=/etc/ssl/certs/ca-certificates.crt +export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt +export CURL_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt +export OMP_NUM_THREADS=16 + +# ========================= Node-local cache config ========================= +# Defaults under your own $HOME so you don't write into anyone else's dir. 
+PERSISTENT_CACHE="${PERSISTENT_CACHE:-${HOME}/.cache/qwen3_30b_thinking_swe_repro_baseline}" +export LUSTRE_VLLM_CACHE="${PERSISTENT_CACHE}/vllm_compile_cache" +export LUSTRE_INDUCTOR_CACHE="${PERSISTENT_CACHE}/inductor_cache" +export LUSTRE_TRITON_CACHE="${PERSISTENT_CACHE}/triton_cache" +export NRL_VLLM_LOCAL_CACHE_DIR="/tmp/nemo_rl_vllm_cache" +export NRL_VLLM_CACHE_SEED_DIR="/tmp/nemo_rl_vllm_cache_warm" +export INDUCTOR_CACHE_DIR="/tmp/nemo_rl_inductor_cache" +export TRITON_CACHE_DIR="/tmp/nemo_rl_triton_cache" +export CACHE_SYNC_FREQUENCY=120 +mkdir -p "${LUSTRE_VLLM_CACHE}" "${LUSTRE_INDUCTOR_CACHE}" "${LUSTRE_TRITON_CACHE}" + +# ============================== Summary ============================== +echo "==========================================" +echo "REPRO of baseline dc3m70us | Experiment: ${EXP_SUFFIX}" +echo "Container: ruit-swe_bench (mcore+apptainer) | SKIP_TRAINING=${SKIP_TRAINING}" +echo "Mode: ${SYNC_MODE}, Colocated: ${COLOCATED_ENABLED}" +echo "Nodes: ${NUM_ACTOR_NODES} total (train=$(( NUM_ACTOR_NODES - NUM_GENERATION_NODES )), gen=${NUM_GENERATION_NODES}), GPUs/node: ${NUM_GPU}" +echo "Parallelism: TP=${TP}, EP=${EP}, CP=${CP}, PP=${PP}, vLLM_TP=${VLLM_TP}, pad=${MAKE_SEQ_DIVISIBLE_BY}" +echo "Training: PPS=${PPS}, GPP=${GPP}, GBS=${GBS}, LR=${LR}" +echo "Model: ${MODEL_PATH}" +echo "Checkpoint: ${CHECKPOINT_DIR}" +echo "==========================================" + +cd "${SNAPSHOT_DIR}" + +# ================ SETUP_COMMAND (baseline's: install apptainer + seed caches + uv sync) ================ +read -r -d '' SETUP_COMMAND </dev/null || true +RET=1 +RETRIES=3 +for attempt in \$(seq 1 \$RETRIES); do + if command -v apptainer >/dev/null 2>&1 || command -v singularity >/dev/null 2>&1; then + echo "[SETUP] singularity/apptainer already available" + RET=0 + break + fi + cd /tmp && \ + wget --no-check-certificate -q https://github.com/apptainer/apptainer/releases/download/v1.3.1/apptainer_1.3.1_amd64.deb && \ + apt install -y 
./apptainer_1.3.1_amd64.deb && \ + ln -sf /usr/bin/apptainer /usr/bin/singularity + if command -v apptainer >/dev/null 2>&1; then + echo "[SETUP] apptainer installed successfully" + RET=0 + break + fi + echo "[SETUP] apptainer install attempt \$attempt failed, retrying..." + sleep 10 +done +if [ \$RET -ne 0 ]; then + echo "[SETUP] WARNING: apptainer installation failed after \$RETRIES attempts" +fi + +echo "[CACHE SEED] Clearing stale /tmp caches and seeding from Lustre..." +rm -rf /tmp/nemo_rl_vllm_cache /tmp/nemo_rl_vllm_cache_* +rm -rf "${INDUCTOR_CACHE_DIR}" "${TRITON_CACHE_DIR}" +mkdir -p "${INDUCTOR_CACHE_DIR}" "${TRITON_CACHE_DIR}" + +find "${LUSTRE_INDUCTOR_CACHE}" -maxdepth 1 -name '.tmp_*' -mmin +30 -exec rm -rf {} + 2>/dev/null || true +find "${LUSTRE_TRITON_CACHE}" -maxdepth 1 -name '.tmp_*' -mmin +30 -exec rm -rf {} + 2>/dev/null || true + +_seed_cache() { + local lustre="\$1" local_dir="\$2" name="\$3" + if [ -d "\$lustre" ] && [ "\$(ls -A "\$lustre" 2>/dev/null)" ]; then + rsync -a --exclude '.tmp_*' "\$lustre/" "\$local_dir/" 2>/dev/null \ + && echo "[CACHE SEED] \$name: seeded from Lustre" \ + || echo "[CACHE SEED] \$name: seed failed (non-fatal)" + else + echo "[CACHE SEED] \$name: no warm cache on Lustre yet" + fi +} + +_seed_cache "${LUSTRE_INDUCTOR_CACHE}" "${INDUCTOR_CACHE_DIR}" "Inductor" +_seed_cache "${LUSTRE_TRITON_CACHE}" "${TRITON_CACHE_DIR}" "Triton" +echo "[CACHE SEED] Done." 
+ +UV_HTTP_TIMEOUT=3600 \ + uv sync --frozen --extra mcore +SETUPEOF +export SETUP_COMMAND + +# ================ Training command (baseline-style: uv run --frozen, no --extra mcore) ================ +export COMMAND="NRL_VLLM_USE_V1=1 \ + NRL_WG_USE_RAY_REF=1 \ + HF_HOME=${HF_HOME} \ + HF_DATASETS_CACHE=${HF_DATASETS_CACHE} \ + UV_CACHE_DIR=${UV_CACHE_DIR} \ + VLLM_ATTENTION_BACKEND=FLASH_ATTN \ + VLLM_CACHE_ROOT=${LUSTRE_VLLM_CACHE} \ + DG_JIT_CACHE_DIR=${LUSTRE_VLLM_CACHE}/deep_gemm \ + VLLM_DEEP_GEMM_WARMUP=skip \ + NRL_FORCE_REBUILD_VENVS=false \ + NRL_IGNORE_VERSION_MISMATCH=1 \ + RAY_ENABLE_UV_RUN_RUNTIME_ENV=0 \ + UV_HTTP_TIMEOUT=3600 \ + UV_LOCK_TIMEOUT=900 \ + TORCH_CUDA_ARCH_LIST='9.0 10.0' \ + NEMO_GYM_SKIP_VENV_IF_PRESENT=1 \ + uv run --frozen --extra mcore ./examples/nemo_gym/run_grpo_nemo_gym.py \ + --config=${CONFIG_FILE} \ + cluster.num_nodes=${NUM_ACTOR_NODES} \ + cluster.gpus_per_node=${NUM_GPU} \ + ++data.train.data_path=${TRAIN_DATA_PATH} \ + ++data.validation.data_path=${VAL_DATA_PATH} \ + grpo.num_prompts_per_step=${PPS} \ + grpo.num_generations_per_prompt=${GPP} \ + grpo.val_at_start=False \ + grpo.normalize_rewards=${NORMALIZE_REWARDS} \ + grpo.overlong_filtering=${OVERLONG_FILTERING} \ + grpo.val_period=${VAL_PERIOD} \ + grpo.seq_logprob_error_threshold=${SEQ_LOGPROB_ERROR_THRESHOLD} \ + grpo.async_grpo.enabled=${ASYNC_GRPO_ENABLED} \ + grpo.async_grpo.in_flight_weight_updates=${INFLIGHT_WEIGHT_UPDATE} \ + grpo.async_grpo.recompute_kv_cache_after_weight_updates=${RECOMPUTE_KV_CACHE_AFTER_WEIGHT_UPDATES} \ + grpo.async_grpo.max_trajectory_age_steps=${MAX_TRAJECTORY_AGE_STEPS} \ + env.should_log_nemo_gym_responses=${LOG_GYM_RESPONSES} \ + policy.generation.colocated.enabled=${COLOCATED_ENABLED} \ + policy.model_name=${MODEL_PATH} \ + policy.max_total_sequence_length=${SEQLEN} \ + policy.dynamic_batching.enabled=False \ + policy.train_global_batch_size=${GBS} \ + policy.make_sequence_length_divisible_by=${MAKE_SEQ_DIVISIBLE_BY} \ + 
policy.offload_optimizer_for_logprob=true \ + policy.sequence_packing.enabled=${SEQUENCE_PACKING} \ + policy.megatron_cfg.tensor_model_parallel_size=${TP} \ + policy.megatron_cfg.expert_model_parallel_size=${EP} \ + policy.megatron_cfg.context_parallel_size=${CP} \ + policy.megatron_cfg.pipeline_model_parallel_size=${PP} \ + policy.megatron_cfg.sequence_parallel=True \ + policy.megatron_cfg.bias_activation_fusion=False \ + policy.megatron_cfg.distributed_data_parallel_config.overlap_grad_reduce=${OVERLAP_GRAD_REDUCE} \ + policy.megatron_cfg.moe_permute_fusion=${MOE_PERMUTE_FUSION} \ + policy.megatron_cfg.moe_enable_deepep=${MOE_ENABLE_DEEPEP} \ + policy.megatron_cfg.moe_token_dispatcher_type=${MOE_TOKEN_DISPATCHER_TYPE} \ + policy.megatron_cfg.moe_aux_loss_coeff=${MOE_AUX_LOSS_COEFF} \ + policy.megatron_cfg.moe_router_load_balancing_type=${MOE_ROUTER_LOAD_BALANCING_TYPE} \ + policy.megatron_cfg.moe_router_bias_update_rate=${MOE_ROUTER_BIAS_UPDATE_RATE} \ + policy.megatron_cfg.freeze_moe_router=${MOE_FREEZE_ROUTER} \ + policy.megatron_cfg.optimizer.lr=${LR} \ + policy.megatron_cfg.optimizer.min_lr=${LR} \ + policy.megatron_cfg.optimizer.weight_decay=0 \ + policy.megatron_cfg.empty_unused_memory_level=2 \ + policy.megatron_cfg.activation_checkpointing=True \ + policy.generation.temperature=${TEMPERATURE} \ + policy.generation.vllm_cfg.tensor_parallel_size=${VLLM_TP} \ + policy.generation.vllm_cfg.gpu_memory_utilization=${VLLM_GPU_UTIL} \ + policy.generation.vllm_cfg.skip_tokenizer_init=False \ + loss_fn.reference_policy_kl_penalty=${KL} \ + loss_fn.ratio_clip_min=${CLIP_MIN} \ + loss_fn.ratio_clip_max=${CLIP_MAX} \ + loss_fn.use_on_policy_kl_approximation=${USE_ON_POLICY_KL_APPROXIMATION} \ + loss_fn.use_importance_sampling_correction=${IMPORTANCE_SAMPLING_CORRECTION} \ + loss_fn.sequence_level_importance_ratios=${SEQ_LEVEL_IS} \ + loss_fn.token_level_loss=${TOKEN_LEVEL_LOSS} \ + loss_fn.force_on_policy_ratio=${FORCE_ON_POLICY_RATIO} \ + 
checkpointing.checkpoint_dir=${CHECKPOINT_DIR} \ + checkpointing.save_period=${SAVE_PERIOD} \ + checkpointing.keep_top_k=${KEEP_TOP_K} \ + ++checkpointing.metric_name=train:total_reward/mean \ + ++checkpointing.checkpoint_must_save_by=00:03:35:00 \ + logger.wandb_enabled=True \ + logger.wandb.name=${WANDB_NAME} \ + logger.wandb.project=${WANDB_PROJ}" + +if [ "${ASYNC_GRPO_ENABLED}" = "True" ]; then + export COMMAND="${COMMAND} \ + policy.generation.colocated.resources.num_nodes=${NUM_GENERATION_NODES} \ + policy.generation.colocated.resources.gpus_per_node=${NUM_GPU} \ + grpo.advantage_clip_low=${ADVANTAGE_CLIP_LOW} \ + grpo.advantage_clip_high=${ADVANTAGE_CLIP_HIGH} \ + loss_fn.truncated_importance_sampling_ratio=${TIS_THRESHOLD} \ + env.nemo_gym.swe_agents_train.responses_api_agents.swe_agents.agent_max_turns=${AGENT_MAX_TURNS} \ + env.nemo_gym.swe_agents_train.responses_api_agents.swe_agents.swebench_agent_timeout=${AGENT_TIMEOUT} \ + env.nemo_gym.swe_agents_val.responses_api_agents.swe_agents.agent_max_turns=${AGENT_MAX_TURNS} \ + env.nemo_gym.swe_agents_val.responses_api_agents.swe_agents.swebench_agent_timeout=${AGENT_TIMEOUT}" +fi + +# Generation-only benchmark: no-op training (no optimizer) + disable checkpoint saving. 
+if [ "${SKIP_TRAINING}" = "1" ]; then + export COMMAND="${COMMAND} ++grpo.gen_benchmark_skip_training=true checkpointing.enabled=false" +fi + +# ================ Submit job ================ +sbatch \ + --nodes="${NUM_ACTOR_NODES}" \ + --account="${SBATCH_ACCOUNT}" \ + --job-name="${WANDB_NAME}" \ + --partition="${SBATCH_PARTITION}" \ + --time="${SBATCH_TIME}" \ + --gres=gpu:${NUM_GPU} \ + --output="${BASE_LOG_DIR}/slurm-%j.out" \ + --exclusive \ + --dependency=singleton \ + --comment='{"OccupiedIdleGPUsJobReaper":{"exemptIdleTimeMins":"180","reason":"data_loading","description":"Async GRPO SWE2 repro of baseline dc3m70us"}}' \ + ray.sub | tee /dev/stderr | grep -o '[0-9]\+' > latest_repro_baseline_job_id.txt + +JOB_ID="$(cat latest_repro_baseline_job_id.txt)" +echo "==========================================" +echo "Job submitted: ${EXP_SUFFIX}" +echo "Job ID: ${JOB_ID}" +echo "Monitor with: squeue -j ${JOB_ID}" +echo "Ray/SLURM logs: ${BASE_LOG_DIR}/${JOB_ID}-logs/" +echo "Checkpoints: ${CHECKPOINT_DIR}/" +echo "==========================================" + +cd - > /dev/null diff --git a/examples/swe_bench/run_grpo_swe2_scale_gen.sh b/examples/swe_bench/run_grpo_swe2_scale_gen.sh new file mode 100644 index 0000000000..23533985de --- /dev/null +++ b/examples/swe_bench/run_grpo_swe2_scale_gen.sh @@ -0,0 +1,502 @@ +#!/bin/bash +# ============================================================================ +# GENERATION-SCALING launcher for async SWE GRPO (derived from +# run_grpo_repro_baseline_swe2.sh / bihu dc3m70us). +# +# Single knob: NUM_VLLM_REPLICAS (R) -> number of vLLM generation replicas. 
+# Everything else is auto-derived to hold these invariants constant so that +# runs at different R are directly comparable: +# - per generation-replica workload : samples/replica/step = 2 +# - per training-GPU workload : GBS / train_DP = 32 +# - train:gen node ratio : 1:1 (matches the bihu 8+8 baseline) +# +# Derivation (REPLICAS_PER_NODE = gpus_per_node / VLLM_TP = 8/2 = 4): +# GEN_NODES = R / 4 +# TRAIN_NODES = R / 4 (linear follow; override with TRAIN_NODES=) +# TOTAL_NODES = TRAIN_NODES+GEN_NODES = R/2 -> sbatch --nodes & cluster.num_nodes +# PPS = 2*R / GPP = R/4 +# GBS = PPS*GPP = 2*R (force_on_policy_ratio requires ==) +# CONCURRENCY = max(768, GBS*age) +# R must be a multiple of 16 (train world = 2R must satisfy Megatron +# model-parallel & expert-parallel divisibility; gen must fill whole nodes). +# R=32 exactly reproduces the bihu repro (16 nodes = 8+8, PPS=8, GBS=64). +# +# All runs of this sweep share one wandb group (WANDB_GROUP) under project +# swe-benchmark for easy comparison. +# +# Usage: +# NUM_VLLM_REPLICAS=64 bash examples/swe_bench/run_grpo_swe2_scale_gen.sh +# NUM_VLLM_REPLICAS=64 DRY_RUN=1 bash examples/swe_bench/run_grpo_swe2_scale_gen.sh # print config, no submit +# SKIP_TRAINING=1 NUM_VLLM_REPLICAS=4 bash examples/swe_bench/run_grpo_swe2_scale_gen.sh # generation-only (no-op train, 1 node, R%4) +# Optional env: SKIP_TRAINING, TRAIN_NODES, WANDB_GROUP, EXP_SUFFIX, MODEL_PATH, CONTAINER, +# MAX_NUM_STEPS, SBATCH_TIME, PERSISTENT_CACHE, BASE_LOG_DIR +# Credentials are NOT sourced here — export HF_HOME / HF_TOKEN / WANDB_API_KEY yourself. +# ============================================================================ + +set -e + +# ============================ Paths ============================ +# Auto-detected from this script's location (examples/swe_bench/), so it works from +# any clone of the repo. Override by exporting REPO_ROOT. +REPO_ROOT="${REPO_ROOT:-$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." 
&& pwd)}" +CONFIG_FILE="${REPO_ROOT}/examples/swe_bench/grpo_qwen3_30b_async_swe.yaml" +CHECKPOINT_ROOT="${REPO_ROOT}/results" +TRAIN_DATA_PATH="/lustre/fsw/portfolios/llmservice/projects/llmservice_modelalignment_ppo/users/sdevare/repos/nano/dataset/rl/swe_all_datasets_train_w_agent_ref_r2e_gym_subset.jsonl" +VAL_DATA_PATH="${TRAIN_DATA_PATH}" +# SWE1 step_230 HF checkpoint (exactly what dc3m70us trained from). +DEFAULT_MODEL_PATH="/lustre/fsw/portfolios/coreai/users/bihu/repos/nemo-rl-async-swe/results/qwen3-30b-thinking-swe1-async-age1-pps64-gpp8-gbs512-lr1e-06/step_230_hf" +MODEL_PATH="${1:-${MODEL_PATH:-${DEFAULT_MODEL_PATH}}}" + +# ================ Container and mount config ================ +# SWE training container (mcore + apptainer, working hermes tool parser). +export CONTAINER=${CONTAINER:-/lustre/fsw/portfolios/coreai/users/ruit/enroot-images/docker_images:ruit-swe_bench-6de99f772-x86_64-060326-mcore-apptainer.squashfs} +GYM_CODE="${REPO_ROOT}/3rdparty/Gym-workspace/Gym" +export MOUNTS="/lustre:/lustre,$PWD:$PWD,${GYM_CODE}:/opt/nemo-rl/3rdparty/Gym-workspace/Gym,$PWD/nemo_rl:/opt/nemo-rl/nemo_rl" + +# ======================= Cluster / resources ======================= +NUM_GPU=8 +export GPUS_PER_NODE=${NUM_GPU} +export CPUS_PER_WORKER=114 + +# ============================ Parallelism ============================ +# SKIP_TRAINING=1 -> generation-only benchmark: training is a no-op on a SINGLE node +# (no optimizer, weights frozen, refit every step + keep-alive matmul). Training +# parallelism must fit 1 node, so model_parallel = TP*CP*PP must divide gpus_per_node(=8). 
+SKIP_TRAINING="${SKIP_TRAINING:-0}" +if [ "${SKIP_TRAINING}" = "1" ]; then + TP=8; EP=8; CP=1; PP=1; ETP=1 # model_parallel = 8 (fits 1 node), train_DP=1 +else + TP=4; EP=8; CP=4; PP=2; ETP=1 # linear-train default (model_parallel=32) +fi +VLLM_TP=2 +BACKEND="${BACKEND:-vllm}" # vllm | sglang +SGLANG_CHAT_TEMPLATE="${SGLANG_CHAT_TEMPLATE:-/lustre/fsw/portfolios/llmservice/users/spanev/swe2-repro/qwen3_swe_chat_template.jinja}" +MIN_PAD=1 +if [ ${CP} -gt 1 ]; then MIN_PAD=$((MIN_PAD * CP * 2)); fi +if [ ${TP} -gt 1 ]; then MIN_PAD=$((MIN_PAD * TP)); fi +MAKE_SEQ_DIVISIBLE_BY=${MIN_PAD} + +# ================= Generation-scaling: derive all sizes from R ================= +GPP=8 # generations per prompt (fixed) +SAMPLES_PER_REPLICA=2 # invariant: samples/replica/step +BASE_CONCURRENCY=768 # nemo-gym fan-out floor +REPLICAS_PER_NODE=$(( NUM_GPU / VLLM_TP )) # = 4 +MODEL_PARALLEL=$(( TP * CP * PP )) # = 32 +EXPERT_TMP=$(( ETP * EP * PP )) # = 16 + +NUM_VLLM_REPLICAS="${NUM_VLLM_REPLICAS:-}" +if [ -z "${NUM_VLLM_REPLICAS}" ]; then + echo "ERROR: NUM_VLLM_REPLICAS is required (number of vLLM replicas). e.g. NUM_VLLM_REPLICAS=64" >&2 + exit 1 +fi + +# Smallest valid step for R. +gcd() { local a=$1 b=$2 t; while [ ${b} -ne 0 ]; do t=${b}; b=$(( a % b )); a=${t}; done; echo ${a}; } +lcm() { echo $(( $1 / $(gcd $1 $2) * $2 )); } +if [ "${SKIP_TRAINING}" = "1" ]; then + # train fixed at 1 node (train_world=8, divisible by model_parallel=8); only gen + # must fill whole nodes -> R need only be a multiple of REPLICAS_PER_NODE (=4). + R_STEP=${REPLICAS_PER_NODE} +else + # linear train: train_world=2R must be divisible by model-parallel & expert sizes. 
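+ # Worked example for the default layout (TP=4, CP=4, PP=2, ETP=1, EP=8), where
+ # MODEL_PARALLEL = 4*4*2 = 32 and EXPERT_TMP = 1*8*2 = 16:
+ #   L            = lcm(32, 16)     = 32
+ #   R_STEP_TRAIN = 32 / gcd(2, 32) = 16
+ #   R_STEP       = lcm(16, 4)      = 16
+ # so R = 16, 32, 48, ... are accepted and R = 24 is rejected.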
+ L=$(lcm ${MODEL_PARALLEL} ${EXPERT_TMP}) # train-world divisor + R_STEP_TRAIN=$(( L / $(gcd 2 ${L}) )) # since train_world = 2R + R_STEP=$(lcm ${R_STEP_TRAIN} ${REPLICAS_PER_NODE}) +fi +if [ $(( NUM_VLLM_REPLICAS % R_STEP )) -ne 0 ] || [ ${NUM_VLLM_REPLICAS} -lt ${R_STEP} ]; then + echo "ERROR: NUM_VLLM_REPLICAS must be a positive multiple of ${R_STEP} (got ${NUM_VLLM_REPLICAS})." >&2 + exit 1 +fi + +GEN_NODES=$(( NUM_VLLM_REPLICAS / REPLICAS_PER_NODE )) +if [ "${SKIP_TRAINING}" = "1" ]; then + TRAIN_NODES="${TRAIN_NODES:-1}" # no-op training: single node +else + TRAIN_NODES="${TRAIN_NODES:-${GEN_NODES}}" # linear 1:1 follow by default +fi +TOTAL_NODES=$(( TRAIN_NODES + GEN_NODES )) +PPS=$(( SAMPLES_PER_REPLICA * NUM_VLLM_REPLICAS / GPP )) +GBS=$(( PPS * GPP )) +CONCURRENCY=$(( GBS * 1 )) # GBS * max_trajectory_age_steps(=1) +if [ ${CONCURRENCY} -lt ${BASE_CONCURRENCY} ]; then CONCURRENCY=${BASE_CONCURRENCY}; fi + +# Sanity: training divisibility (also re-checks any TRAIN_NODES override). +TRAIN_WORLD=$(( TRAIN_NODES * NUM_GPU )) +if [ $(( TRAIN_WORLD % MODEL_PARALLEL )) -ne 0 ] || [ $(( TRAIN_WORLD % EXPERT_TMP )) -ne 0 ]; then + echo "ERROR: train world ${TRAIN_WORLD} (TRAIN_NODES=${TRAIN_NODES}) not divisible by model-parallel ${MODEL_PARALLEL} / expert ${EXPERT_TMP}." >&2 + exit 1 +fi +TRAIN_DP=$(( TRAIN_WORLD / MODEL_PARALLEL )) +if [ $(( GBS % TRAIN_DP )) -ne 0 ]; then + echo "ERROR: GBS ${GBS} not divisible by train DP ${TRAIN_DP}." 
>&2 + exit 1 +fi +PER_GPU_BATCH=$(( GBS / TRAIN_DP )) +PER_REPLICA_SAMPLES=$(( GBS / NUM_VLLM_REPLICAS )) + +# ===================== Sequence length & packing ===================== +SEQLEN=131072 +SEQUENCE_PACKING=True + +# ================= Sync/Async mode & async GRPO settings ================= +ASYNC_GRPO_ENABLED=True +MAX_TRAJECTORY_AGE_STEPS=1 +FORCE_ON_POLICY_RATIO=True +INFLIGHT_WEIGHT_UPDATE=True +RECOMPUTE_KV_CACHE_AFTER_WEIGHT_UPDATES=False +SEQ_LOGPROB_ERROR_THRESHOLD=null +if [ "${ASYNC_GRPO_ENABLED}" = "True" ]; then + COLOCATED_ENABLED=False + VLLM_GPU_UTIL=0.8 + OVERLAP_GRAD_REDUCE=False + ADVANTAGE_CLIP_LOW=-100 + ADVANTAGE_CLIP_HIGH=100 + TIS_THRESHOLD=5 +else + COLOCATED_ENABLED=True + VLLM_GPU_UTIL=0.5 + OVERLAP_GRAD_REDUCE=True +fi + +# ========================= GRPO / sampling ========================= +NORMALIZE_REWARDS=True +OVERLONG_FILTERING=True + +# ========================== Loss function ========================== +KL=0 +CLIP_MIN=0.2 +CLIP_MAX=0.28 +USE_ON_POLICY_KL_APPROXIMATION=True +IMPORTANCE_SAMPLING_CORRECTION=True +SEQ_LEVEL_IS=False +TOKEN_LEVEL_LOSS=True + +# ============================ Optimizer ============================ +LR="1e-06" + +# =============================== MoE =============================== +MOE_FREEZE_ROUTER=True +MOE_PERMUTE_FUSION=True +MOE_ENABLE_DEEPEP=False +MOE_TOKEN_DISPATCHER_TYPE="alltoall" +MOE_AUX_LOSS_COEFF=0 +MOE_ROUTER_LOAD_BALANCING_TYPE="none" +MOE_ROUTER_BIAS_UPDATE_RATE="1e-3" + +# ======================= Generation / vLLM ======================= +TEMPERATURE=${TEMPERATURE:-1.0} + +# =================== Checkpointing & validation =================== +SAVE_PERIOD=5 +VAL_PERIOD=1000 +KEEP_TOP_K=2 + +# ============================ SWE agent ============================ +AGENT_MAX_TURNS=200 +AGENT_TIMEOUT=1800 + +# ============================== Logging ============================== +WANDB_PROJ="swe-benchmark" +# Shared group for the whole generation-scaling sweep (compare runs by R). 
+WANDB_GROUP="${WANDB_GROUP:-swe-gen-scale-linear}" +# Log full trajectories to wandb so we can verify function_call items appear. +LOG_GYM_RESPONSES=true + +# ========================= SLURM submission ========================= +SBATCH_ACCOUNT=${SBATCH_ACCOUNT:-nemotron_agents_dev} +SBATCH_PARTITION=${SBATCH_PARTITION:-backfill} +SBATCH_TIME="${SBATCH_TIME:-4:0:0}" +# Optional smoke-test knob: cap training steps (appended as ++grpo.max_num_steps). Empty = use YAML default. +MAX_NUM_STEPS="${MAX_NUM_STEPS:-}" + +# ========================= Experiment naming ========================= +if [ "${ASYNC_GRPO_ENABLED}" = "True" ]; then + SYNC_MODE="async-age${MAX_TRAJECTORY_AGE_STEPS}" +else + SYNC_MODE="sync" +fi +EXP_SUFFIX="${EXP_SUFFIX:-swe-genscale-${SYNC_MODE}-genrep${NUM_VLLM_REPLICAS}-nodes${TOTAL_NODES}-pps${PPS}-gpp${GPP}-gbs${GBS}-lr${LR}}" +WANDB_NAME="${EXP_SUFFIX}" +CHECKPOINT_DIR="${CHECKPOINT_ROOT}/${EXP_SUFFIX}" +SNAPSHOT_DIR="${REPO_ROOT}" + +mkdir -p "${CHECKPOINT_DIR}" + +# ============= Unified SLURM/Ray log location ============= +export BASE_LOG_DIR="${BASE_LOG_DIR:-${SNAPSHOT_DIR}/logs/swe_bench_scale}" +mkdir -p "${BASE_LOG_DIR}" + +# ========================= Environment variables ========================= +# Credentials are NOT sourced here. 
Export these yourself before submitting: +# HF_HOME, HF_TOKEN, WANDB_API_KEY (and GITHUB_TOKEN / GITLAB_TOKEN if needed) +export HUGGINGFACE_TOKEN="${HUGGINGFACE_TOKEN:-${HF_TOKEN}}" +export GITLAB_TOKEN="${GITLAB_TOKEN:-}" +export HF_DATASETS_CACHE="${HF_DATASETS_CACHE:-${HF_HOME}/datasets}" +export UV_CACHE_DIR="${UV_CACHE_DIR:-/tmp/uv_cache}" # sglang: set to /root/.cache/uv (baked prebuilt wheels) to skip ~40min compile +# Safe TE persistence (option B, seed-style — NO /root/.cache/uv override, so ray is untouched): +# the SETUP_COMMAND below rsyncs this Lustre seed (a harvested /tmp/uv_cache that already has the +# compiled transformer-engine wheel) into /tmp/uv_cache before the run, so the COMMAND's uv finds +# the prebuilt TE and skips the ~20-40min recompile. Empty seed => harmless (falls back to compile). +export LUSTRE_UV_CACHE_SEED="${LUSTRE_UV_CACHE_SEED:-}" +export UV_LOCK_TIMEOUT=3600 +export RAY_DEDUP_LOGS=1 +export SSL_CERT_FILE=/etc/ssl/certs/ca-certificates.crt +export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt +export CURL_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt +export OMP_NUM_THREADS=16 + +# ========================= Node-local cache config ========================= +PERSISTENT_CACHE="${PERSISTENT_CACHE:-${HOME}/.cache/qwen3_30b_thinking_swe_scale}" +export LUSTRE_VLLM_CACHE="${PERSISTENT_CACHE}/vllm_compile_cache" +export LUSTRE_INDUCTOR_CACHE="${PERSISTENT_CACHE}/inductor_cache" +export LUSTRE_TRITON_CACHE="${PERSISTENT_CACHE}/triton_cache" +export NRL_VLLM_LOCAL_CACHE_DIR="/tmp/nemo_rl_vllm_cache" +export NRL_VLLM_CACHE_SEED_DIR="/tmp/nemo_rl_vllm_cache_warm" +export INDUCTOR_CACHE_DIR="/tmp/nemo_rl_inductor_cache" +export TRITON_CACHE_DIR="/tmp/nemo_rl_triton_cache" +export CACHE_SYNC_FREQUENCY=120 +mkdir -p "${LUSTRE_VLLM_CACHE}" "${LUSTRE_INDUCTOR_CACHE}" "${LUSTRE_TRITON_CACHE}" + +# ============================== Summary ============================== +echo "==========================================" +echo "SWE 
generation-scaling | Experiment: ${EXP_SUFFIX}" +echo "Mode: ${SYNC_MODE}, Colocated: ${COLOCATED_ENABLED}" +echo "wandb: project=${WANDB_PROJ}, group=${WANDB_GROUP}, name=${WANDB_NAME}" +echo "------------------------------------------" +echo "Scaling input: NUM_VLLM_REPLICAS = ${NUM_VLLM_REPLICAS} (R-step=${R_STEP})" +echo " replicas/node = ${REPLICAS_PER_NODE} (vllm_tp=${VLLM_TP})" +echo " GEN_NODES = ${GEN_NODES}" +echo " TRAIN_NODES = ${TRAIN_NODES} (train_DP=${TRAIN_DP})" +echo " TOTAL_NODES = ${TOTAL_NODES}" +echo " PPS = ${PPS}" +echo " GPP = ${GPP}" +echo " GBS = ${GBS}" +echo " CONCURRENCY = ${CONCURRENCY}" +echo " invariants : samples/replica=${PER_REPLICA_SAMPLES}, batch/train-GPU=${PER_GPU_BATCH}" +echo "Parallelism: TP=${TP}, EP=${EP}, CP=${CP}, PP=${PP}, vLLM_TP=${VLLM_TP}, pad=${MAKE_SEQ_DIVISIBLE_BY}" +echo "Model: ${MODEL_PATH}" +echo "Checkpoint: ${CHECKPOINT_DIR}" +echo "==========================================" + +cd "${SNAPSHOT_DIR}" + +# ================ SETUP_COMMAND (bihu's: install apptainer + seed caches + uv sync) ================ +read -r -d '' SETUP_COMMAND <<SETUPEOF || true +RET=1 +RETRIES=3 +for attempt in \$(seq 1 \$RETRIES); do + if command -v apptainer >/dev/null 2>&1 || command -v singularity >/dev/null 2>&1; then + echo "[SETUP] singularity/apptainer already available" + RET=0 + break + fi + cd /tmp && \ + wget --no-check-certificate -q https://github.com/apptainer/apptainer/releases/download/v1.3.1/apptainer_1.3.1_amd64.deb && \ + apt install -y ./apptainer_1.3.1_amd64.deb && \ + ln -sf /usr/bin/apptainer /usr/bin/singularity + if command -v apptainer >/dev/null 2>&1; then + echo "[SETUP] apptainer installed successfully" + RET=0 + break + fi + echo "[SETUP] apptainer install attempt \$attempt failed, retrying..." + sleep 10 +done +if [ \$RET -ne 0 ]; then + echo "[SETUP] WARNING: apptainer installation failed after \$RETRIES attempts" +fi + +echo "[CACHE SEED] Clearing stale /tmp caches and seeding from Lustre..."
+rm -rf /tmp/nemo_rl_vllm_cache /tmp/nemo_rl_vllm_cache_* +rm -rf "${INDUCTOR_CACHE_DIR}" "${TRITON_CACHE_DIR}" +mkdir -p "${INDUCTOR_CACHE_DIR}" "${TRITON_CACHE_DIR}" + +find "${LUSTRE_INDUCTOR_CACHE}" -maxdepth 1 -name '.tmp_*' -mmin +30 -exec rm -rf {} + 2>/dev/null || true +find "${LUSTRE_TRITON_CACHE}" -maxdepth 1 -name '.tmp_*' -mmin +30 -exec rm -rf {} + 2>/dev/null || true + +_seed_cache() { + local lustre="\$1" local_dir="\$2" name="\$3" + if [ -d "\$lustre" ] && [ "\$(ls -A "\$lustre" 2>/dev/null)" ]; then + rsync -a --exclude '.tmp_*' "\$lustre/" "\$local_dir/" 2>/dev/null \ + && echo "[CACHE SEED] \$name: seeded from Lustre" \ + || echo "[CACHE SEED] \$name: seed failed (non-fatal)" + else + echo "[CACHE SEED] \$name: no warm cache on Lustre yet" + fi +} + +_seed_cache "${LUSTRE_INDUCTOR_CACHE}" "${INDUCTOR_CACHE_DIR}" "Inductor" +_seed_cache "${LUSTRE_TRITON_CACHE}" "${TRITON_CACHE_DIR}" "Triton" +mkdir -p /tmp/uv_cache +_seed_cache "${LUSTRE_UV_CACHE_SEED}" "/tmp/uv_cache" "uv (prebuilt transformer-engine)" +echo "[CACHE SEED] Done." 
+ +UV_HTTP_TIMEOUT=3600 \ + uv sync --frozen --extra mcore +SETUPEOF +export SETUP_COMMAND + +# ================ Training command (bihu-style: uv run --frozen --extra mcore) ================ +# ===== backend-specific generation overrides (single-line; expanded into COMMAND) ===== +if [ "${BACKEND}" = "sglang" ]; then + GEN_BACKEND_OVERRIDES="++policy.generation.backend=sglang ++policy.generation.sglang_cfg.model_path=${MODEL_PATH} ++policy.generation.sglang_cfg.random_seed=42 ++policy.generation.sglang_cfg.dp_size=1 ++policy.generation.sglang_cfg.ep_size=1 ++policy.generation.sglang_cfg.pp_size=1 ++policy.generation.sglang_cfg.skip_server_warmup=true ++policy.generation.sglang_cfg.context_length=${SEQLEN} ++policy.generation.sglang_cfg.dtype=bfloat16 ++policy.generation.sglang_cfg.mem_fraction_static=0.55 ++policy.generation.sglang_cfg.disable_piecewise_cuda_graph=true ++policy.generation.sglang_cfg.disable_cuda_graph=${SGLANG_DISABLE_CUDA_GRAPH:-false} ++policy.generation.sglang_cfg.tool_call_parser=hermes ++policy.generation.sglang_cfg.reasoning_parser=qwen3-thinking ++policy.generation.sglang_cfg.chat_template=${SGLANG_CHAT_TEMPLATE} ++policy.generation.sglang_server.needs_offload=false ++policy.generation.sglang_server.cpu_weight_backup=false ++policy.generation.sglang_server.sglang_server_concurrency=${CONCURRENCY} ++policy.generation.sglang_server.pause_generation_mode=retract ++policy.generation.sglang_server.num_gpus=$(( GEN_NODES * NUM_GPU )) ++policy.generation.sglang_server.num_gpus_per_engine=${NUM_GPU} ++policy.generation.sglang_router.enabled=false ++env.nemo_gym.policy_model.responses_api_models.vllm_model.engine=sglang ++env.nemo_gym.policy_model.responses_api_models.vllm_model.sglang_chat_template_path=${SGLANG_CHAT_TEMPLATE} ++env.nemo_gym.policy_model.responses_api_models.vllm_model.sglang_max_total_sequence_length=${SEQLEN}" +else + GEN_BACKEND_OVERRIDES="++policy.generation.backend=vllm
policy.generation.vllm_cfg.tensor_parallel_size=${VLLM_TP} policy.generation.vllm_cfg.gpu_memory_utilization=${VLLM_GPU_UTIL} policy.generation.vllm_cfg.skip_tokenizer_init=False" +fi + +export COMMAND="NRL_VLLM_USE_V1=1 \ + NRL_REFIT_BUFFER_MEMORY_RATIO=0.018 \ + NRL_WG_USE_RAY_REF=1 \ + WANDB_API_KEY=${WANDB_API_KEY} \ + WANDB_MODE=${WANDB_MODE} \ + HUGGINGFACE_TOKEN=${HUGGINGFACE_TOKEN} \ + GITHUB_TOKEN=${GITHUB_TOKEN} \ + GITLAB_TOKEN=${GITLAB_TOKEN} \ + HF_HOME=${HF_HOME} \ + HF_DATASETS_CACHE=${HF_DATASETS_CACHE} \ + UV_CACHE_DIR=${UV_CACHE_DIR} \ + VLLM_ATTENTION_BACKEND=FLASH_ATTN \ + VLLM_CACHE_ROOT=${LUSTRE_VLLM_CACHE} \ + DG_JIT_CACHE_DIR=${LUSTRE_VLLM_CACHE}/deep_gemm \ + VLLM_DEEP_GEMM_WARMUP=skip \ + NRL_FORCE_REBUILD_VENVS=false \ + NRL_IGNORE_VERSION_MISMATCH=1 \ + RAY_ENABLE_UV_RUN_RUNTIME_ENV=0 \ + UV_HTTP_TIMEOUT=3600 \ + UV_LOCK_TIMEOUT=900 \ + TORCH_CUDA_ARCH_LIST='9.0 10.0' \ + NEMO_GYM_SKIP_VENV_IF_PRESENT=1 \ + uv run --frozen --extra mcore ./examples/nemo_gym/run_grpo_nemo_gym.py \ + --config=${CONFIG_FILE} \ + cluster.num_nodes=${TOTAL_NODES} \ + cluster.gpus_per_node=${NUM_GPU} \ + ++data.train.data_path=${TRAIN_DATA_PATH} \ + ++data.validation.data_path=${VAL_DATA_PATH} \ + grpo.num_prompts_per_step=${PPS} \ + grpo.num_generations_per_prompt=${GPP} \ + grpo.val_at_start=False \ + grpo.normalize_rewards=${NORMALIZE_REWARDS} \ + grpo.overlong_filtering=${OVERLONG_FILTERING} \ + grpo.val_period=${VAL_PERIOD} \ + grpo.seq_logprob_error_threshold=${SEQ_LOGPROB_ERROR_THRESHOLD} \ + grpo.async_grpo.enabled=${ASYNC_GRPO_ENABLED} \ + grpo.async_grpo.in_flight_weight_updates=${INFLIGHT_WEIGHT_UPDATE} \ + grpo.async_grpo.recompute_kv_cache_after_weight_updates=${RECOMPUTE_KV_CACHE_AFTER_WEIGHT_UPDATES} \ + grpo.async_grpo.max_trajectory_age_steps=${MAX_TRAJECTORY_AGE_STEPS} \ + env.should_log_nemo_gym_responses=${LOG_GYM_RESPONSES} \ + policy.generation.colocated.enabled=${COLOCATED_ENABLED} \ + policy.model_name=${MODEL_PATH} \ + 
policy.max_total_sequence_length=${SEQLEN} \ + policy.dynamic_batching.enabled=False \ + policy.train_global_batch_size=${GBS} \ + policy.make_sequence_length_divisible_by=${MAKE_SEQ_DIVISIBLE_BY} \ + policy.offload_optimizer_for_logprob=true \ + policy.sequence_packing.enabled=${SEQUENCE_PACKING} \ + policy.megatron_cfg.tensor_model_parallel_size=${TP} \ + policy.megatron_cfg.expert_model_parallel_size=${EP} \ + policy.megatron_cfg.context_parallel_size=${CP} \ + policy.megatron_cfg.pipeline_model_parallel_size=${PP} \ + policy.megatron_cfg.sequence_parallel=True \ + policy.megatron_cfg.bias_activation_fusion=False \ + policy.megatron_cfg.distributed_data_parallel_config.overlap_grad_reduce=${OVERLAP_GRAD_REDUCE} \ + policy.megatron_cfg.moe_permute_fusion=${MOE_PERMUTE_FUSION} \ + policy.megatron_cfg.moe_enable_deepep=${MOE_ENABLE_DEEPEP} \ + policy.megatron_cfg.moe_token_dispatcher_type=${MOE_TOKEN_DISPATCHER_TYPE} \ + policy.megatron_cfg.moe_aux_loss_coeff=${MOE_AUX_LOSS_COEFF} \ + policy.megatron_cfg.moe_router_load_balancing_type=${MOE_ROUTER_LOAD_BALANCING_TYPE} \ + policy.megatron_cfg.moe_router_bias_update_rate=${MOE_ROUTER_BIAS_UPDATE_RATE} \ + policy.megatron_cfg.freeze_moe_router=${MOE_FREEZE_ROUTER} \ + policy.megatron_cfg.optimizer.lr=${LR} \ + policy.megatron_cfg.optimizer.min_lr=${LR} \ + policy.megatron_cfg.optimizer.weight_decay=0 \ + policy.megatron_cfg.empty_unused_memory_level=2 \ + policy.megatron_cfg.activation_checkpointing=True \ + policy.generation.temperature=${TEMPERATURE} \ + ${GEN_BACKEND_OVERRIDES} \ + loss_fn.reference_policy_kl_penalty=${KL} \ + loss_fn.ratio_clip_min=${CLIP_MIN} \ + loss_fn.ratio_clip_max=${CLIP_MAX} \ + loss_fn.use_on_policy_kl_approximation=${USE_ON_POLICY_KL_APPROXIMATION} \ + loss_fn.use_importance_sampling_correction=${IMPORTANCE_SAMPLING_CORRECTION} \ + loss_fn.sequence_level_importance_ratios=${SEQ_LEVEL_IS} \ + loss_fn.token_level_loss=${TOKEN_LEVEL_LOSS} \ + 
loss_fn.force_on_policy_ratio=${FORCE_ON_POLICY_RATIO} \ + checkpointing.checkpoint_dir=${CHECKPOINT_DIR} \ + checkpointing.save_period=${SAVE_PERIOD} \ + checkpointing.keep_top_k=${KEEP_TOP_K} \ + ++checkpointing.metric_name=train:total_reward/mean \ + ++checkpointing.checkpoint_must_save_by=00:03:35:00 \ + logger.wandb_enabled=True \ + logger.wandb.name=${WANDB_NAME} \ + logger.wandb.project=${WANDB_PROJ} \ + ++logger.wandb.group=${WANDB_GROUP}" + +if [ "${ASYNC_GRPO_ENABLED}" = "True" ]; then + export COMMAND="${COMMAND} \ + policy.generation.colocated.resources.num_nodes=${GEN_NODES} \ + policy.generation.colocated.resources.gpus_per_node=${NUM_GPU} \ + grpo.advantage_clip_low=${ADVANTAGE_CLIP_LOW} \ + grpo.advantage_clip_high=${ADVANTAGE_CLIP_HIGH} \ + loss_fn.truncated_importance_sampling_ratio=${TIS_THRESHOLD} \ + env.nemo_gym.swe_agents_train.responses_api_agents.swe_agents.agent_max_turns=${AGENT_MAX_TURNS} \ + env.nemo_gym.swe_agents_train.responses_api_agents.swe_agents.swebench_agent_timeout=${AGENT_TIMEOUT} \ + env.nemo_gym.swe_agents_train.responses_api_agents.swe_agents.concurrency=${CONCURRENCY} \ + env.nemo_gym.swe_agents_val.responses_api_agents.swe_agents.agent_max_turns=${AGENT_MAX_TURNS} \ + env.nemo_gym.swe_agents_val.responses_api_agents.swe_agents.swebench_agent_timeout=${AGENT_TIMEOUT} \ + env.nemo_gym.swe_agents_val.responses_api_agents.swe_agents.concurrency=${CONCURRENCY}" +fi + +# Optional: cap training steps (smoke test). +if [ -n "${MAX_NUM_STEPS}" ]; then + export COMMAND="${COMMAND} grpo.max_num_steps=${MAX_NUM_STEPS}" +fi + +# Generation-only benchmark: no-op training (no optimizer) + disable checkpoint saving. +if [ "${SKIP_TRAINING}" = "1" ]; then + export COMMAND="${COMMAND} ++grpo.gen_benchmark_skip_training=true checkpointing.enabled=false" +fi + +# ================ Submit job (skipped under DRY_RUN=1) ================ +if [ "${DRY_RUN:-0}" = "1" ]; then + echo "" + echo "[DRY_RUN] Not submitting. 
Would run:" + echo "[DRY_RUN] COMMAND:"; echo "$COMMAND" | tr ' ' '\n' | grep -E "backend=|sglang_|vllm_cfg" || true + echo "[DRY_RUN] sbatch --nodes=${TOTAL_NODES} --account=${SBATCH_ACCOUNT} --partition=${SBATCH_PARTITION} --time=${SBATCH_TIME} --gres=gpu:${NUM_GPU} ... ray.sub" + cd - > /dev/null + exit 0 +fi + +# ===== PERSISTENT (idle Ray cluster) mode ===== +if [ "${PERSISTENT:-0}" = "1" ]; then + DRIVER_FILE="${REPO_ROOT}/swe2_driver_${EXP_SUFFIX}.cmd" + printf '%s\n' "${COMMAND}" > "${DRIVER_FILE}" + echo "[PERSISTENT] driver command saved -> ${DRIVER_FILE}" + export COMMAND="" +fi + +sbatch \ + --nodes="${TOTAL_NODES}" \ + --account="${SBATCH_ACCOUNT}" \ + --job-name="${WANDB_NAME}" \ + --partition="${SBATCH_PARTITION}" \ + --time="${SBATCH_TIME}" \ + --gres=gpu:${NUM_GPU} \ + --output="${BASE_LOG_DIR}/slurm-%j.out" \ + --exclusive \ + --dependency=singleton \ + --comment='{"OccupiedIdleGPUsJobReaper":{"exemptIdleTimeMins":"180","reason":"data_loading","description":"Async GRPO SWE generation-scaling benchmark"}}' \ + ray.sub | tee /dev/stderr | grep -o '[0-9]\+' > latest_scale_gen_job_id.txt + +JOB_ID="$(cat latest_scale_gen_job_id.txt)" +echo "==========================================" +echo "Job submitted: ${EXP_SUFFIX}" +echo "Job ID: ${JOB_ID}" +echo "wandb group: ${WANDB_GROUP}" +echo "Monitor with: squeue -j ${JOB_ID}" +echo "Ray/SLURM logs: ${BASE_LOG_DIR}/${JOB_ID}-logs/" +echo "Checkpoints: ${CHECKPOINT_DIR}/" +echo "==========================================" + +cd - > /dev/null
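
The size derivation in the launcher's header can be checked by hand before submitting anything. The sketch below repeats that arithmetic for the default training layout only (TP=4, CP=4, PP=2, VLLM_TP=2, 8 GPUs per node, 1:1 train:gen nodes). `derive_sizes` and the `BATCH_PER_TRAIN_DP` label are hypothetical names introduced for illustration; they are not part of the launcher, and the `SKIP_TRAINING=1` and `TRAIN_NODES=` override paths are deliberately left out.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of the launcher: recompute the sizes the
# launcher derives from R = NUM_VLLM_REPLICAS for the default layout.
derive_sizes() {
  local r="$1"
  # With the defaults, the launcher's R_STEP works out to 16.
  if ! [ "${r}" -gt 0 ] 2>/dev/null || [ $(( r % 16 )) -ne 0 ]; then
    echo "R must be a positive multiple of 16 (got '${r}')" >&2
    return 1
  fi
  local num_gpu=8 vllm_tp=2 gpp=8 samples_per_replica=2 base_concurrency=768
  local model_parallel=$(( 4 * 4 * 2 ))              # TP * CP * PP = 32
  local replicas_per_node=$(( num_gpu / vllm_tp ))   # = 4
  local gen_nodes=$(( r / replicas_per_node ))
  local train_nodes="${gen_nodes}"                   # default 1:1 train:gen ratio
  local pps=$(( samples_per_replica * r / gpp ))
  local gbs=$(( pps * gpp ))
  local concurrency="${gbs}"                         # GBS * max_trajectory_age_steps (=1)
  if [ "${concurrency}" -lt "${base_concurrency}" ]; then
    concurrency="${base_concurrency}"                # apply the fan-out floor
  fi
  local train_dp=$(( train_nodes * num_gpu / model_parallel ))
  echo "GEN_NODES=${gen_nodes}"
  echo "TRAIN_NODES=${train_nodes}"
  echo "TOTAL_NODES=$(( gen_nodes + train_nodes ))"
  echo "PPS=${pps}"
  echo "GBS=${gbs}"
  echo "CONCURRENCY=${concurrency}"
  echo "TRAIN_DP=${train_dp}"
  echo "BATCH_PER_TRAIN_DP=$(( gbs / train_dp ))"
}

derive_sizes 32
```

For R=32 this reports 8 generation nodes plus 8 training nodes (16 total), PPS=8 and GBS=64, which is the 8+8 configuration the header describes. `DRY_RUN=1` on the real launcher remains the authoritative check, since it also honors the overrides this sketch ignores.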