diff --git a/examples/swe_bench/REPRO_swe2.md b/examples/swe_bench/REPRO_swe2.md
new file mode 100644
index 0000000000..8914567a5c
--- /dev/null
+++ b/examples/swe_bench/REPRO_swe2.md
@@ -0,0 +1,398 @@
+# Reproducing the baseline SWE2 Async-GRPO run
+
+Step-by-step guide to reproduce the baseline's successful SWE2 GRPO run
+(wandb `nvidia/binhu-nemo-rl/dc3m70us`) using:
+
+- **Cluster:** `cw-dfw-cs`
+- **Branch:** `ruit/SWE_bench` (repo `github.com/NVIDIA-NeMo/RL`)
+- **Launcher:** `${REPO_ROOT}/examples/swe_bench/run_grpo_repro_baseline_swe2.sh`
+- **Config:** `${REPO_ROOT}/examples/swe_bench/grpo_qwen3_30b_async_swe.yaml` (passed to the launcher via `--config`)
+
+The goal of this run is to confirm that the earlier *zero-reward* failure was
+caused by the **container / vLLM** (a broken hermes tool parser that prevented
+the agent from emitting real tool calls), **not** by the model or the config.
+A correct repro resolves ~8% of SWE-bench instances starting from step 1.
+
+> Run this on **`cw-dfw-cs`**. **Do not run from anyone else's checkout** — clone
+> the repo into your own workspace (§2.1). `REPO_ROOT` below means *your* clone;
+> the launcher auto-detects it from its own location. The model / data / container
+> paths are absolute and world-readable on the `cw-dfw-cs` Lustre.
+
+---
+
+## 1. What this run is
+
+| Item | Value |
+|------|-------|
+| Algorithm | Async GRPO (non-colocated generation) |
+| Model | Qwen3-30B-A3B-Thinking-2507 (MoE, 30B total / 3B active) |
+| Init checkpoint | SWE1 `step_230_hf` (the exact checkpoint dc3m70us trained from) |
+| Train data | R2E-Gym subset (`swe_all_datasets_train_w_agent_ref_r2e_gym_subset.jsonl`) |
+| Eval data | same JSONL (val == train path here) |
+| Env | `swe_agents` (OpenHands agent inside an apptainer/singularity sandbox) |
+| Entry point | `${REPO_ROOT}/examples/nemo_gym/run_grpo_nemo_gym.py` |
+| Scheduler | SLURM (`sbatch` + `ray.sub`) |
+
+---
+
+## 2. Prerequisites
+
+### 2.1 Get the code (clone into your own workspace)
+
+On `cw-dfw-cs`, clone the repo into a directory you own and check out the
+`ruit/SWE_bench` branch. Do **not** run from someone else's checkout.
+
+```bash
+cd /lustre/<your-own-workspace-on-cw-dfw-cs>
+git clone https://github.com/NVIDIA-NeMo/RL.git
+cd RL
+git checkout ruit/SWE_bench
+git submodule update --init --recursive   # needed for the Gym mount (3rdparty/Gym-workspace/Gym)
+
+export REPO_ROOT="$PWD"                     # = your clone; the launcher also auto-detects this
+```
+
+> The launcher runs the code **in place** from your clone (`SNAPSHOT_DIR ==
+> REPO_ROOT`). Whatever is in the working tree at submit time is what runs, so
+> avoid stray local edits.
+
+### 2.2 Container
+
+This run uses the **SWE training container** (`ruit-swe_bench`, with mcore + apptainer
+baked in), not the default NeMo-RL image. Its vLLM has a working hermes tool parser, so
+the agent emits real `function_call` items (a broken parser was the original
+zero-reward failure mode):
+
+```
+/lustre/fsw/portfolios/coreai/users/ruit/enroot-images/docker_images:ruit-swe_bench-6de99f772-x86_64-060326-mcore-apptainer.squashfs
+```
+
+It is wired in via the `CONTAINER` env var (overridable). The job mounts:
+
+```
+/lustre  ->  /lustre
+${REPO_ROOT}  ->  (same path; $PWD)
+${REPO_ROOT}/3rdparty/Gym-workspace/Gym  ->  /opt/nemo-rl/3rdparty/Gym-workspace/Gym
+```
+
+The last mount overlays the in-repo Gym source over the container's Gym so
+your Gym checkout is what runs.
+
+### 2.3 Required files on Lustre
+
+Confirm these absolute paths exist before submitting:
+
+| Path | Purpose |
+|------|---------|
+| `/lustre/fsw/portfolios/coreai/users/bihu/repos/nemo-rl-async-swe/results/qwen3-30b-thinking-swe1-async-age1-pps64-gpp8-gbs512-lr1e-06/step_230_hf` | init checkpoint |
+| `/lustre/fsw/portfolios/llmservice/projects/llmservice_modelalignment_ppo/users/sdevare/repos/nano/dataset/rl/swe_all_datasets_train_w_agent_ref_r2e_gym_subset.jsonl` | train + val data |
+| `${REPO_ROOT}/ray.sub` | SLURM launcher consumed by `sbatch` |
+| `/lustre/fsw/portfolios/coreai/users/ruit/enroot-images/docker_images:ruit-swe_bench-6de99f772-x86_64-060326-mcore-apptainer.squashfs` | training container |
+
+Per-instance SWE-bench `.sif` sandbox images (resolved by `container_formatter`
+in the YAML, first match wins):
+
+```
+/lustre/fsw/portfolios/llmservice/users/igitman/images/swe-bench/swebench_sweb.eval.x86_64.{instance_id}.sif
+/lustre/fsw/portfolios/llmservice/users/sdevare/swe_sweapro/images_train/sweap.{instance_id}.sif
+/lustre/fsw/portfolios/llmservice/users/igitman/images/swe-bench/namanjain12_{instance_id}.sif
+/lustre/fsw/portfolios/llmservice/users/igitman/images/swe-bench/swebench_sweb.eval.x86_64{instance_id}.sif
+```
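+
+The "first match wins" lookup can be pictured as the loop below. This is only a
+sketch: the real resolution is done by `container_formatter` in the YAML, and
+`first_existing_sif` is a hypothetical helper name that assumes each pattern contains
+the literal placeholder `{instance_id}`.
+
+```bash
+# Hypothetical illustration of "first match wins"; not the code the recipe runs.
+# Usage: first_existing_sif <instance_id> <pattern>...
+first_existing_sif() {
+  local instance_id=$1 pattern candidate
+  shift
+  for pattern in "$@"; do
+    # Substitute the placeholder with the concrete instance id.
+    candidate=$(printf '%s' "$pattern" | sed "s|{instance_id}|$instance_id|")
+    if [ -f "$candidate" ]; then
+      echo "$candidate"   # first existing file wins
+      return 0
+    fi
+  done
+  return 1                # no image found for this instance
+}
+```
+
+Calling it with an instance id followed by the four patterns above, in order, prints the
+first `.sif` that exists on disk and returns non-zero when none does.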
+
+### 2.4 Tokens / credentials
+
+Credentials were **stripped from this shared copy** of the launcher — it no
+longer sources any env script. Before submitting, export these yourself:
+
+- `HF_HOME` — HuggingFace cache root (passed through to the job; also used to
+  derive `HF_DATASETS_CACHE`)
+- `HF_TOKEN` — required if the model/tokenizer is gated
+- `WANDB_API_KEY` — required for wandb logging (`logger.wandb_enabled=True`)
+- `GITHUB_TOKEN` — only if your data/repo access needs it
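+
+A quick preflight check can catch a missing export before the job is queued. This is a
+minimal sketch, not something the launcher does: `missing_vars` is a hypothetical helper,
+and treating only `HF_HOME` and `WANDB_API_KEY` as mandatory is an assumption based on
+the list above.
+
+```bash
+# Hypothetical preflight helper; not part of the launcher.
+# Prints the name of every listed variable that is unset or empty.
+missing_vars() {
+  local name value
+  for name in "$@"; do
+    # Indirect lookup: read the variable whose name is stored in $name.
+    eval "value=\${$name-}"
+    if [ -z "$value" ]; then
+      echo "$name"
+    fi
+  done
+}
+
+missing=$(missing_vars HF_HOME WANDB_API_KEY)
+if [ -n "$missing" ]; then
+  echo "export before submitting:" $missing >&2
+fi
+```
+
+Add `HF_TOKEN` or `GITHUB_TOKEN` to the argument list if your setup needs them.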
+
+### 2.5 Caches (created automatically, listed for reference)
+
+The launcher seeds vLLM/inductor/triton caches from a persistent dir under your
+own `$HOME` (override with the `PERSISTENT_CACHE` env var):
+
+```
+Persistent (default ${HOME}/.cache/qwen3_30b_thinking_swe_repro_baseline):
+  .../vllm_compile_cache
+  .../inductor_cache
+  .../triton_cache
+
+Node-local (/tmp, recreated each run):
+  /tmp/nemo_rl_vllm_cache
+  /tmp/nemo_rl_inductor_cache
+  /tmp/nemo_rl_triton_cache
+  /tmp/uv_cache
+```
+
+---
+
+## 3. Key configuration (what gets reproduced)
+
+The launcher overrides the YAML on the command line. The values that define
+the run:
+
+**Cluster / parallelism**
+- `NUM_NODES=16` nodes in total, of which `8` are generation nodes (async, non-colocated) and 8 are training nodes; 8 GPUs/node
+- `TP=4`, `EP=8`, `CP=4`, `PP=2`, `vLLM_TP=2`
+- `make_sequence_length_divisible_by = 32` (auto: `CP*2*TP = 4*2*4`)
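+
+The padding rule and the training model-parallel size can be re-checked with shell
+arithmetic. This is a minimal sketch, assuming the `CP*2*TP` formula quoted above and the
+`TP×CP×PP` group size used in §10; `min_pad` and `model_parallel_gpus` are hypothetical
+helper names, not launcher functions.
+
+```bash
+# Hypothetical helpers for illustration; not part of the launcher.
+
+# Sequence-length padding multiple: CP * 2 * TP.
+min_pad() {
+  local tp=$1 cp=$2
+  echo $(( cp * 2 * tp ))
+}
+
+# GPUs in one training model-parallel group: TP * CP * PP.
+model_parallel_gpus() {
+  local tp=$1 cp=$2 pp=$3
+  echo $(( tp * cp * pp ))
+}
+
+min_pad 4 4                 # defaults TP=4, CP=4
+model_parallel_gpus 4 4 2   # defaults TP=4, CP=4, PP=2
+```
+
+With the defaults both calls print `32`; rerun `min_pad` whenever `TP` or `CP` changes.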
+
+**GRPO / sampling**
+- `num_prompts_per_step=8`, `num_generations_per_prompt=8`, `train_global_batch_size=64`
+- `normalize_rewards=True`, `overlong_filtering=True`
+- Async: `max_trajectory_age_steps=1`, `in_flight_weight_updates=True`,
+  `recompute_kv_cache_after_weight_updates=False`, `force_on_policy_ratio=True`
+- `advantage_clip=[-100, 100]`, `truncated_importance_sampling_ratio=5`
+
+**Loss**
+- `reference_policy_kl_penalty=0` (no KL), `ratio_clip=[0.2, 0.28]`
+- `token_level_loss=True`, `use_importance_sampling_correction=True`,
+  `sequence_level_importance_ratios=False`
+
+**Optimizer / model**
+- `lr=1e-06` (constant), `weight_decay=0`
+- `max_total_sequence_length=131072`, sequence packing on
+- MoE: router frozen, `moe_aux_loss_coeff=0`, `alltoall` dispatcher, deepep off
+
+**SWE agent**
+- `agent_max_turns=200`, `swebench_agent_timeout=1800`
+
+**Logging**
+- wandb project `swe-benchmark`, full Gym responses logged
+  (`should_log_nemo_gym_responses=true`) so you can verify `function_call`
+  items actually appear.
+
+---
+
+## 4. Command flavor (why it differs from the default)
+
+The training command is **baseline-style**, which is what makes the container work:
+
+- `uv run --frozen --extra mcore` (frozen lockfile)
+- `NRL_IGNORE_VERSION_MISMATCH=1` — tolerate the container's vLLM version
+- `NEMO_GYM_SKIP_VENV_IF_PRESENT=1` — reuse the container's Gym venv, don't rebuild
+- `NRL_FORCE_REBUILD_VENVS=false`, `RAY_ENABLE_UV_RUN_RUNTIME_ENV=0`
+- vLLM caches seeded from the persistent cache dir (`PERSISTENT_CACHE`, §2.5), then synced back
+
+The `SETUP_COMMAND` (run once per node before training) installs
+apptainer/singularity (for the SWE sandbox), clears + seeds the inductor/triton
+caches from Lustre, and runs `uv sync --frozen --extra mcore`.
+
+---
+
+## 5. Step-by-step
+
+```bash
+# 1. Clone into your own workspace on cw-dfw-cs and check out the branch
+cd /lustre/<your-own-workspace-on-cw-dfw-cs>
+git clone https://github.com/NVIDIA-NeMo/RL.git
+cd RL
+git checkout ruit/SWE_bench
+git submodule update --init --recursive   # Gym mount (3rdparty/Gym-workspace/Gym)
+export REPO_ROOT="$PWD"
+
+# 2. Export your credentials (stripped from this copy — see §2.4)
+export HF_HOME=/your/hf/home
+export HF_TOKEN=...          # if the model/tokenizer is gated
+export WANDB_API_KEY=...     # for wandb logging
+export GITHUB_TOKEN=...      # only if your data/repo access needs it
+
+# 3. (Optional) sanity-check the shared assets exist (readable on cw-dfw-cs)
+ls "/lustre/fsw/portfolios/coreai/users/ruit/enroot-images/docker_images:ruit-swe_bench-6de99f772-x86_64-060326-mcore-apptainer.squashfs"
+ls -d "/lustre/fsw/portfolios/coreai/users/bihu/repos/nemo-rl-async-swe/results/qwen3-30b-thinking-swe1-async-age1-pps64-gpp8-gbs512-lr1e-06/step_230_hf"
+ls "/lustre/fsw/portfolios/llmservice/projects/llmservice_modelalignment_ppo/users/sdevare/repos/nano/dataset/rl/swe_all_datasets_train_w_agent_ref_r2e_gym_subset.jsonl"
+
+# 4. Submit from your clone. Defaults reproduce dc3m70us; no other args needed.
+bash "${REPO_ROOT}/examples/swe_bench/run_grpo_repro_baseline_swe2.sh"
+```
+
+The script prints a summary, submits via `sbatch`, and writes the job id to
+`${REPO_ROOT}/latest_repro_baseline_job_id.txt`.
+
+### Overridable knobs (env vars)
+
+| Var | Default | Effect |
+|-----|---------|--------|
+| `MODEL_PATH` (also `$1`) | `/lustre/fsw/portfolios/coreai/users/bihu/repos/nemo-rl-async-swe/results/qwen3-30b-thinking-swe1-async-age1-pps64-gpp8-gbs512-lr1e-06/step_230_hf` | init checkpoint |
+| `CONTAINER` | `/lustre/fsw/portfolios/coreai/users/ruit/enroot-images/docker_images:ruit-swe_bench-6de99f772-x86_64-060326-mcore-apptainer.squashfs` | training image |
+| `NUM_NODES` | 16 | total nodes (training + generation) |
+| `NUM_GEN_NODES` | 8 | generation nodes (async only) |
+| `SKIP_TRAINING` | `0` | `1` = generation-only benchmark: no-op training pinned to 1 node (see §9) |
+| `EXP_SUFFIX` | `repro-baseline-swe2-async-age1-pps8-gpp8-gbs64-lr1e-06-tp4` | run + checkpoint dir name (`notrain-` is inserted when `SKIP_TRAINING=1`) |
+| `BASE_LOG_DIR` | `${REPO_ROOT}/logs/slurm` | SLURM/Ray logs |
+
+Example — different init checkpoint, smaller cluster:
+
+```bash
+NUM_NODES=8 NUM_GEN_NODES=4 \
+  bash "${REPO_ROOT}/examples/swe_bench/run_grpo_repro_baseline_swe2.sh" \
+  /path/to/other/step_X_hf
+```
+
+---
+
+## 6. Monitoring
+
+```bash
+JOB_ID=$(cat "${REPO_ROOT}/latest_repro_baseline_job_id.txt")
+
+squeue -j "$JOB_ID"                                     # queue state
+ls "${REPO_ROOT}/logs/slurm/${JOB_ID}-logs/"            # Ray + SLURM logs
+tail -f "${REPO_ROOT}/logs/slurm/slurm-${JOB_ID}.out"   # driver output
+```
+
+- **wandb:** project `swe-benchmark`, run name = `EXP_SUFFIX`.
+- **Checkpoints:**
+  `${REPO_ROOT}/results/${EXP_SUFFIX}/`
+  (save every 5 steps, keep top 2 by `train:total_reward/mean`).
+
+### What "success" looks like
+- `train:total_reward/mean` is **non-zero from step ~1** (the failure mode was
+  identically zero reward).
+- Logged Gym responses contain real `function_call` items (proves the hermes
+  tool parser is working in this container).
+- Resolved rate climbs toward ~8%.
+
+---
+
+## 7. Troubleshooting
+
+| Symptom | Likely cause | Fix |
+|---------|-------------|-----|
+| Reward is identically 0 | wrong container — hermes tool parser broken, no tool calls | confirm `CONTAINER` is the `ruit-swe_bench` squashfs, not the default image |
+| `version mismatch` abort | strict version check | ensure `NRL_IGNORE_VERSION_MISMATCH=1` is in the command (it is, by default) |
+| Gym venv rebuild / slowness | venv rebuilt instead of reused | confirm `NEMO_GYM_SKIP_VENV_IF_PRESENT=1` and the Gym mount are present |
+| Agent can't start sandbox | apptainer/singularity missing or `.sif` images missing | check `SETUP_COMMAND` apptainer install succeeded; verify `container_formatter` paths in the YAML |
+| Token / auth errors | credentials not exported (this copy ships none) | `export HF_HOME`/`HF_TOKEN`/`WANDB_API_KEY`/`GITHUB_TOKEN` before submitting (see §2.4) |
+| OOM / parallelism mismatch | changed `TP`/`EP`/`CP`/`PP` without re-deriving `make_sequence_length_divisible_by` | keep the default parallelism, or recompute `MIN_PAD = CP*2 * TP` |
+
+---
+
+## 8. Reference — exact pinned values
+
+```
+Code:        NeMo-RL @ branch ruit/SWE_bench (run in place from your clone)
+Compute:     cw-dfw-cs (SLURM)
+Repo:        github.com/NVIDIA-NeMo/RL  @  branch ruit/SWE_bench
+REPO_ROOT:   your clone (export REPO_ROOT=<clone>; launcher also auto-detects it)
+Container:   /lustre/fsw/portfolios/coreai/users/ruit/enroot-images/docker_images:ruit-swe_bench-6de99f772-x86_64-060326-mcore-apptainer.squashfs
+Init model:  /lustre/fsw/portfolios/coreai/users/bihu/repos/nemo-rl-async-swe/results/qwen3-30b-thinking-swe1-async-age1-pps64-gpp8-gbs512-lr1e-06/step_230_hf
+Train data:  /lustre/fsw/portfolios/llmservice/projects/llmservice_modelalignment_ppo/users/sdevare/repos/nano/dataset/rl/swe_all_datasets_train_w_agent_ref_r2e_gym_subset.jsonl
+Config:      ${REPO_ROOT}/examples/swe_bench/grpo_qwen3_30b_async_swe.yaml
+Launcher:    ${REPO_ROOT}/examples/swe_bench/run_grpo_repro_baseline_swe2.sh
+Mode:        async-age1, colocated=False
+Resources:   16 nodes total (8 train + 8 gen), 8 GPU/node
+Parallelism: TP=4, EP=8, CP=4, PP=2, vLLM_TP=2, pad=32
+Training:    PPS=8, GPP=8, GBS=64, LR=1e-06
+Loss:        KL=0, clip=[0.2,0.28], token-level, IS correction on, TIS=5
+Agent:       max_turns=200, timeout=1800s
+wandb:       project=swe-benchmark
+Baseline:    nvidia/binhu-nemo-rl/dc3m70us (~8% resolved from step 1)
+```
+
+---
+
+## 9. Generation-only benchmark (skip training)
+
+For **benchmarking generation throughput / scaling** without paying for real
+training, the launcher has a no-op-training mode, gated by the
+`grpo.gen_benchmark_skip_training` flag (added on `ruit/SWE_bench`). Set
+`SKIP_TRAINING=1`:
+
+```bash
+SKIP_TRAINING=1 bash "${REPO_ROOT}/examples/swe_bench/run_grpo_repro_baseline_swe2.sh"
+```
+
+### What it does
+- **`policy.train()` becomes a no-op** — no forward/backward, no optimizer step. The
+  weights stay frozen at the init checkpoint and are **still refit to vLLM every
+  step**, so the async generation / weight-sync cadence stays realistic.
+- **No optimizer is built** (`init_optimizer=False`) — saves memory and startup time.
+- A tiny **keep-alive matmul daemon** runs on each training worker so the cluster's
+  idle-GPU reaper doesn't kill the (otherwise idle) training node.
+- **Checkpoint saving is disabled** (`checkpointing.enabled=false`) — there is no
+  optimizer/training state to save.
+
+### What the launcher changes automatically when `SKIP_TRAINING=1`
+- Training parallelism → **`TP=8, EP=8, CP=1, PP=1`** (model-parallel = 8, fits one
+  node; `train_DP=1`), so training is pinned to a **single node**.
+- `NUM_ACTOR_NODES = NUM_GEN_NODES + 1` → total nodes = `gen + 1` (default `8 + 1 = 9`;
+  8 generation nodes = 32 vLLM replicas at `vLLM_TP=2`).
+- Appends `++grpo.gen_benchmark_skip_training=true checkpointing.enabled=false`.
+- `EXP_SUFFIX` gets a `notrain-` tag.
+
+Everything else (model, data, `PPS=8/GPP=8/GBS=64`, agent settings, container) is
+unchanged, so the per-replica generation workload (`samples/replica = GBS / replicas
+= 64 / 32 = 2`) matches the full run.
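+
+The replica count and per-replica workload quoted above follow from two divisions. The
+sketch below assumes 8 GPUs per node, as in this guide; `vllm_replicas` and
+`samples_per_replica` are hypothetical helper names.
+
+```bash
+# Hypothetical helpers; assumes 8 GPUs per node.
+vllm_replicas() {            # args: gen_nodes vllm_tp
+  echo $(( $1 * 8 / $2 ))
+}
+samples_per_replica() {      # args: gbs replicas
+  echo $(( $1 / $2 ))
+}
+
+replicas=$(vllm_replicas 8 2)   # default: 8 generation nodes, vLLM_TP=2
+echo "replicas=$replicas samples/replica=$(samples_per_replica 64 "$replicas")"
+```
+
+For the defaults this prints `replicas=32 samples/replica=2`.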
+
+### How to verify the scaling is sound (wandb)
+Compare runs at different generation sizes (vary `NUM_GEN_NODES`) within one wandb
+group. The **per-replica** `generation_metrics/*` timelines should stay **flat**
+(invariant) as you add replicas — not grow with scale:
+
+| metric | expectation across scale |
+|--------|--------------------------|
+| `generation_metrics/*inflight_batch_sizes` | flat, low (≈1–3 per replica) |
+| `generation_metrics/*num_pending_samples` | ≈ 0 (no queue backlog) |
+| `generation_metrics/*kv_cache_usage_perc` | flat (≈8–10%) |
+| `generation_metrics/*generation_tokens` | flat per replica per window |
+| worker-trace count | equals the replica count (`gen_gpus / vLLM_TP`) |
+
+> Note: SWE rollouts are **agent / tool-execution-bound** (each sample is a multi-turn
+> OpenHands rollout in an apptainer sandbox), so per-replica inflight/KV stay low and
+> total throughput scales sub-linearly with GPUs — that is expected, not a regression.
+> Weights are frozen, so reward hovers around the init checkpoint's baseline (noisy on
+> small per-step sample counts); this mode is for **throughput/scaling**, not learning.
+
+---
+
+## 10. Generation-scaling sweep launcher (`run_grpo_swe2_scale_gen.sh`)
+
+For sweeping the number of vLLM generation replicas, use the second launcher:
+`${REPO_ROOT}/examples/swe_bench/run_grpo_swe2_scale_gen.sh`. It takes a **single
+knob — `NUM_VLLM_REPLICAS` (R)** — and auto-derives nodes / `num_prompts_per_step` /
+`train_global_batch_size` so the **per-replica generation workload stays constant**
+(`samples/replica/step = 2`) across scales. Same model / data / config / container
+as the baseline run.
+
+```bash
+# preview the derived config without submitting
+NUM_VLLM_REPLICAS=32 DRY_RUN=1 bash "${REPO_ROOT}/examples/swe_bench/run_grpo_swe2_scale_gen.sh"
+
+# a sweep, all in one wandb group for comparison
+for R in 16 32 64; do
+  NUM_VLLM_REPLICAS=$R WANDB_GROUP=swe-gen-scale-sweep \
+    bash "${REPO_ROOT}/examples/swe_bench/run_grpo_swe2_scale_gen.sh"
+done
+```
+
+Derivation (with `GPP=8`, `vLLM_TP=2` → 4 replicas/node):
+
+| mode | `R` constraint | GEN nodes | TRAIN nodes | total | PPS | GBS | train parallelism |
+|------|----------------|-----------|-------------|-------|-----|-----|-------------------|
+| **linear** (default) | multiple of **16** | `R/4` | `R/4` (1:1) | `R/2` | `R/4` | `2R` | TP=4,EP=8,CP=4,PP=2 |
+| **skip-train** (`SKIP_TRAINING=1`) | multiple of **4** | `R/4` | **1** | `R/4 + 1` | `R/4` | `2R` | TP=8,EP=8,CP=1,PP=1 |
+
+`R=32` (linear) reproduces the baseline shape exactly (16 nodes = 8 train + 8 gen,
+PPS=8, GBS=64). The `R%16` requirement in linear mode comes from training scaling
+linearly at TP×CP×PP=32 (train world `2R` must be divisible by 32); `SKIP_TRAINING=1`
+pins training to one node (model-parallel 8) so `R` need only be a multiple of 4 —
+enabling small scales like R=4 (2 nodes) / R=8 (3 nodes). See §9 for the no-op-train
+semantics, and §9's wandb table for what to verify across the sweep.
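+
+The derivation table can be re-computed by hand before submitting. The function below is
+a sketch that mirrors the table, not the launcher's actual code: `derive_scale` is a
+hypothetical name, and it assumes `GPP=8`, `vLLM_TP=2` and 8 GPUs per node.
+
+```bash
+# Hypothetical re-derivation of the table above; the launcher's own logic may differ.
+# Usage: derive_scale <R> [skip_training: 0|1]
+derive_scale() {
+  local r=$1 skip=${2:-0}
+  local multiple=16
+  [ "$skip" = "1" ] && multiple=4
+  if [ $(( r % multiple )) -ne 0 ]; then
+    echo "R=$r must be a multiple of $multiple" >&2
+    return 1
+  fi
+  local gen_nodes=$(( r / 4 ))          # 4 replicas per generation node
+  local train_nodes=$gen_nodes          # linear mode: 1:1 with generation nodes
+  [ "$skip" = "1" ] && train_nodes=1    # skip-train mode: pinned to one node
+  local pps=$(( r / 4 ))
+  local gbs=$(( 2 * r ))                # = PPS * GPP
+  echo "gen=$gen_nodes train=$train_nodes total=$(( gen_nodes + train_nodes )) pps=$pps gbs=$gbs"
+}
+
+derive_scale 32      # gen=8 train=8 total=16 pps=8 gbs=64
+derive_scale 4 1     # gen=1 train=1 total=2 pps=1 gbs=8
+```
+
+`DRY_RUN=1` on the real launcher remains the authoritative preview; this is only a
+cross-check of the arithmetic.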
+
+### Knobs (env vars)
+
+| Var | Default | Effect |
+|-----|---------|--------|
+| `NUM_VLLM_REPLICAS` | *(required)* | number of vLLM replicas (R) |
+| `SKIP_TRAINING` | `0` | `1` = no-op training on 1 node (`R` must be a multiple of 4); otherwise linear training (`R` must be a multiple of 16) |
+| `TRAIN_NODES` | derived | override training node count |
+| `WANDB_GROUP` | `swe-gen-scale-linear` | wandb group (use one per sweep) |
+| `MAX_NUM_STEPS` | *(unset)* | cap training steps (handy for a quick smoke) |
+| `SBATCH_TIME` | `4:0:0` | SLURM walltime |
+| `DRY_RUN` | `0` | `1` = print the derived config and exit (no `sbatch`) |
+
+Job id is written to `${REPO_ROOT}/latest_scale_gen_job_id.txt`.
diff --git a/examples/swe_bench/REPRO_swe2_sglang.md b/examples/swe_bench/REPRO_swe2_sglang.md
new file mode 100644
index 0000000000..4e922d9797
--- /dev/null
+++ b/examples/swe_bench/REPRO_swe2_sglang.md
@@ -0,0 +1,232 @@
+# Reproducing SWE2 Async-GRPO on **SGLang** (Qwen3-30B-A3B-Thinking)
+
+Self-contained guide to run the multi-turn SWE-bench agentic GRPO recipe with the
+**SGLang** generation backend, at parity with the vLLM baseline (rollout completeness +
+throughput) and at **training-grade per-token logprob parity** with vLLM.
+
+This is the SGLang counterpart of [`REPRO_swe2.md`](./REPRO_swe2.md) (the vLLM baseline).
+Everything needed lives in **one clone** — you do **not** need to fetch any other PR.
+
+---
+
+## 0. TL;DR
+
+```bash
+# One clone has everything (RL + splice-fixed Gym + grafted SGLang backend).
+git clone --recurse-submodules -b swe2-qwen-sglang git@github.com:Kh4L/NemoRL.git
+cd NemoRL
+export REPO_ROOT="$PWD"
+
+# Credentials (not shipped) — see §3.
+export HF_HOME=/your/hf/home HF_TOKEN=... WANDB_API_KEY=...
+
+# 2-node smoke / parity (generation-only, no training):
+SKIP_TRAINING=1 NUM_VLLM_REPLICAS=4 BACKEND=sglang \
+  bash examples/swe_bench/run_grpo_swe2_scale_gen.sh
+
+# Full convergence run on SGLang (16 nodes = 8 train + 8 gen, reproduces the baseline shape):
+NUM_VLLM_REPLICAS=32 BACKEND=sglang \
+  bash examples/swe_bench/run_grpo_swe2_scale_gen.sh
+```
+
+Expected: multi-turn rollouts complete (**8/8, contiguity failures 0**), **~193 gen tok/s**
+with full CUDA graph (≈ vLLM), and SGLang↔vLLM per-token logprobs agree to within the model's
+own bf16/MoE numerical noise floor (see §6).
+
+---
+
+## 1. What this is
+
+| Item | Value |
+|------|-------|
+| Algorithm | Async GRPO (non-colocated generation) |
+| Model | Qwen3-30B-A3B-Thinking-2507 (MoE, 30B total / ~3B active) |
+| Generation backend | **SGLang** (vLLM is the baseline; this recipe swaps it for SGLang) |
+| Init checkpoint | SWE1 `step_230_hf` (same as the vLLM baseline `dc3m70us`) |
+| Env | `swe_agents` (OpenHands agent inside an apptainer/singularity sandbox) |
+| Launcher | `examples/swe_bench/run_grpo_swe2_scale_gen.sh` (`BACKEND=sglang`) |
+| Scheduler | SLURM (`sbatch` + `ray.sub`), enroot+pyxis container runtime |
+
+The goal: make **SGLang** a usable generation backend for this multi-turn SWE-bench recipe,
+proven equivalent to vLLM at (a) rollout completeness, (b) throughput, and (c) per-token
+logprobs (which feed GRPO importance ratios).
+
+---
+
+## 2. Provenance — what's grafted vs. what's new
+
+This branch set is a **graft**, assembled so a single clone is runnable end-to-end:
+
+- **Base:** NeMo-RL `main` (already carries the *basic* SGLang backend).
+- **Grafted from [NVIDIA-NeMo/RL#2447](https://github.com/NVIDIA-NeMo/RL/pull/2447)
+  (`zhw/mxfp8_support`):** the *enhanced* SGLang backend our 30B-MoE recipe needs —
+  Megatron→SGLang weight-refit (`megatron_sglang_weight_iterator.py`), non-colocated
+  weight update, router, fault-tolerance, and the heavier `sglang_worker` / `sglang_generation`.
+  #2447 is an open, evolving PR; this branch **pins a known-good graft of it** so you don't
+  have to track #2447 yourself.
+- **New in this branch set (the genuinely novel work):**
+  1. **★ Gym-proxy token-splicing contiguity fix** (the load-bearing piece, in the Gym fork
+     `responses_api_models/vllm_model/app.py`). Multi-turn SGLang rollouts broke a hard
+     prefix-stability assert in `nemo_gym.py` on ~every tool-using turn (48/48 failures).
+     Fix: build each turn's prompt as `prompt_{K-1} + gen_{K-1}(verbatim) + delta_K`,
+     splicing the prior assistant's **exact sampled token IDs** instead of re-tokenizing
+     (`_build_sglang_prompt_ids`, `_update_sglang_session_seq`, `_sglang_followup_fragment_ids`).
+     Also: SGLang native `/generate` with `return_logprob=True`, and `skip_special_tokens=False`
+     so `</think>` (id 151668) survives the multi-turn re-feed.
+  2. **SWE2 SGLang launcher path** — `BACKEND=sglang` in `run_grpo_swe2_scale_gen.sh`.
+  3. **CUDA-graph perf** — full CUDA graph ON (piecewise off; it crashes on torch-2.10/sglang)
+     → **51 → ~193 tok/s, ≈ vLLM**.
+  4. **Refit OOM / NCCL-deadlock mitigations** — `mem_fraction_static=0.55`,
+     `NRL_REFIT_BUFFER_MEMORY_RATIO=0.018`, `pause_generation_mode=retract`.
+- **Parity instrumentation** (sentinel-gated, harmless when off): in-proxy + in-worker
+  teacher-force hooks used to *prove* logprob parity (§6). Not needed for training.
+
+Pinned SHAs: **RL `c88030f`**, **Gym `50586ec`** (auto-resolved as the submodule).
+
+---
+
+## 3. Prerequisites
+
+### 3.1 Cluster / runtime
+A SLURM cluster with **enroot + pyxis** (so `ray.sub` runs `srun --container-image` natively).
+Validated on CW-DFW (`cw-dfw-cs-001`, H100). 2 nodes for the smoke/parity run; 16 nodes for
+the full convergence run.
+
+### 3.2 Container (SWE training image — has the working hermes tool parser)
+```
+/lustre/fsw/portfolios/coreai/users/ruit/enroot-images/docker_images:ruit-swe_bench-6de99f772-x86_64-060326-mcore-apptainer.squashfs
+```
+Wired via `CONTAINER` (overridable). It bakes mcore + apptainer; the launcher overlays your
+clone's `nemo_rl/` and `3rdparty/Gym-workspace/Gym` over the baked copies, so **your checkout
+is what runs**.
+
+### 3.3 Shared assets (absolute, world-readable on CW-DFW Lustre)
+| Path | Purpose |
+|------|---------|
+| `…/bihu/repos/nemo-rl-async-swe/results/qwen3-30b-thinking-swe1-async-age1-…/step_230_hf` | init checkpoint |
+| `…/sdevare/repos/nano/dataset/rl/swe_all_datasets_train_w_agent_ref_r2e_gym_subset.jsonl` | train + val data |
+| `…/spanev/swe2-repro/qwen3_swe_chat_template.jinja` | SGLang chat template (Qwen3 thinking) |
+| per-instance `swebench_sweb.eval.x86_64.{instance_id}.sif` | SWE-bench sandbox images (resolved by `container_formatter` in the YAML) |
+
+The exact default paths are in the launcher (`MODEL_PATH`, `TRAIN_DATA_PATH`, `SGLANG_CHAT_TEMPLATE`);
+override via env if your copies live elsewhere.
+
+### 3.4 Credentials (export yourself — not shipped)
+`HF_HOME`, `HF_TOKEN` (gated model), `WANDB_API_KEY` (or `WANDB_MODE=offline`),
+optionally `GITHUB_TOKEN` / `GITLAB_TOKEN`.
+
+---
+
+## 4. How to run
+
+The single launcher is **`examples/swe_bench/run_grpo_swe2_scale_gen.sh`**. One knob,
+`NUM_VLLM_REPLICAS` (R), derives nodes / batch sizes so per-replica work is constant.
+
+| Mode | Command | Footprint |
+|------|---------|-----------|
+| **Smoke / parity** (gen-only, no train) | `SKIP_TRAINING=1 NUM_VLLM_REPLICAS=4 BACKEND=sglang bash …/run_grpo_swe2_scale_gen.sh` | 2 nodes (1 gen + 1 train no-op) |
+| **Full convergence** (reproduces baseline shape) | `NUM_VLLM_REPLICAS=32 BACKEND=sglang bash …/run_grpo_swe2_scale_gen.sh` | 16 nodes (8 train + 8 gen) |
+| **Preview only** | add `DRY_RUN=1` | none (prints derived config) |
+
+Job id is written to `${REPO_ROOT}/latest_scale_gen_job_id.txt`. Logs under
+`${REPO_ROOT}/logs/swe_bench_scale/`. wandb project `swe-benchmark`.
+
+### SGLang-specific env toggles
+| Var | Default | Effect |
+|-----|---------|--------|
+| `BACKEND` | `vllm` | set `sglang` to use the SGLang path |
+| `SGLANG_DISABLE_CUDA_GRAPH` | `false` | `true` disables full CUDA graph (slower; ~51 tok/s) |
+| `SGLANG_CHAT_TEMPLATE` | `…/qwen3_swe_chat_template.jinja` | Qwen3-thinking chat template |
+| `TEMPERATURE` | `1.0` | sampling temperature (recipe trains at 1.0) |
+| `UV_CACHE_DIR` | `/tmp/uv_cache` | set to **`/root/.cache/uv`** to reuse the container's prebuilt SGLang wheels and skip a ~40-min build |
+| `SBATCH_ACCOUNT` / `SBATCH_PARTITION` | `nemotron_agents_dev` / `backfill` | SLURM account / partition |
+
+### What `BACKEND=sglang` injects (the generation overrides)
+bf16; `dp=ep=pp=1` with TP via `num_gpus_per_engine=8`; `mem_fraction_static=0.55`;
+`disable_piecewise_cuda_graph=true` + `disable_cuda_graph=${SGLANG_DISABLE_CUDA_GRAPH}`;
+`tool_call_parser=hermes` + `reasoning_parser=qwen3-thinking`; the chat template;
+`pause_generation_mode=retract`; router disabled; and the Gym proxy switched to the SGLang
+engine path (`…vllm_model.engine=sglang`).
+
+---
+
+## 5. Expected results (port parity)
+
+| Run | CUDA graph | gen tok/s | rollout time | contiguity_fail |
+|---|---|---|---|---|
+| SGLang | OFF | ~51 | 30:29 | **0** (8/8) |
+| **SGLang** | **ON** (default) | **~193** | **13:34** | **0** (8/8) |
+| vLLM baseline | default | — | 10–16 min | 0 |
+
+Success markers (same as the vLLM baseline): non-zero `train:total_reward/mean` from step ~1,
+logged Gym responses contain real `function_call` items, resolved rate climbs toward ~8%.
+
+---
+
+## 6. (Optional) Reproduce the SGLang↔vLLM logprob parity
+
+The parity hooks are **committed and sentinel-gated** (no effect unless triggered), so a clean
+clone can regenerate the parity numbers. They teacher-force both engines on identical token IDs
+and compare per-token logprobs. Shared dir + scripts live under
+`/lustre/…/spanev/swe2-repro/parity/` (capture `rollouts.jsonl` + `compare_forced.py`).
+
+```bash
+P=/lustre/.../swe2-repro/parity        # capture + scripts + sentinels
+# Input: 52 recs / 27,493 tokens, derived deterministically from the 864-record capture
+#   (filter gen>0 & prompt+gen<=24k, shortest-first) -> rollouts_filtered16.jsonl
+
+# SGLang side: in-proxy /generate teacher-force
+touch $P/SGLANG_TF_TRIGGER
+SKIP_TRAINING=1 NUM_VLLM_REPLICAS=4 BACKEND=sglang MAX_NUM_STEPS=1 \
+  bash examples/swe_bench/run_grpo_swe2_scale_gen.sh     # -> forced_sglang.jsonl, SGLANG_TF_DONE
+
+# vLLM side: in-worker engine teacher-force (fires post-refit on weight update)
+touch $P/VLLM_ENGINE_TF_TRIGGER
+SKIP_TRAINING=1 NUM_VLLM_REPLICAS=4 BACKEND=vllm MAX_NUM_STEPS=1 \
+  bash examples/swe_bench/run_grpo_swe2_scale_gen.sh     # -> forced_vllm.jsonl, VLLM_ENGINE_TF_DONE
+
+python3 $P/compare_forced.py --sglang $P/forced_sglang.jsonl --vllm $P/forced_vllm.jsonl
+```
+
+**Result (teacher-forced, all 27,493 tokens, real post-refit weights):**
+
+| metric | value | read |
+|---|---|---|
+| median \|Δ logprob\| | **1.38e-3** | training-grade at the typical token |
+| p95 / p99 / max | 0.140 / 0.245 / 1.01 | real tail at high-entropy tokens |
+| top-K KL median | 1.75e-4 | negligible |
+| confident-token bucket [0,0.3) nats median | **3.6e-7** | engines essentially identical where confident |
+| within-engine baseline (SGLang sampled-vs-forced) median | **1.24e-3** | the model's own bf16/MoE noise floor |
+
+**Verdict:** cross-engine median (1.38e-3) ≈ within-engine noise floor (1.24e-3) — **vLLM differs
+from SGLang no more than SGLang differs from itself.** Swapping vLLM→SGLang is safe at the
+logprob level. (Re-validated from a fresh clone on 2026-06-25; numbers match to within run-to-run
+bf16/MoE noise.)
+
+Notes: vLLM's recipe server is chat-only (no `/v1/completions`), so its teacher-force runs
+**in-process inside the worker** (`vllm_worker_async.py`), fired **post-refit** (the engine boots
+with dummy weights and gets the real checkpoint via refit). SGLang's `/generate` supports
+`logprob_start_len`, so its side runs in the Gym proxy.
+
+---
+
+## 7. Gotchas (already handled, listed so you don't re-hit them)
+
+- **Multi-turn contiguity** — solved by the token-splice fix; do not "fix" it by re-tokenizing.
+- **CUDA graph** — full graph works and cuts rollout time ~2× (30:29 → 13:34, with gen throughput going from ~51 to ~193 tok/s); only *piecewise* crashes (kept off).
+- **Refit OOM / NCCL hang** — needs `mem_fraction_static=0.55` + refit bucket cap +
+  `pause_generation_mode=retract` (all baked into the launcher).
+- **Slow first run** — set `UV_CACHE_DIR=/root/.cache/uv` to reuse baked SGLang wheels.
+- **Recipe-managed engines are unreachable from the host** (pyxis container netns) — interact
+  only via the Gym proxy or in-worker hooks (which is what the parity instrumentation does).
+
+---
+
+## 8. Where things live
+| Thing | Location |
+|---|---|
+| This recipe (RL + launcher + parity hooks) | `Kh4L/NemoRL@swe2-qwen-sglang` (`c88030f`) |
+| Splice-fixed + parity-hooked Gym | `Kh4L/NemoGym@swe2-sglang-graft` (`50586ec`, submodule) |
+| vLLM baseline guide | [`REPRO_swe2.md`](./REPRO_swe2.md) |
+| Parity capture + scripts + results | DFW `/lustre/.../spanev/swe2-repro/parity/` |
+| Upstream home of the enhanced SGLang backend | [NVIDIA-NeMo/RL#2447](https://github.com/NVIDIA-NeMo/RL/pull/2447) |
diff --git a/examples/swe_bench/grpo_qwen3_30b_async_swe.yaml b/examples/swe_bench/grpo_qwen3_30b_async_swe.yaml
new file mode 100644
index 0000000000..3859b3083e
--- /dev/null
+++ b/examples/swe_bench/grpo_qwen3_30b_async_swe.yaml
@@ -0,0 +1,424 @@
+# ============================================================================
+# Async GRPO SWE RL Training: Qwen3-30B-A3B-Thinking-2507
+#
+# Model:      Qwen3-30B-A3B-Thinking-2507 (MoE, 30B total / 3B active, thinking)
+# Train data: R2E-Gym (r2e-gym subset, 4518 samples)
+# Eval data:  same R2E-Gym JSONL as train (data.validation.data_path == data.train.data_path)
+# Mode:       Async GRPO with non-colocated generation
+# Entry:      examples/nemo_gym/run_grpo_nemo_gym.py
+# Env:        swe_agents (OpenHands agent, singularity sandbox)
+#
+# Based on:   baseline/nemo-rl-qwen-swe/grpo_qwen3_30b_thinking_swe.yaml
+# Gym:        main branch (nemo-rl-async-swe repo)
+# ============================================================================
+
+checkpointing:
+  enabled: true
+  checkpoint_dir: "results/grpo-qwen3-30b-thinking-swe-rl"
+  metric_name: "train:total_reward/mean"
+  higher_is_better: true
+  keep_top_k: 100
+  save_period: 5
+  checkpoint_must_save_by: "00:03:35:00"
+  model_save_format: "safetensors"
+  save_consolidated: false
+  save_optimizer: true
+
+grpo:
+  num_prompts_per_step: 16
+  num_generations_per_prompt: 16
+  num_val_generations_per_prompt: 1
+  max_rollout_turns: 1
+  max_num_epochs: 100
+  max_num_steps: 1000000
+  normalize_rewards: true
+  use_leave_one_out_baseline: true
+  advantage_clip_low: -100
+  advantage_clip_high: 100
+  val_period: 10
+  val_at_start: false
+  val_at_end: false
+  overlong_filtering: true
+  max_val_samples: null
+  val_batch_size: 256
+  seed: 42
+  invalid_tool_call_strategy: ""
+
+  use_dynamic_sampling: false
+  dynamic_sampling_max_gen_batches: 10
+  batch_multiplier: 1
+
+  penalize_invalid_tool_call: true
+  invalid_tool_call_advantage: -5.0
+  penalize_malformed_thinking: true
+  malformed_thinking_advantage: -5.0
+
+  reward_shaping:
+    enabled: false
+    overlong_buffer_length: 128
+    overlong_buffer_penalty: 1
+    max_response_length: ${policy.max_total_sequence_length}
+    stop_properly_penalty_coef: null
+  reward_scaling:
+    enabled: false
+    source_min: 0.0
+    source_max: 1.0
+    target_min: 0.0
+    target_max: 1.0
+
+  async_grpo:
+    enabled: true
+    max_trajectory_age_steps: 1
+    in_flight_weight_updates: true
+    recompute_kv_cache_after_weight_updates: false
+
+  seq_logprob_error_threshold: 2
+
+loss_fn:
+  reference_policy_kl_penalty: 0.0
+  reference_policy_kl_type: "k3"
+  kl_input_clamp_value: null
+  kl_output_clamp_value: null
+  ratio_clip_min: 0.2
+  ratio_clip_max: 0.28
+  ratio_clip_c: null
+  use_on_policy_kl_approximation: true
+  use_importance_sampling_correction: true
+  truncated_importance_sampling_ratio: 5.0
+  truncated_importance_sampling_ratio_min: null
+  truncated_importance_sampling_type: tis
+  sequence_level_importance_ratios: false
+  token_level_loss: true
+  force_on_policy_ratio: true
+  use_kl_in_reward: false
+
+policy:
+  model_name: "/lustre/fsw/portfolios/llmservice/users/igitman/hf_models/Qwen3-30B-A3B-Thinking-2507"
+  tokenizer:
+    name: ${policy.model_name}
+    chat_template_kwargs:
+      enable_thinking: true
+  hf_config_overrides: {}
+  train_global_batch_size: 256
+  train_micro_batch_size: 1
+  generation_batch_size: 64
+  logprob_batch_size: 1
+  max_total_sequence_length: 131072
+  precision: "bfloat16"
+  logprob_chunk_size: 2048
+  offload_optimizer_for_logprob: false
+
+  dtensor_cfg:
+    _v2: true
+    enabled: false
+    cpu_offload: False
+    sequence_parallel: false
+    activation_checkpointing: false
+    tensor_parallel_size: 1
+    context_parallel_size: 1
+    custom_parallel_plan: null
+
+  megatron_cfg:
+    enabled: true
+    gradient_accumulation_fusion: false
+    empty_unused_memory_level: 1
+    activation_checkpointing: true
+    tensor_model_parallel_size: 2
+    expert_tensor_parallel_size: 1
+    expert_model_parallel_size: 8
+    pipeline_model_parallel_size: 2
+    num_layers_in_first_pipeline_stage: null
+    num_layers_in_last_pipeline_stage: null
+    context_parallel_size: 4
+    pipeline_dtype: ${policy.precision}
+    sequence_parallel: true
+    freeze_moe_router: true
+    moe_router_dtype: "fp32"
+    moe_router_load_balancing_type: "none"
+    moe_router_bias_update_rate: 1.0e-3
+    moe_permute_fusion: true
+    moe_enable_deepep: false
+    moe_token_dispatcher_type: "alltoall"
+    moe_aux_loss_coeff: 0.0
+    moe_router_enable_expert_bias: true
+    moe_shared_expert_overlap: false
+    apply_rope_fusion: True
+    bias_activation_fusion: False
+    defer_fp32_logits: True
+    moe_per_layer_logging: True
+
+    optimizer:
+      optimizer: "adam"
+      lr: 1.0e-6
+      min_lr: 1.0e-6
+      weight_decay: 0.0
+      bf16: true
+      fp16: false
+      params_dtype: "float32"
+      adam_beta1: 0.9
+      adam_beta2: 0.999
+      adam_eps: 1e-8
+      sgd_momentum: 0.9
+      use_distributed_optimizer: true
+      use_precision_aware_optimizer: true
+      clip_grad: ${policy.max_grad_norm}
+      optimizer_cpu_offload: false
+      optimizer_offload_fraction: 0.0
+
+    scheduler:
+      start_weight_decay: ${policy.megatron_cfg.optimizer.weight_decay}
+      end_weight_decay: ${policy.megatron_cfg.optimizer.weight_decay}
+      weight_decay_incr_style: "constant"
+      lr_decay_style: "constant"
+      lr_decay_iters: 1000000
+      lr_warmup_iters: 0
+      lr_warmup_init: 0
+
+    distributed_data_parallel_config:
+      grad_reduce_in_fp32: false
+      overlap_grad_reduce: false
+      overlap_param_gather: false
+      use_custom_fsdp: false
+      data_parallel_sharding_strategy: "optim_grads_params"
+
+    mtp_loss_scaling_factor: 0.0
+    mtp_use_repeated_layer: false
+    mtp_num_layers: 0
+    mtp_detach_heads: false
+
+    fp8_cfg:
+      enabled: false
+      fp8: "e4m3"
+      fp8_recipe: "blockwise"
+      fp8_param: false
+
+    env_vars: null
+
+  dynamic_batching:
+    enabled: False
+    train_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.train_micro_batch_size}}
+    logprob_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.logprob_batch_size}}
+    sequence_length_round: 64
+
+  sequence_packing:
+    enabled: True
+    train_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.train_micro_batch_size}}
+    logprob_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.logprob_batch_size}}
+    algorithm: "modified_first_fit_decreasing"
+    sequence_length_round: 64
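+    # Worked example (illustrative arithmetic, assuming the values in this file are not overridden):
+    #   train_mb_tokens   = max_total_sequence_length * train_micro_batch_size = 131072 * 1 = 131072
+    #   logprob_mb_tokens = max_total_sequence_length * logprob_batch_size     = 131072 * 1 = 131072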
+
+  make_sequence_length_divisible_by: 8
+  max_grad_norm: 1.0
+
+  optimizer: null
+  scheduler: null
+
+  generation:
+    port_range_low: 11001
+    port_range_high: 15000
+    backend: "vllm"
+    max_new_tokens: ${policy.max_total_sequence_length}
+    temperature: 1.0
+    top_p: 1.0
+    top_k: null
+    stop_token_ids: null
+    stop_strings: null
+    vllm_cfg:
+      enable_prefix_caching: true
+      async_engine: true
+      precision: ${policy.precision}
+      kv_cache_dtype: "auto"
+      tensor_parallel_size: 2
+      pipeline_parallel_size: 1
+      expert_parallel_size: 1
+      gpu_memory_utilization: 0.8
+      max_model_len: ${policy.max_total_sequence_length}
+      enforce_eager: False
+      enforce_monotonicity: false
+      use_deep_gemm: False
+      num_last_layers_in_bf16: 0
+      num_first_layers_in_bf16: 0
+      enable_vllm_metrics_logger: true
+      vllm_metrics_logger_interval: 0.5
+      expose_http_server: true
+      skip_tokenizer_init: false
+      enable_thinking: true
+      http_server_serving_chat_kwargs:
+        enable_auto_tools: true
+        tool_parser: hermes
+        reasoning_parser: deepseek_r1
+        chat_template: |
+          {%- if tools %}
+              {{- '<|im_start|>system\n' }}
+              {%- if messages[0].role == 'system' %}
+                  {{- messages[0].content + '\n\n' }}
+              {%- endif %}
+              {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
+              {%- for tool in tools %}
+                  {{- "\n" }}
+                  {{- tool | tojson }}
+              {%- endfor %}
+              {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
+          {%- else %}
+              {%- if messages[0].role == 'system' %}
+                  {{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }}
+              {%- endif %}
+          {%- endif %}
+          {%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
+          {%- for message in messages[::-1] %}
+              {%- set index = (messages|length - 1) - loop.index0 %}
+              {%- if ns.multi_step_tool and message.role == "user" and message.content is string and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}
+                  {%- set ns.multi_step_tool = false %}
+                  {%- set ns.last_query_index = index %}
+              {%- endif %}
+          {%- endfor %}
+          {%- for message in messages %}
+              {%- if message.content is string %}
+                  {%- set content = message.content %}
+              {%- else %}
+                  {%- set content = '' %}
+              {%- endif %}
+              {%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
+                  {{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }}
+              {%- elif message.role == "assistant" %}
+                  {%- set reasoning_content = '' %}
+                  {%- if message.reasoning_content is string %}
+                      {%- set reasoning_content = message.reasoning_content %}
+                  {%- else %}
+                      {%- if '</think>' in content %}
+                          {%- set reasoning_content = content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
+                          {%- set content = content.split('</think>')[-1].lstrip('\n') %}
+                      {%- endif %}
+                  {%- endif %}
+                  {%- if reasoning_content %}
+                      {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
+                  {%- else %}
+                      {{- '<|im_start|>' + message.role + '\n' + content }}
+                  {%- endif %}
+                  {%- if message.tool_calls %}
+                      {%- for tool_call in message.tool_calls %}
+                          {%- if (loop.first and content) or (not loop.first) %}
+                              {{- '\n' }}
+                          {%- endif %}
+                          {%- if tool_call.function %}
+                              {%- set tool_call = tool_call.function %}
+                          {%- endif %}
+                          {{- '<tool_call>\n{"name": "' }}
+                          {{- tool_call.name }}
+                          {{- '", "arguments": ' }}
+                          {%- if tool_call.arguments is string %}
+                              {{- tool_call.arguments }}
+                          {%- else %}
+                              {{- tool_call.arguments | tojson }}
+                          {%- endif %}
+                          {{- '}\n</tool_call>' }}
+                      {%- endfor %}
+                  {%- endif %}
+                  {{- '<|im_end|>\n' }}
+              {%- elif message.role == "tool" %}
+                  {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
+                      {{- '<|im_start|>user' }}
+                  {%- endif %}
+                  {{- '\n<tool_response>\n' }}
+                  {{- content }}
+                  {{- '\n</tool_response>' }}
+                  {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
+                      {{- '<|im_end|>\n' }}
+                  {%- endif %}
+              {%- endif %}
+          {%- endfor %}
+          {%- if add_generation_prompt %}
+              {{- '<|im_start|>assistant\n<think>\n' }}
+          {%- endif %}
+        default_chat_template_kwargs:
+          enable_thinking: true
+          truncate_history_thinking: false
+
+    vllm_kwargs:
+      mamba_ssm_cache_dtype: "float32"
+      compilation_config:
+        cudagraph_capture_sizes: [1,2,4,8,16,32,64]
+
+    colocated:
+      enabled: false
+      resources:
+        gpus_per_node: 8
+        num_nodes: 4
+
+data:
+  max_input_seq_length: null
+  shuffle: false
+  num_workers: 1
+  use_multiple_dataloader: false
+  train:
+    data_path: "/lustre/fsw/portfolios/llmservice/projects/llmservice_modelalignment_ppo/users/sdevare/repos/nano/dataset/rl/swe_all_datasets_train_w_agent_ref_r2e_gym_subset.jsonl"
+  validation:
+    data_path: "/lustre/fsw/portfolios/llmservice/projects/llmservice_modelalignment_ppo/users/sdevare/repos/nano/dataset/rl/swe_all_datasets_train_w_agent_ref_r2e_gym_subset.jsonl"
+  default:
+    dataset_name: NemoGymDataset
+    env_name: "nemo_gym"
+    prompt_file: null
+    system_prompt_file: null
+    processor: "nemo_gym_data_processor"
+
+env:
+  should_use_nemo_gym: true
+  should_log_nemo_gym_responses: false
+  nemo_gym:
+    skip_venv_if_present: true
+    port_range_low: 15001
+    port_range_high: 20000
+    config_paths:
+    - responses_api_models/vllm_model/configs/vllm_model_for_training.yaml
+    - responses_api_agents/swe_agents/configs/swebench_openhands_training.yaml
+    swe_agents_train:
+      responses_api_agents:
+        swe_agents:
+          agent_max_turns: 100
+          concurrency: 768
+          swebench_agent_timeout: 3600
+          run_with_mixed_prompts: true
+          dataset_path: ${data.train.data_path}
+          container_formatter:
+          - "/lustre/fsw/portfolios/llmservice/users/igitman/images/swe-bench/swebench_sweb.eval.x86_64.{instance_id}.sif"
+          - "/lustre/fsw/portfolios/llmservice/users/sdevare/swe_sweapro/images_train/sweap.{instance_id}.sif"
+          - "/lustre/fsw/portfolios/llmservice/users/igitman/images/swe-bench/namanjain12_{instance_id}.sif"
+          - "/lustre/fsw/portfolios/llmservice/users/sdevare/swe_sweapro/images_train/sweap.{instance_id}.sif"
+          - "/lustre/fsw/portfolios/llmservice/users/igitman/images/swe-bench/swebench_sweb.eval.x86_64{instance_id}.sif"
+    swe_agents_val:
+      responses_api_agents:
+        swe_agents:
+          agent_max_turns: 200
+          concurrency: 768
+          swebench_agent_timeout: 3600
+          dataset_path: ${data.validation.data_path}
+          container_formatter:
+          - "/lustre/fsw/portfolios/llmservice/users/igitman/images/swe-bench/swebench_sweb.eval.x86_64.{instance_id}.sif"
+          - "/lustre/fsw/portfolios/llmservice/users/sdevare/swe_sweapro/images_train/sweap.{instance_id}.sif"
+          - "/lustre/fsw/portfolios/llmservice/users/igitman/images/swe-bench/namanjain12_{instance_id}.sif"
+          - "/lustre/fsw/portfolios/llmservice/users/sdevare/swe_sweapro/images_train/sweap.{instance_id}.sif"
+          - "/lustre/fsw/portfolios/llmservice/users/igitman/images/swe-bench/swebench_sweb.eval.x86_64{instance_id}.sif"
+    use_absolute_ip: true
+
+logger:
+  log_dir: "logs"
+  num_val_samples_to_print: 0
+  wandb_enabled: true
+  tensorboard_enabled: false
+  mlflow_enabled: false
+  monitor_gpus: true
+  swanlab_enabled: false
+  wandb:
+    project: "ruit-nemo-rl"
+    name: "qwen3-30b-thinking-swe-rl"
+  tensorboard: {}
+  mlflow:
+    experiment_name: "qwen3-30b-thinking-swe-rl"
+    run_name: "qwen3-30b-thinking-swe-rl"
+  gpu_monitoring:
+    collection_interval: 10
+    flush_interval: 10
+
+cluster:
+  gpus_per_node: 8
+  num_nodes: 16
diff --git a/examples/swe_bench/run_grpo_repro_baseline_swe2.sh b/examples/swe_bench/run_grpo_repro_baseline_swe2.sh
new file mode 100644
index 0000000000..52e99a0217
--- /dev/null
+++ b/examples/swe_bench/run_grpo_repro_baseline_swe2.sh
@@ -0,0 +1,384 @@
+#!/bin/bash
+# ============================================================================
+# REPRO of baseline's SUCCESSFUL SWE2 run (wandb nvidia/binhu-nemo-rl/dc3m70us).
+#
+# Goal: reproduce the working setup that resolves ~8% from step 1, to confirm
+# the zero-reward issue was the container/vLLM (broken hermes tool parser), not
+# the model or config.
+#
+# Matched to dc3m70us:
+#   Code:       NeMo-RL @ commit a760f1c (current dir)
+#   Container:  ruit-swe_bench (mcore + apptainer; vLLM where the hermes tool
+#               parser works -> agent makes real tool calls)
+#   COMMAND:    baseline-style (uv run --frozen, NO --extra mcore,
+#               NRL_IGNORE_VERSION_MISMATCH=1, NEMO_GYM_SKIP_VENV_IF_PRESENT=1)
+#   Parallel:   TP=2, EP=8, CP=4, PP=2 (dc3m70us used TP=2, NOT TP=4)
+#   Model:      SWE1 step_230_hf
+#
+# Kept as ruit's: account, env source, cache, wandb project (swe-benchmark).
+#
+# Usage:  bash examples/swe_bench/run_grpo_repro_baseline_swe2.sh
+# ============================================================================
+
+set -e
+
+# ============================ Paths ============================
+# Auto-detected from this script's location (examples/swe_bench/), so the run
+# works from any clone of the repo. Override by exporting REPO_ROOT.
+REPO_ROOT="${REPO_ROOT:-$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)}"
+CONFIG_FILE="${REPO_ROOT}/examples/swe_bench/grpo_qwen3_30b_async_swe.yaml"
+CHECKPOINT_ROOT="${REPO_ROOT}/results"
+TRAIN_DATA_PATH="/lustre/fsw/portfolios/llmservice/projects/llmservice_modelalignment_ppo/users/sdevare/repos/nano/dataset/rl/swe_all_datasets_train_w_agent_ref_r2e_gym_subset.jsonl"
+VAL_DATA_PATH="${TRAIN_DATA_PATH}"
+# SWE1 step_230 HF checkpoint (exactly what dc3m70us trained from).
+DEFAULT_MODEL_PATH="/lustre/fsw/portfolios/coreai/users/bihu/repos/nemo-rl-async-swe/results/qwen3-30b-thinking-swe1-async-age1-pps64-gpp8-gbs512-lr1e-06/step_230_hf"
+MODEL_PATH="${1:-${MODEL_PATH:-${DEFAULT_MODEL_PATH}}}"
+
+# ================ Container and mount config ================
+# SWE training container (mcore + apptainer baked in, working hermes tool parser
+# so the agent emits real tool calls).
+export CONTAINER=${CONTAINER:-/lustre/fsw/portfolios/coreai/users/ruit/enroot-images/docker_images:ruit-swe_bench-6de99f772-x86_64-060326-mcore-apptainer.squashfs}
+GYM_CODE="${REPO_ROOT}/3rdparty/Gym-workspace/Gym"
+export MOUNTS="/lustre:/lustre,${REPO_ROOT}:${REPO_ROOT},${GYM_CODE}:/opt/nemo-rl/3rdparty/Gym-workspace/Gym"
+
+# ======================= Cluster / resources =======================
+# SKIP_TRAINING=1 -> generation-only benchmark: training is a no-op (no optimizer,
+# weights frozen, refit every step + keep-alive matmul), pinned to ONE node.
+SKIP_TRAINING="${SKIP_TRAINING:-0}"
+NUM_ACTOR_NODES=${NUM_NODES:-16}
+NUM_GENERATION_NODES=${NUM_GEN_NODES:-8}   # only used in async (non-colocated) mode
+if [ "${SKIP_TRAINING}" = "1" ]; then
+  # no real training -> 1 training node suffices (train_nodes = total - gen).
+  NUM_ACTOR_NODES=$(( NUM_GENERATION_NODES + 1 ))
+fi
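+# Worked example (illustrative arithmetic, assuming NUM_NODES / NUM_GEN_NODES are not overridden):
+#   normal run      : 16 total nodes - 8 generation nodes = 8 training nodes
+#   SKIP_TRAINING=1 : total = 8 generation nodes + 1 = 9 nodes, leaving 1 training node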
+NUM_GPU=8
+export GPUS_PER_NODE=${NUM_GPU}
+export CPUS_PER_WORKER=114
+
+# ============================ Parallelism ============================
+# SKIP_TRAINING -> training must fit 1 node, so model_parallel = TP*CP*PP <= 8.
+if [ "${SKIP_TRAINING}" = "1" ]; then
+  TP=8; EP=8; CP=1; PP=1     # model_parallel=8 (fits 1 node), train_DP=1
+else
+  TP=2; EP=8; CP=4; PP=2     # dc3m70us-style real training (TP=2, not TP=4)
+fi
+VLLM_TP=2
+MIN_PAD=1
+if [ ${CP} -gt 1 ]; then MIN_PAD=$((MIN_PAD * CP * 2)); fi
+if [ ${TP} -gt 1 ]; then MIN_PAD=$((MIN_PAD * TP)); fi
+MAKE_SEQ_DIVISIBLE_BY=${MIN_PAD}
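+# Worked example of the padding rule above (illustrative values, not read from this script):
+#   TP=2, CP=4 -> MIN_PAD = 1 * (CP * 2) * TP = 1 * 8 * 2 = 16
+#   TP=8, CP=1 -> MIN_PAD = 1 * TP = 8   (the CP factor is skipped when CP == 1)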
+
+# ===================== Sequence length & packing =====================
+SEQLEN=131072
+SEQUENCE_PACKING=True
+
+# ================= Sync/Async mode & async GRPO settings =================
+ASYNC_GRPO_ENABLED=True
+MAX_TRAJECTORY_AGE_STEPS=1
+FORCE_ON_POLICY_RATIO=True
+INFLIGHT_WEIGHT_UPDATE=True
+RECOMPUTE_KV_CACHE_AFTER_WEIGHT_UPDATES=False
+SEQ_LOGPROB_ERROR_THRESHOLD=null
+if [ "${ASYNC_GRPO_ENABLED}" = "True" ]; then
+  COLOCATED_ENABLED=False
+  VLLM_GPU_UTIL=0.8
+  OVERLAP_GRAD_REDUCE=False
+  ADVANTAGE_CLIP_LOW=-100
+  ADVANTAGE_CLIP_HIGH=100
+  TIS_THRESHOLD=5
+else
+  COLOCATED_ENABLED=True
+  VLLM_GPU_UTIL=0.5
+  OVERLAP_GRAD_REDUCE=True
+fi
+
+# ========================= GRPO / sampling =========================
+PPS=8
+GPP=8
+GBS=64
+NORMALIZE_REWARDS=True
+OVERLONG_FILTERING=True
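+# Sanity arithmetic (assuming each step trains on every generated sample):
+#   rollouts per step = PPS * GPP = 8 * 8 = 64 = GBS, i.e. one global batch per step.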
+
+# ========================== Loss function ==========================
+KL=0
+CLIP_MIN=0.2
+CLIP_MAX=0.28
+USE_ON_POLICY_KL_APPROXIMATION=True
+IMPORTANCE_SAMPLING_CORRECTION=True
+SEQ_LEVEL_IS=False
+TOKEN_LEVEL_LOSS=True
+
+# ============================ Optimizer ============================
+LR="1e-06"
+
+# =============================== MoE ===============================
+MOE_FREEZE_ROUTER=True
+MOE_PERMUTE_FUSION=True
+MOE_ENABLE_DEEPEP=False
+MOE_TOKEN_DISPATCHER_TYPE="alltoall"
+MOE_AUX_LOSS_COEFF=0
+MOE_ROUTER_LOAD_BALANCING_TYPE="none"
+MOE_ROUTER_BIAS_UPDATE_RATE="1e-3"
+
+# ======================= Generation / vLLM =======================
+TEMPERATURE=1.0
+
+# =================== Checkpointing & validation ===================
+SAVE_PERIOD=5
+VAL_PERIOD=1000
+KEEP_TOP_K=2
+
+# ============================ SWE agent ============================
+AGENT_MAX_TURNS=200
+AGENT_TIMEOUT=1800
+
+# ============================== Logging ==============================
+WANDB_PROJ="swe-benchmark"
+# Log full trajectories to wandb so we can verify function_call items appear.
+LOG_GYM_RESPONSES=true
+
+# ========================= SLURM submission =========================
+SBATCH_ACCOUNT="nemotron_sw_post"
+SBATCH_PARTITION="batch"
+SBATCH_TIME="4:0:0"
+
+# ========================= Experiment naming =========================
+if [ "${ASYNC_GRPO_ENABLED}" = "True" ]; then
+  SYNC_MODE="async-age${MAX_TRAJECTORY_AGE_STEPS}"
+else
+  SYNC_MODE="sync"
+fi
+MODE_TAG=""
+if [ "${SKIP_TRAINING}" = "1" ]; then MODE_TAG="notrain-"; fi
+EXP_SUFFIX="${EXP_SUFFIX:-repro-baseline-swe2-${MODE_TAG}${SYNC_MODE}-pps${PPS}-gpp${GPP}-gbs${GBS}-lr${LR}-tp${TP}}"
+WANDB_NAME="${EXP_SUFFIX}"
+CHECKPOINT_DIR="${CHECKPOINT_ROOT}/${EXP_SUFFIX}"
+SNAPSHOT_DIR="${REPO_ROOT}"
+
+mkdir -p "${CHECKPOINT_DIR}"
+
+# ============= Unified SLURM/Ray log location =============
+export BASE_LOG_DIR="${BASE_LOG_DIR:-${SNAPSHOT_DIR}/logs/slurm}"
+mkdir -p "${BASE_LOG_DIR}"
+
+# ========================= Environment variables =========================
+# NOTE: credentials have been removed from this shared copy. Before running,
+# export the following yourself (e.g. from your own env script):
+#   HF_HOME=...   HF_TOKEN=...   (HuggingFace cache + token)
+export HF_DATASETS_CACHE="${HF_DATASETS_CACHE:-${HF_HOME:?export HF_HOME before running this script}/datasets}"
+export UV_CACHE_DIR=/tmp/uv_cache
+export UV_LOCK_TIMEOUT=3600
+export RAY_DEDUP_LOGS=1
+export SSL_CERT_FILE=/etc/ssl/certs/ca-certificates.crt
+export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt
+export CURL_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt
+export OMP_NUM_THREADS=16
+
+# ========================= Node-local cache config =========================
+# Defaults under your own $HOME so you don't write into anyone else's dir.
+PERSISTENT_CACHE="${PERSISTENT_CACHE:-${HOME}/.cache/qwen3_30b_thinking_swe_repro_baseline}"
+export LUSTRE_VLLM_CACHE="${PERSISTENT_CACHE}/vllm_compile_cache"
+export LUSTRE_INDUCTOR_CACHE="${PERSISTENT_CACHE}/inductor_cache"
+export LUSTRE_TRITON_CACHE="${PERSISTENT_CACHE}/triton_cache"
+export NRL_VLLM_LOCAL_CACHE_DIR="/tmp/nemo_rl_vllm_cache"
+export NRL_VLLM_CACHE_SEED_DIR="/tmp/nemo_rl_vllm_cache_warm"
+export INDUCTOR_CACHE_DIR="/tmp/nemo_rl_inductor_cache"
+export TRITON_CACHE_DIR="/tmp/nemo_rl_triton_cache"
+export CACHE_SYNC_FREQUENCY=120
+mkdir -p "${LUSTRE_VLLM_CACHE}" "${LUSTRE_INDUCTOR_CACHE}" "${LUSTRE_TRITON_CACHE}"
+
+# ============================== Summary ==============================
+echo "=========================================="
+echo "REPRO of baseline dc3m70us | Experiment: ${EXP_SUFFIX}"
+echo "Container: ruit-swe_bench (mcore+apptainer) | SKIP_TRAINING=${SKIP_TRAINING}"
+echo "Mode: ${SYNC_MODE}, Colocated: ${COLOCATED_ENABLED}"
+echo "Nodes: ${NUM_ACTOR_NODES} total (train=$(( NUM_ACTOR_NODES - NUM_GENERATION_NODES )), gen=${NUM_GENERATION_NODES}), GPUs/node: ${NUM_GPU}"
+echo "Parallelism: TP=${TP}, EP=${EP}, CP=${CP}, PP=${PP}, vLLM_TP=${VLLM_TP}, pad=${MAKE_SEQ_DIVISIBLE_BY}"
+echo "Training: PPS=${PPS}, GPP=${GPP}, GBS=${GBS}, LR=${LR}"
+echo "Model: ${MODEL_PATH}"
+echo "Checkpoint: ${CHECKPOINT_DIR}"
+echo "=========================================="
+
+cd "${SNAPSHOT_DIR}"
+
+# ================ SETUP_COMMAND (baseline's: install apptainer + seed caches + uv sync) ================
+read -r -d '' SETUP_COMMAND <<SETUPEOF || true
+echo "[SETUP] Installing apptainer for SWE sandbox..."
+apt-get update && apt-get install -y git build-essential gcc wget 2>/dev/null || true
+RET=1
+RETRIES=3
+for attempt in \$(seq 1 \$RETRIES); do
+  if command -v apptainer >/dev/null 2>&1 || command -v singularity >/dev/null 2>&1; then
+    echo "[SETUP] singularity/apptainer already available"
+    RET=0
+    break
+  fi
+  cd /tmp && \
+  wget --no-check-certificate -q https://github.com/apptainer/apptainer/releases/download/v1.3.1/apptainer_1.3.1_amd64.deb && \
+  apt install -y ./apptainer_1.3.1_amd64.deb && \
+  ln -sf /usr/bin/apptainer /usr/bin/singularity
+  if command -v apptainer >/dev/null 2>&1; then
+    echo "[SETUP] apptainer installed successfully"
+    RET=0
+    break
+  fi
+  echo "[SETUP] apptainer install attempt \$attempt failed, retrying..."
+  sleep 10
+done
+if [ \$RET -ne 0 ]; then
+  echo "[SETUP] WARNING: apptainer installation failed after \$RETRIES attempts"
+fi
+
+echo "[CACHE SEED] Clearing stale /tmp caches and seeding from Lustre..."
+rm -rf /tmp/nemo_rl_vllm_cache /tmp/nemo_rl_vllm_cache_*
+rm -rf "${INDUCTOR_CACHE_DIR}" "${TRITON_CACHE_DIR}"
+mkdir -p "${INDUCTOR_CACHE_DIR}" "${TRITON_CACHE_DIR}"
+
+find "${LUSTRE_INDUCTOR_CACHE}" -maxdepth 1 -name '.tmp_*' -mmin +30 -exec rm -rf {} + 2>/dev/null || true
+find "${LUSTRE_TRITON_CACHE}" -maxdepth 1 -name '.tmp_*' -mmin +30 -exec rm -rf {} + 2>/dev/null || true
+
+_seed_cache() {
+  local lustre="\$1" local_dir="\$2" name="\$3"
+  if [ -d "\$lustre" ] && [ "\$(ls -A "\$lustre" 2>/dev/null)" ]; then
+    rsync -a --exclude '.tmp_*' "\$lustre/" "\$local_dir/" 2>/dev/null \
+      && echo "[CACHE SEED] \$name: seeded from Lustre" \
+      || echo "[CACHE SEED] \$name: seed failed (non-fatal)"
+  else
+    echo "[CACHE SEED] \$name: no warm cache on Lustre yet"
+  fi
+}
+
+_seed_cache "${LUSTRE_INDUCTOR_CACHE}" "${INDUCTOR_CACHE_DIR}" "Inductor"
+_seed_cache "${LUSTRE_TRITON_CACHE}" "${TRITON_CACHE_DIR}" "Triton"
+echo "[CACHE SEED] Done."
+
+UV_HTTP_TIMEOUT=3600 \
+  uv sync --frozen --extra mcore
+SETUPEOF
+export SETUP_COMMAND
+
+# ================ Training command (baseline-style: uv run --frozen, no --extra mcore) ================
+export COMMAND="NRL_VLLM_USE_V1=1 \
+  NRL_WG_USE_RAY_REF=1 \
+  HF_HOME=${HF_HOME} \
+  HF_DATASETS_CACHE=${HF_DATASETS_CACHE} \
+  UV_CACHE_DIR=${UV_CACHE_DIR} \
+  VLLM_ATTENTION_BACKEND=FLASH_ATTN \
+  VLLM_CACHE_ROOT=${LUSTRE_VLLM_CACHE} \
+  DG_JIT_CACHE_DIR=${LUSTRE_VLLM_CACHE}/deep_gemm \
+  VLLM_DEEP_GEMM_WARMUP=skip \
+  NRL_FORCE_REBUILD_VENVS=false \
+  NRL_IGNORE_VERSION_MISMATCH=1 \
+  RAY_ENABLE_UV_RUN_RUNTIME_ENV=0 \
+  UV_HTTP_TIMEOUT=3600 \
+  UV_LOCK_TIMEOUT=900 \
+  TORCH_CUDA_ARCH_LIST='9.0 10.0' \
+  NEMO_GYM_SKIP_VENV_IF_PRESENT=1 \
+  uv run --frozen ./examples/nemo_gym/run_grpo_nemo_gym.py \
+  --config=${CONFIG_FILE} \
+  cluster.num_nodes=${NUM_ACTOR_NODES} \
+  cluster.gpus_per_node=${NUM_GPU} \
+  ++data.train.data_path=${TRAIN_DATA_PATH} \
+  ++data.validation.data_path=${VAL_DATA_PATH} \
+  grpo.num_prompts_per_step=${PPS} \
+  grpo.num_generations_per_prompt=${GPP} \
+  grpo.val_at_start=False \
+  grpo.normalize_rewards=${NORMALIZE_REWARDS} \
+  grpo.overlong_filtering=${OVERLONG_FILTERING} \
+  grpo.val_period=${VAL_PERIOD} \
+  grpo.seq_logprob_error_threshold=${SEQ_LOGPROB_ERROR_THRESHOLD} \
+  grpo.async_grpo.enabled=${ASYNC_GRPO_ENABLED} \
+  grpo.async_grpo.in_flight_weight_updates=${INFLIGHT_WEIGHT_UPDATE} \
+  grpo.async_grpo.recompute_kv_cache_after_weight_updates=${RECOMPUTE_KV_CACHE_AFTER_WEIGHT_UPDATES} \
+  grpo.async_grpo.max_trajectory_age_steps=${MAX_TRAJECTORY_AGE_STEPS} \
+  env.should_log_nemo_gym_responses=${LOG_GYM_RESPONSES} \
+  policy.generation.colocated.enabled=${COLOCATED_ENABLED} \
+  policy.model_name=${MODEL_PATH} \
+  policy.max_total_sequence_length=${SEQLEN} \
+  policy.dynamic_batching.enabled=False \
+  policy.train_global_batch_size=${GBS} \
+  policy.make_sequence_length_divisible_by=${MAKE_SEQ_DIVISIBLE_BY} \
+  policy.offload_optimizer_for_logprob=true \
+  policy.sequence_packing.enabled=${SEQUENCE_PACKING} \
+  policy.megatron_cfg.tensor_model_parallel_size=${TP} \
+  policy.megatron_cfg.expert_model_parallel_size=${EP} \
+  policy.megatron_cfg.context_parallel_size=${CP} \
+  policy.megatron_cfg.pipeline_model_parallel_size=${PP} \
+  policy.megatron_cfg.sequence_parallel=True \
+  policy.megatron_cfg.bias_activation_fusion=False \
+  policy.megatron_cfg.distributed_data_parallel_config.overlap_grad_reduce=${OVERLAP_GRAD_REDUCE} \
+  policy.megatron_cfg.moe_permute_fusion=${MOE_PERMUTE_FUSION} \
+  policy.megatron_cfg.moe_enable_deepep=${MOE_ENABLE_DEEPEP} \
+  policy.megatron_cfg.moe_token_dispatcher_type=${MOE_TOKEN_DISPATCHER_TYPE} \
+  policy.megatron_cfg.moe_aux_loss_coeff=${MOE_AUX_LOSS_COEFF} \
+  policy.megatron_cfg.moe_router_load_balancing_type=${MOE_ROUTER_LOAD_BALANCING_TYPE} \
+  policy.megatron_cfg.moe_router_bias_update_rate=${MOE_ROUTER_BIAS_UPDATE_RATE} \
+  policy.megatron_cfg.freeze_moe_router=${MOE_FREEZE_ROUTER} \
+  policy.megatron_cfg.optimizer.lr=${LR} \
+  policy.megatron_cfg.optimizer.min_lr=${LR} \
+  policy.megatron_cfg.optimizer.weight_decay=0 \
+  policy.megatron_cfg.empty_unused_memory_level=2 \
+  policy.megatron_cfg.activation_checkpointing=True \
+  policy.generation.temperature=${TEMPERATURE} \
+  policy.generation.vllm_cfg.tensor_parallel_size=${VLLM_TP} \
+  policy.generation.vllm_cfg.gpu_memory_utilization=${VLLM_GPU_UTIL} \
+  policy.generation.vllm_cfg.skip_tokenizer_init=False \
+  loss_fn.reference_policy_kl_penalty=${KL} \
+  loss_fn.ratio_clip_min=${CLIP_MIN} \
+  loss_fn.ratio_clip_max=${CLIP_MAX} \
+  loss_fn.use_on_policy_kl_approximation=${USE_ON_POLICY_KL_APPROXIMATION} \
+  loss_fn.use_importance_sampling_correction=${IMPORTANCE_SAMPLING_CORRECTION} \
+  loss_fn.sequence_level_importance_ratios=${SEQ_LEVEL_IS} \
+  loss_fn.token_level_loss=${TOKEN_LEVEL_LOSS} \
+  loss_fn.force_on_policy_ratio=${FORCE_ON_POLICY_RATIO} \
+  checkpointing.checkpoint_dir=${CHECKPOINT_DIR} \
+  checkpointing.save_period=${SAVE_PERIOD} \
+  checkpointing.keep_top_k=${KEEP_TOP_K} \
+  ++checkpointing.metric_name=train:total_reward/mean \
+  ++checkpointing.checkpoint_must_save_by=00:03:35:00 \
+  logger.wandb_enabled=True \
+  logger.wandb.name=${WANDB_NAME} \
+  logger.wandb.project=${WANDB_PROJ}"
+
+if [ "${ASYNC_GRPO_ENABLED}" = "True" ]; then
+  export COMMAND="${COMMAND} \
+  policy.generation.colocated.resources.num_nodes=${NUM_GENERATION_NODES} \
+  policy.generation.colocated.resources.gpus_per_node=${NUM_GPU} \
+  grpo.advantage_clip_low=${ADVANTAGE_CLIP_LOW} \
+  grpo.advantage_clip_high=${ADVANTAGE_CLIP_HIGH} \
+  loss_fn.truncated_importance_sampling_ratio=${TIS_THRESHOLD} \
+  env.nemo_gym.swe_agents_train.responses_api_agents.swe_agents.agent_max_turns=${AGENT_MAX_TURNS} \
+  env.nemo_gym.swe_agents_train.responses_api_agents.swe_agents.swebench_agent_timeout=${AGENT_TIMEOUT} \
+  env.nemo_gym.swe_agents_val.responses_api_agents.swe_agents.agent_max_turns=${AGENT_MAX_TURNS} \
+  env.nemo_gym.swe_agents_val.responses_api_agents.swe_agents.swebench_agent_timeout=${AGENT_TIMEOUT}"
+fi
+
+# Generation-only benchmark: no-op training (no optimizer) + disable checkpoint saving.
+if [ "${SKIP_TRAINING}" = "1" ]; then
+  export COMMAND="${COMMAND} ++grpo.gen_benchmark_skip_training=true checkpointing.enabled=false"
+fi
+
+# ================ Submit job ================
+sbatch \
+  --nodes="${NUM_ACTOR_NODES}" \
+  --account="${SBATCH_ACCOUNT}" \
+  --job-name="${WANDB_NAME}" \
+  --partition="${SBATCH_PARTITION}" \
+  --time="${SBATCH_TIME}" \
+  --gres=gpu:${NUM_GPU} \
+  --output="${BASE_LOG_DIR}/slurm-%j.out" \
+  --exclusive \
+  --dependency=singleton \
+  --comment='{"OccupiedIdleGPUsJobReaper":{"exemptIdleTimeMins":"180","reason":"data_loading","description":"Async GRPO SWE2 repro of baseline dc3m70us"}}' \
+  ray.sub | tee /dev/stderr | grep -o '[0-9]\+' > latest_repro_baseline_job_id.txt
+
+JOB_ID="$(cat latest_repro_baseline_job_id.txt)"
+echo "=========================================="
+echo "Job submitted: ${EXP_SUFFIX}"
+echo "Job ID: ${JOB_ID}"
+echo "Monitor with: squeue -j ${JOB_ID}"
+echo "Ray/SLURM logs: ${BASE_LOG_DIR}/${JOB_ID}-logs/"
+echo "Checkpoints: ${CHECKPOINT_DIR}/"
+echo "=========================================="
+
+cd - > /dev/null
diff --git a/examples/swe_bench/run_grpo_swe2_scale_gen.sh b/examples/swe_bench/run_grpo_swe2_scale_gen.sh
new file mode 100644
index 0000000000..23533985de
--- /dev/null
+++ b/examples/swe_bench/run_grpo_swe2_scale_gen.sh
@@ -0,0 +1,502 @@
+#!/bin/bash
+# ============================================================================
+# GENERATION-SCALING launcher for async SWE GRPO (derived from
+# run_grpo_repro_baseline_swe2.sh / bihu dc3m70us).
+#
+# Single knob:  NUM_VLLM_REPLICAS (R)  -> number of vLLM generation replicas.
+# Everything else is auto-derived to hold these invariants constant so that
+# runs at different R are directly comparable:
+#   - per generation-replica workload : samples/replica/step = 2
+#   - per train-DP-rank workload      : GBS / train_DP       = 32
+#   - train:gen node ratio            : 1:1 (matches the bihu 8+8 baseline)
+#
+# Derivation (REPLICAS_PER_NODE = gpus_per_node / VLLM_TP = 8/2 = 4):
+#   GEN_NODES   = R / 4
+#   TRAIN_NODES = R / 4                 (linear follow; override with TRAIN_NODES=)
+#   TOTAL_NODES = TRAIN_NODES+GEN_NODES = R/2   -> sbatch --nodes & cluster.num_nodes
+#   PPS         = 2*R / GPP             = R/4
+#   GBS         = PPS*GPP               = 2*R   (force_on_policy_ratio requires ==)
+#   CONCURRENCY = max(768, GBS*age)
+# R must be a multiple of 16 (train world = 2R must satisfy Megatron model-parallel &
+# expert-parallel divisibility; gen must fill whole nodes). With SKIP_TRAINING=1, a multiple of 4.
+# R=32 exactly reproduces the bihu repro (16 nodes = 8+8, PPS=8, GBS=64).
+#
+# All runs of this sweep share one wandb group (WANDB_GROUP) under project
+# swe-benchmark for easy comparison.
+#
+# Usage:
+#   NUM_VLLM_REPLICAS=64 bash examples/swe_bench/run_grpo_swe2_scale_gen.sh
+#   NUM_VLLM_REPLICAS=64 DRY_RUN=1 bash examples/swe_bench/run_grpo_swe2_scale_gen.sh   # print config, no submit
+#   SKIP_TRAINING=1 NUM_VLLM_REPLICAS=4 bash examples/swe_bench/run_grpo_swe2_scale_gen.sh  # generation-only (no-op training on 1 node; R must be a multiple of 4)
+# Optional env: SKIP_TRAINING, DRY_RUN, PERSISTENT, BACKEND, TRAIN_NODES, WANDB_GROUP, EXP_SUFFIX,
+#               MODEL_PATH, CONTAINER, TEMPERATURE, MAX_NUM_STEPS, SBATCH_TIME, PERSISTENT_CACHE, BASE_LOG_DIR
+# Credentials are NOT sourced here — export HF_HOME / HF_TOKEN / WANDB_API_KEY yourself.
+# ============================================================================
+
+set -e
+
+# ============================ Paths ============================
+# Auto-detected from this script's location (examples/swe_bench/), so it works from
+# any clone of the repo. Override by exporting REPO_ROOT.
+REPO_ROOT="${REPO_ROOT:-$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)}"
+CONFIG_FILE="${REPO_ROOT}/examples/swe_bench/grpo_qwen3_30b_async_swe.yaml"
+CHECKPOINT_ROOT="${REPO_ROOT}/results"
+TRAIN_DATA_PATH="/lustre/fsw/portfolios/llmservice/projects/llmservice_modelalignment_ppo/users/sdevare/repos/nano/dataset/rl/swe_all_datasets_train_w_agent_ref_r2e_gym_subset.jsonl"
+VAL_DATA_PATH="${TRAIN_DATA_PATH}"
+# SWE1 step_230 HF checkpoint (exactly what dc3m70us trained from).
+DEFAULT_MODEL_PATH="/lustre/fsw/portfolios/coreai/users/bihu/repos/nemo-rl-async-swe/results/qwen3-30b-thinking-swe1-async-age1-pps64-gpp8-gbs512-lr1e-06/step_230_hf"
+MODEL_PATH="${1:-${MODEL_PATH:-${DEFAULT_MODEL_PATH}}}"
+
+# ================ Container and mount config ================
+# SWE training container (mcore + apptainer, working hermes tool parser).
+export CONTAINER=${CONTAINER:-/lustre/fsw/portfolios/coreai/users/ruit/enroot-images/docker_images:ruit-swe_bench-6de99f772-x86_64-060326-mcore-apptainer.squashfs}
+GYM_CODE="${REPO_ROOT}/3rdparty/Gym-workspace/Gym"
+export MOUNTS="/lustre:/lustre,$PWD:$PWD,${GYM_CODE}:/opt/nemo-rl/3rdparty/Gym-workspace/Gym,$PWD/nemo_rl:/opt/nemo-rl/nemo_rl"
+
+# ======================= Cluster / resources =======================
+NUM_GPU=8
+export GPUS_PER_NODE=${NUM_GPU}
+export CPUS_PER_WORKER=114
+
+# ============================ Parallelism ============================
+# SKIP_TRAINING=1 -> generation-only benchmark: training is a no-op on a SINGLE node
+# (no optimizer, weights frozen, refit every step + keep-alive matmul). Training
+# parallelism must fit 1 node, so model_parallel = TP*CP*PP must divide gpus_per_node(=8).
+SKIP_TRAINING="${SKIP_TRAINING:-0}"
+if [ "${SKIP_TRAINING}" = "1" ]; then
+  TP=8; EP=8; CP=1; PP=1; ETP=1     # model_parallel = 8 (fits 1 node), train_DP=1
+else
+  TP=4; EP=8; CP=4; PP=2; ETP=1     # linear-train default (model_parallel=32)
+fi
+VLLM_TP=2
+BACKEND="${BACKEND:-vllm}"   # vllm | sglang
+SGLANG_CHAT_TEMPLATE="${SGLANG_CHAT_TEMPLATE:-/lustre/fsw/portfolios/llmservice/users/spanev/swe2-repro/qwen3_swe_chat_template.jinja}"
+MIN_PAD=1
+if [ ${CP} -gt 1 ]; then MIN_PAD=$((MIN_PAD * CP * 2)); fi
+if [ ${TP} -gt 1 ]; then MIN_PAD=$((MIN_PAD * TP)); fi
+MAKE_SEQ_DIVISIBLE_BY=${MIN_PAD}
+
+# ================= Generation-scaling: derive all sizes from R =================
+GPP=8                                            # generations per prompt (fixed)
+SAMPLES_PER_REPLICA=2                             # invariant: samples/replica/step
+BASE_CONCURRENCY=768                              # nemo-gym fan-out floor
+REPLICAS_PER_NODE=$(( NUM_GPU / VLLM_TP ))        # = 4
+MODEL_PARALLEL=$(( TP * CP * PP ))                # = 32 (8 with SKIP_TRAINING=1)
+EXPERT_TMP=$(( ETP * EP * PP ))                   # = 16 (8 with SKIP_TRAINING=1)
+
+NUM_VLLM_REPLICAS="${NUM_VLLM_REPLICAS:-}"
+if [ -z "${NUM_VLLM_REPLICAS}" ]; then
+  echo "ERROR: NUM_VLLM_REPLICAS is required (number of vLLM replicas). e.g. NUM_VLLM_REPLICAS=64" >&2
+  exit 1
+fi
+
+# Smallest valid step for R.
+gcd() { local a=$1 b=$2 t; while [ ${b} -ne 0 ]; do t=${b}; b=$(( a % b )); a=${t}; done; echo ${a}; }
+lcm() { echo $(( $1 / $(gcd $1 $2) * $2 )); }
+if [ "${SKIP_TRAINING}" = "1" ]; then
+  # train fixed at 1 node (train_world=8, divisible by model_parallel=8); only gen
+  # must fill whole nodes -> R need only be a multiple of REPLICAS_PER_NODE (=4).
+  R_STEP=${REPLICAS_PER_NODE}
+else
+  # linear train: train_world=2R must be divisible by model-parallel & expert sizes.
+  L=$(lcm ${MODEL_PARALLEL} ${EXPERT_TMP})          # train-world divisor
+  R_STEP_TRAIN=$(( L / $(gcd 2 ${L}) ))             # since train_world = 2R
+  R_STEP=$(lcm ${R_STEP_TRAIN} ${REPLICAS_PER_NODE})
+fi
+if [ $(( NUM_VLLM_REPLICAS % R_STEP )) -ne 0 ] || [ ${NUM_VLLM_REPLICAS} -lt ${R_STEP} ]; then
+  echo "ERROR: NUM_VLLM_REPLICAS must be a positive multiple of ${R_STEP} (got ${NUM_VLLM_REPLICAS})." >&2
+  exit 1
+fi
+
+GEN_NODES=$(( NUM_VLLM_REPLICAS / REPLICAS_PER_NODE ))
+if [ "${SKIP_TRAINING}" = "1" ]; then
+  TRAIN_NODES="${TRAIN_NODES:-1}"                 # no-op training: single node
+else
+  TRAIN_NODES="${TRAIN_NODES:-${GEN_NODES}}"      # linear 1:1 follow by default
+fi
+TOTAL_NODES=$(( TRAIN_NODES + GEN_NODES ))
+PPS=$(( SAMPLES_PER_REPLICA * NUM_VLLM_REPLICAS / GPP ))
+GBS=$(( PPS * GPP ))
+CONCURRENCY=$(( GBS * 1 ))                         # GBS * max_trajectory_age_steps; the 1 is hard-coded, keep in sync with MAX_TRAJECTORY_AGE_STEPS
+if [ ${CONCURRENCY} -lt ${BASE_CONCURRENCY} ]; then CONCURRENCY=${BASE_CONCURRENCY}; fi
+
+# Sanity: training divisibility (also re-checks any TRAIN_NODES override).
+TRAIN_WORLD=$(( TRAIN_NODES * NUM_GPU ))
+if [ $(( TRAIN_WORLD % MODEL_PARALLEL )) -ne 0 ] || [ $(( TRAIN_WORLD % EXPERT_TMP )) -ne 0 ]; then
+  echo "ERROR: train world ${TRAIN_WORLD} (TRAIN_NODES=${TRAIN_NODES}) not divisible by model-parallel ${MODEL_PARALLEL} / expert ${EXPERT_TMP}." >&2
+  exit 1
+fi
+TRAIN_DP=$(( TRAIN_WORLD / MODEL_PARALLEL ))
+if [ $(( GBS % TRAIN_DP )) -ne 0 ]; then
+  echo "ERROR: GBS ${GBS} not divisible by train DP ${TRAIN_DP}." >&2
+  exit 1
+fi
+PER_GPU_BATCH=$(( GBS / TRAIN_DP ))
+PER_REPLICA_SAMPLES=$(( GBS / NUM_VLLM_REPLICAS ))
+
+# ===================== Sequence length & packing =====================
+SEQLEN=131072
+SEQUENCE_PACKING=True
+
+# ================= Sync/Async mode & async GRPO settings =================
+ASYNC_GRPO_ENABLED=True
+MAX_TRAJECTORY_AGE_STEPS=1
+FORCE_ON_POLICY_RATIO=True
+INFLIGHT_WEIGHT_UPDATE=True
+RECOMPUTE_KV_CACHE_AFTER_WEIGHT_UPDATES=False
+SEQ_LOGPROB_ERROR_THRESHOLD=null
+if [ "${ASYNC_GRPO_ENABLED}" = "True" ]; then
+  COLOCATED_ENABLED=False
+  VLLM_GPU_UTIL=0.8
+  OVERLAP_GRAD_REDUCE=False
+  ADVANTAGE_CLIP_LOW=-100
+  ADVANTAGE_CLIP_HIGH=100
+  TIS_THRESHOLD=5
+else
+  COLOCATED_ENABLED=True
+  VLLM_GPU_UTIL=0.5
+  OVERLAP_GRAD_REDUCE=True
+fi
+
+# ========================= GRPO / sampling =========================
+NORMALIZE_REWARDS=True
+OVERLONG_FILTERING=True
+
+# ========================== Loss function ==========================
+KL=0
+CLIP_MIN=0.2
+CLIP_MAX=0.28
+USE_ON_POLICY_KL_APPROXIMATION=True
+IMPORTANCE_SAMPLING_CORRECTION=True
+SEQ_LEVEL_IS=False
+TOKEN_LEVEL_LOSS=True
+
+# ============================ Optimizer ============================
+LR="1e-06"
+
+# =============================== MoE ===============================
+MOE_FREEZE_ROUTER=True
+MOE_PERMUTE_FUSION=True
+MOE_ENABLE_DEEPEP=False
+MOE_TOKEN_DISPATCHER_TYPE="alltoall"
+MOE_AUX_LOSS_COEFF=0
+MOE_ROUTER_LOAD_BALANCING_TYPE="none"
+MOE_ROUTER_BIAS_UPDATE_RATE="1e-3"
+
+# ======================= Generation / vLLM =======================
+TEMPERATURE=${TEMPERATURE:-1.0}
+
+# =================== Checkpointing & validation ===================
+SAVE_PERIOD=5
+VAL_PERIOD=1000
+KEEP_TOP_K=2
+
+# ============================ SWE agent ============================
+AGENT_MAX_TURNS=200
+AGENT_TIMEOUT=1800
+
+# ============================== Logging ==============================
+WANDB_PROJ="swe-benchmark"
+# Shared group for the whole generation-scaling sweep (compare runs by R).
+WANDB_GROUP="${WANDB_GROUP:-swe-gen-scale-linear}"
+# Log full trajectories to wandb so we can verify function_call items appear.
+LOG_GYM_RESPONSES=true
+
+# ========================= SLURM submission =========================
+SBATCH_ACCOUNT=${SBATCH_ACCOUNT:-nemotron_agents_dev}
+SBATCH_PARTITION=${SBATCH_PARTITION:-backfill}
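+SBATCH_TIME="${SBATCH_TIME:-4:0:0}"   # if changed, also adjust checkpointing.checkpoint_must_save_by (hard-coded to 00:03:35:00 in COMMAND)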
+# Optional smoke-test knob: cap training steps (appended as ++grpo.max_num_steps). Empty = use YAML default.
+MAX_NUM_STEPS="${MAX_NUM_STEPS:-}"
+
+# ========================= Experiment naming =========================
+if [ "${ASYNC_GRPO_ENABLED}" = "True" ]; then
+  SYNC_MODE="async-age${MAX_TRAJECTORY_AGE_STEPS}"
+else
+  SYNC_MODE="sync"
+fi
+EXP_SUFFIX="${EXP_SUFFIX:-swe-genscale-${SYNC_MODE}-genrep${NUM_VLLM_REPLICAS}-nodes${TOTAL_NODES}-pps${PPS}-gpp${GPP}-gbs${GBS}-lr${LR}}"
+WANDB_NAME="${EXP_SUFFIX}"
+CHECKPOINT_DIR="${CHECKPOINT_ROOT}/${EXP_SUFFIX}"
+SNAPSHOT_DIR="${REPO_ROOT}"
+
+mkdir -p "${CHECKPOINT_DIR}"
+
+# ============= Unified SLURM/Ray log location =============
+export BASE_LOG_DIR="${BASE_LOG_DIR:-${SNAPSHOT_DIR}/logs/swe_bench_scale}"
+mkdir -p "${BASE_LOG_DIR}"
+
+# ========================= Environment variables =========================
+# Credentials are NOT sourced here. Export these yourself before submitting:
+#   HF_HOME, HF_TOKEN, WANDB_API_KEY  (and GITHUB_TOKEN / GITLAB_TOKEN if needed)
+export HUGGINGFACE_TOKEN="${HUGGINGFACE_TOKEN:-${HF_TOKEN}}"
+export GITLAB_TOKEN="${GITLAB_TOKEN:-}"
+export HF_DATASETS_CACHE="${HF_DATASETS_CACHE:-${HF_HOME}/datasets}"
+export UV_CACHE_DIR="${UV_CACHE_DIR:-/tmp/uv_cache}"  # sglang: set to /root/.cache/uv (baked prebuilt wheels) to skip ~40min compile
+# Safe TE persistence (option B, seed-style — NO /root/.cache/uv override, so ray is untouched):
+# the SETUP_COMMAND below rsyncs this Lustre seed (a harvested /tmp/uv_cache that already has the
+# compiled transformer-engine wheel) into /tmp/uv_cache before the run, so the COMMAND's uv finds
+# the prebuilt TE and skips the ~20-40min recompile. Empty seed => harmless (falls back to compile).
+export LUSTRE_UV_CACHE_SEED="${LUSTRE_UV_CACHE_SEED:-}"
+export UV_LOCK_TIMEOUT=3600   # exported job environment; COMMAND below overrides it with UV_LOCK_TIMEOUT=900
+export RAY_DEDUP_LOGS=1
+export SSL_CERT_FILE=/etc/ssl/certs/ca-certificates.crt
+export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt
+export CURL_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt
+export OMP_NUM_THREADS=16
+
+# ========================= Node-local cache config =========================
+PERSISTENT_CACHE="${PERSISTENT_CACHE:-${HOME}/.cache/qwen3_30b_thinking_swe_scale}"
+export LUSTRE_VLLM_CACHE="${PERSISTENT_CACHE}/vllm_compile_cache"
+export LUSTRE_INDUCTOR_CACHE="${PERSISTENT_CACHE}/inductor_cache"
+export LUSTRE_TRITON_CACHE="${PERSISTENT_CACHE}/triton_cache"
+export NRL_VLLM_LOCAL_CACHE_DIR="/tmp/nemo_rl_vllm_cache"
+export NRL_VLLM_CACHE_SEED_DIR="/tmp/nemo_rl_vllm_cache_warm"
+export INDUCTOR_CACHE_DIR="/tmp/nemo_rl_inductor_cache"
+export TRITON_CACHE_DIR="/tmp/nemo_rl_triton_cache"
+export CACHE_SYNC_FREQUENCY=120
+mkdir -p "${LUSTRE_VLLM_CACHE}" "${LUSTRE_INDUCTOR_CACHE}" "${LUSTRE_TRITON_CACHE}"
+
+# ============================== Summary ==============================
+echo "=========================================="
+echo "SWE generation-scaling | Experiment: ${EXP_SUFFIX}"
+echo "Mode: ${SYNC_MODE}, Colocated: ${COLOCATED_ENABLED}"
+echo "wandb: project=${WANDB_PROJ}, group=${WANDB_GROUP}, name=${WANDB_NAME}"
+echo "------------------------------------------"
+echo "Scaling input:  NUM_VLLM_REPLICAS = ${NUM_VLLM_REPLICAS}  (R-step=${R_STEP})"
+echo "  replicas/node = ${REPLICAS_PER_NODE} (vllm_tp=${VLLM_TP})"
+echo "  GEN_NODES     = ${GEN_NODES}"
+echo "  TRAIN_NODES   = ${TRAIN_NODES}   (train_DP=${TRAIN_DP})"
+echo "  TOTAL_NODES   = ${TOTAL_NODES}"
+echo "  PPS           = ${PPS}"
+echo "  GPP           = ${GPP}"
+echo "  GBS           = ${GBS}"
+echo "  CONCURRENCY   = ${CONCURRENCY}"
+echo "  invariants    : samples/replica=${PER_REPLICA_SAMPLES}, batch/train-DP-rank=${PER_GPU_BATCH}"
+echo "Parallelism: TP=${TP}, EP=${EP}, CP=${CP}, PP=${PP}, vLLM_TP=${VLLM_TP}, pad=${MAKE_SEQ_DIVISIBLE_BY}"
+echo "Model: ${MODEL_PATH}"
+echo "Checkpoint: ${CHECKPOINT_DIR}"
+echo "=========================================="
+
+cd "${SNAPSHOT_DIR}"
+
+# ================ SETUP_COMMAND (bihu's: install apptainer + seed caches + uv sync) ================
+read -r -d '' SETUP_COMMAND <<SETUPEOF || true
+echo "[SETUP] Installing apptainer for SWE sandbox..."
+apt-get update && apt-get install -y git build-essential gcc wget 2>/dev/null || true
+RET=1
+RETRIES=3
+for attempt in \$(seq 1 \$RETRIES); do
+  if command -v apptainer >/dev/null 2>&1 || command -v singularity >/dev/null 2>&1; then
+    echo "[SETUP] singularity/apptainer already available"
+    RET=0
+    break
+  fi
+  (cd /tmp && \
+  wget --no-check-certificate -q https://github.com/apptainer/apptainer/releases/download/v1.3.1/apptainer_1.3.1_amd64.deb && \
+  apt install -y ./apptainer_1.3.1_amd64.deb && \
+  ln -sf /usr/bin/apptainer /usr/bin/singularity)  # subshell keeps the cwd for the later uv sync
+  if command -v apptainer >/dev/null 2>&1; then
+    echo "[SETUP] apptainer installed successfully"
+    RET=0
+    break
+  fi
+  echo "[SETUP] apptainer install attempt \$attempt failed, retrying..."
+  sleep 10
+done
+if [ \$RET -ne 0 ]; then
+  echo "[SETUP] WARNING: apptainer installation failed after \$RETRIES attempts"
+fi
+
+echo "[CACHE SEED] Clearing stale /tmp caches and seeding from Lustre..."
+rm -rf /tmp/nemo_rl_vllm_cache /tmp/nemo_rl_vllm_cache_*
+rm -rf "${INDUCTOR_CACHE_DIR}" "${TRITON_CACHE_DIR}"
+mkdir -p "${INDUCTOR_CACHE_DIR}" "${TRITON_CACHE_DIR}"
+
+find "${LUSTRE_INDUCTOR_CACHE}" -maxdepth 1 -name '.tmp_*' -mmin +30 -exec rm -rf {} + 2>/dev/null || true
+find "${LUSTRE_TRITON_CACHE}" -maxdepth 1 -name '.tmp_*' -mmin +30 -exec rm -rf {} + 2>/dev/null || true
+
+_seed_cache() {
+  local lustre="\$1" local_dir="\$2" name="\$3"
+  if [ -d "\$lustre" ] && [ "\$(ls -A "\$lustre" 2>/dev/null)" ]; then
+    rsync -a --exclude '.tmp_*' "\$lustre/" "\$local_dir/" 2>/dev/null \
+      && echo "[CACHE SEED] \$name: seeded from Lustre" \
+      || echo "[CACHE SEED] \$name: seed failed (non-fatal)"
+  else
+    echo "[CACHE SEED] \$name: no warm cache on Lustre yet"
+  fi
+}
+
+_seed_cache "${LUSTRE_INDUCTOR_CACHE}" "${INDUCTOR_CACHE_DIR}" "Inductor"
+_seed_cache "${LUSTRE_TRITON_CACHE}" "${TRITON_CACHE_DIR}" "Triton"
+mkdir -p /tmp/uv_cache
+_seed_cache "${LUSTRE_UV_CACHE_SEED}" "/tmp/uv_cache" "uv (prebuilt transformer-engine)"
+echo "[CACHE SEED] Done."
+
+UV_HTTP_TIMEOUT=3600 \
+  uv sync --frozen --extra mcore
+SETUPEOF
+export SETUP_COMMAND
+
+# ================ Training command (bihu-style: uv run --frozen --extra mcore) ================
+# ===== backend-specific generation overrides (single-line; expanded into COMMAND) =====
+if [ "${BACKEND}" = "sglang" ]; then
+  GEN_BACKEND_OVERRIDES="++policy.generation.backend=sglang ++policy.generation.sglang_cfg.model_path=${MODEL_PATH} ++policy.generation.sglang_cfg.random_seed=42 ++policy.generation.sglang_cfg.dp_size=1 ++policy.generation.sglang_cfg.ep_size=1 ++policy.generation.sglang_cfg.pp_size=1 ++policy.generation.sglang_cfg.skip_server_warmup=true ++policy.generation.sglang_cfg.context_length=${SEQLEN} ++policy.generation.sglang_cfg.dtype=bfloat16 ++policy.generation.sglang_cfg.mem_fraction_static=0.55 ++policy.generation.sglang_cfg.disable_piecewise_cuda_graph=true ++policy.generation.sglang_cfg.disable_cuda_graph=${SGLANG_DISABLE_CUDA_GRAPH:-false} ++policy.generation.sglang_cfg.tool_call_parser=hermes ++policy.generation.sglang_cfg.reasoning_parser=qwen3-thinking ++policy.generation.sglang_cfg.chat_template=${SGLANG_CHAT_TEMPLATE} ++policy.generation.sglang_server.needs_offload=false ++policy.generation.sglang_server.cpu_weight_backup=false ++policy.generation.sglang_server.sglang_server_concurrency=${CONCURRENCY} ++policy.generation.sglang_server.pause_generation_mode=retract ++policy.generation.sglang_server.num_gpus=$(( GEN_NODES * NUM_GPU )) ++policy.generation.sglang_server.num_gpus_per_engine=${NUM_GPU} ++policy.generation.sglang_router.enabled=false ++env.nemo_gym.policy_model.responses_api_models.vllm_model.engine=sglang ++env.nemo_gym.policy_model.responses_api_models.vllm_model.sglang_chat_template_path=${SGLANG_CHAT_TEMPLATE} ++env.nemo_gym.policy_model.responses_api_models.vllm_model.sglang_max_total_sequence_length=${SEQLEN}"
+else
+  GEN_BACKEND_OVERRIDES="++policy.generation.backend=vllm policy.generation.vllm_cfg.tensor_parallel_size=${VLLM_TP} policy.generation.vllm_cfg.gpu_memory_utilization=${VLLM_GPU_UTIL} policy.generation.vllm_cfg.skip_tokenizer_init=False"
+fi
+
+export COMMAND="NRL_VLLM_USE_V1=1 \
+  NRL_REFIT_BUFFER_MEMORY_RATIO=0.018 \
+  NRL_WG_USE_RAY_REF=1 \
+  WANDB_API_KEY=${WANDB_API_KEY} \
+  WANDB_MODE=${WANDB_MODE} \
+  HUGGINGFACE_TOKEN=${HUGGINGFACE_TOKEN} \
+  GITHUB_TOKEN=${GITHUB_TOKEN} \
+  GITLAB_TOKEN=${GITLAB_TOKEN} \
+  HF_HOME=${HF_HOME} \
+  HF_DATASETS_CACHE=${HF_DATASETS_CACHE} \
+  UV_CACHE_DIR=${UV_CACHE_DIR} \
+  VLLM_ATTENTION_BACKEND=FLASH_ATTN \
+  VLLM_CACHE_ROOT=${LUSTRE_VLLM_CACHE} \
+  DG_JIT_CACHE_DIR=${LUSTRE_VLLM_CACHE}/deep_gemm \
+  VLLM_DEEP_GEMM_WARMUP=skip \
+  NRL_FORCE_REBUILD_VENVS=false \
+  NRL_IGNORE_VERSION_MISMATCH=1 \
+  RAY_ENABLE_UV_RUN_RUNTIME_ENV=0 \
+  UV_HTTP_TIMEOUT=3600 \
+  UV_LOCK_TIMEOUT=900 \
+  TORCH_CUDA_ARCH_LIST='9.0 10.0' \
+  NEMO_GYM_SKIP_VENV_IF_PRESENT=1 \
+  uv run --frozen --extra mcore ./examples/nemo_gym/run_grpo_nemo_gym.py \
+  --config=${CONFIG_FILE} \
+  cluster.num_nodes=${TOTAL_NODES} \
+  cluster.gpus_per_node=${NUM_GPU} \
+  ++data.train.data_path=${TRAIN_DATA_PATH} \
+  ++data.validation.data_path=${VAL_DATA_PATH} \
+  grpo.num_prompts_per_step=${PPS} \
+  grpo.num_generations_per_prompt=${GPP} \
+  grpo.val_at_start=False \
+  grpo.normalize_rewards=${NORMALIZE_REWARDS} \
+  grpo.overlong_filtering=${OVERLONG_FILTERING} \
+  grpo.val_period=${VAL_PERIOD} \
+  grpo.seq_logprob_error_threshold=${SEQ_LOGPROB_ERROR_THRESHOLD} \
+  grpo.async_grpo.enabled=${ASYNC_GRPO_ENABLED} \
+  grpo.async_grpo.in_flight_weight_updates=${INFLIGHT_WEIGHT_UPDATE} \
+  grpo.async_grpo.recompute_kv_cache_after_weight_updates=${RECOMPUTE_KV_CACHE_AFTER_WEIGHT_UPDATES} \
+  grpo.async_grpo.max_trajectory_age_steps=${MAX_TRAJECTORY_AGE_STEPS} \
+  env.should_log_nemo_gym_responses=${LOG_GYM_RESPONSES} \
+  policy.generation.colocated.enabled=${COLOCATED_ENABLED} \
+  policy.model_name=${MODEL_PATH} \
+  policy.max_total_sequence_length=${SEQLEN} \
+  policy.dynamic_batching.enabled=False \
+  policy.train_global_batch_size=${GBS} \
+  policy.make_sequence_length_divisible_by=${MAKE_SEQ_DIVISIBLE_BY} \
+  policy.offload_optimizer_for_logprob=true \
+  policy.sequence_packing.enabled=${SEQUENCE_PACKING} \
+  policy.megatron_cfg.tensor_model_parallel_size=${TP} \
+  policy.megatron_cfg.expert_model_parallel_size=${EP} \
+  policy.megatron_cfg.context_parallel_size=${CP} \
+  policy.megatron_cfg.pipeline_model_parallel_size=${PP} \
+  policy.megatron_cfg.sequence_parallel=True \
+  policy.megatron_cfg.bias_activation_fusion=False \
+  policy.megatron_cfg.distributed_data_parallel_config.overlap_grad_reduce=${OVERLAP_GRAD_REDUCE} \
+  policy.megatron_cfg.moe_permute_fusion=${MOE_PERMUTE_FUSION} \
+  policy.megatron_cfg.moe_enable_deepep=${MOE_ENABLE_DEEPEP} \
+  policy.megatron_cfg.moe_token_dispatcher_type=${MOE_TOKEN_DISPATCHER_TYPE} \
+  policy.megatron_cfg.moe_aux_loss_coeff=${MOE_AUX_LOSS_COEFF} \
+  policy.megatron_cfg.moe_router_load_balancing_type=${MOE_ROUTER_LOAD_BALANCING_TYPE} \
+  policy.megatron_cfg.moe_router_bias_update_rate=${MOE_ROUTER_BIAS_UPDATE_RATE} \
+  policy.megatron_cfg.freeze_moe_router=${MOE_FREEZE_ROUTER} \
+  policy.megatron_cfg.optimizer.lr=${LR} \
+  policy.megatron_cfg.optimizer.min_lr=${LR} \
+  policy.megatron_cfg.optimizer.weight_decay=0 \
+  policy.megatron_cfg.empty_unused_memory_level=2 \
+  policy.megatron_cfg.activation_checkpointing=True \
+  policy.generation.temperature=${TEMPERATURE} \
+  ${GEN_BACKEND_OVERRIDES} \
+  loss_fn.reference_policy_kl_penalty=${KL} \
+  loss_fn.ratio_clip_min=${CLIP_MIN} \
+  loss_fn.ratio_clip_max=${CLIP_MAX} \
+  loss_fn.use_on_policy_kl_approximation=${USE_ON_POLICY_KL_APPROXIMATION} \
+  loss_fn.use_importance_sampling_correction=${IMPORTANCE_SAMPLING_CORRECTION} \
+  loss_fn.sequence_level_importance_ratios=${SEQ_LEVEL_IS} \
+  loss_fn.token_level_loss=${TOKEN_LEVEL_LOSS} \
+  loss_fn.force_on_policy_ratio=${FORCE_ON_POLICY_RATIO} \
+  checkpointing.checkpoint_dir=${CHECKPOINT_DIR} \
+  checkpointing.save_period=${SAVE_PERIOD} \
+  checkpointing.keep_top_k=${KEEP_TOP_K} \
+  ++checkpointing.metric_name=train:total_reward/mean \
+  ++checkpointing.checkpoint_must_save_by=00:03:35:00 \
+  logger.wandb_enabled=True \
+  logger.wandb.name=${WANDB_NAME} \
+  logger.wandb.project=${WANDB_PROJ} \
+  ++logger.wandb.group=${WANDB_GROUP}"
+
+if [ "${ASYNC_GRPO_ENABLED}" = "True" ]; then
+  export COMMAND="${COMMAND} \
+  policy.generation.colocated.resources.num_nodes=${GEN_NODES} \
+  policy.generation.colocated.resources.gpus_per_node=${NUM_GPU} \
+  grpo.advantage_clip_low=${ADVANTAGE_CLIP_LOW} \
+  grpo.advantage_clip_high=${ADVANTAGE_CLIP_HIGH} \
+  loss_fn.truncated_importance_sampling_ratio=${TIS_THRESHOLD} \
+  env.nemo_gym.swe_agents_train.responses_api_agents.swe_agents.agent_max_turns=${AGENT_MAX_TURNS} \
+  env.nemo_gym.swe_agents_train.responses_api_agents.swe_agents.swebench_agent_timeout=${AGENT_TIMEOUT} \
+  env.nemo_gym.swe_agents_train.responses_api_agents.swe_agents.concurrency=${CONCURRENCY} \
+  env.nemo_gym.swe_agents_val.responses_api_agents.swe_agents.agent_max_turns=${AGENT_MAX_TURNS} \
+  env.nemo_gym.swe_agents_val.responses_api_agents.swe_agents.swebench_agent_timeout=${AGENT_TIMEOUT} \
+  env.nemo_gym.swe_agents_val.responses_api_agents.swe_agents.concurrency=${CONCURRENCY}"
+fi
+
+# Optional: cap training steps (smoke test).
+if [ -n "${MAX_NUM_STEPS}" ]; then
+  export COMMAND="${COMMAND} ++grpo.max_num_steps=${MAX_NUM_STEPS}"
+fi
+
+# Generation-only benchmark: no-op training (no optimizer) + disable checkpoint saving.
+if [ "${SKIP_TRAINING}" = "1" ]; then
+  export COMMAND="${COMMAND} ++grpo.gen_benchmark_skip_training=true checkpointing.enabled=false"
+fi
+
+# ================ Submit job (skipped under DRY_RUN=1) ================
+if [ "${DRY_RUN:-0}" = "1" ]; then
+  echo ""
+  echo "[DRY_RUN] Not submitting. Would run:"
+  echo "[DRY_RUN] COMMAND (backend-related overrides only):"; echo "$COMMAND" | tr ' ' '\n' | grep -E "backend=|sglang_|vllm_cfg" || true
+  echo "[DRY_RUN]   sbatch --nodes=${TOTAL_NODES} --account=${SBATCH_ACCOUNT} --partition=${SBATCH_PARTITION} --time=${SBATCH_TIME} --gres=gpu:${NUM_GPU} ... ray.sub"
+  cd - > /dev/null
+  exit 0
+fi
+
+# ===== PERSISTENT (idle Ray cluster) mode =====
+if [ "${PERSISTENT:-0}" = "1" ]; then
+  DRIVER_FILE="${REPO_ROOT}/swe2_driver_${EXP_SUFFIX}.cmd"
+  printf '%s\n' "${COMMAND}" > "${DRIVER_FILE}"; chmod 600 "${DRIVER_FILE}"   # COMMAND embeds API tokens
+  echo "[PERSISTENT] driver command saved -> ${DRIVER_FILE}"
+  export COMMAND=""
+fi
+
+sbatch \
+  --nodes="${TOTAL_NODES}" \
+  --account="${SBATCH_ACCOUNT}" \
+  --job-name="${WANDB_NAME}" \
+  --partition="${SBATCH_PARTITION}" \
+  --time="${SBATCH_TIME}" \
+  --gres=gpu:${NUM_GPU} \
+  --output="${BASE_LOG_DIR}/slurm-%j.out" \
+  --exclusive \
+  --dependency=singleton \
+  --comment='{"OccupiedIdleGPUsJobReaper":{"exemptIdleTimeMins":"180","reason":"data_loading","description":"Async GRPO SWE generation-scaling benchmark"}}' \
+  ray.sub | tee /dev/stderr | grep -o '[0-9]\+' > latest_scale_gen_job_id.txt
+
+JOB_ID="$(cat latest_scale_gen_job_id.txt)"
+echo "=========================================="
+echo "Job submitted: ${EXP_SUFFIX}"
+echo "Job ID: ${JOB_ID}"
+echo "wandb group: ${WANDB_GROUP}"
+echo "Monitor with: squeue -j ${JOB_ID}"
+echo "Ray/SLURM logs: ${BASE_LOG_DIR}/${JOB_ID}-logs/"
+echo "Checkpoints: ${CHECKPOINT_DIR}/"
+echo "=========================================="
+
+cd - > /dev/null