Skip to content

feat: add a Ray Jobs API execution backend#1479

Open
nicgupta-nvidia wants to merge 27 commits into
mainfrom
feat/ray-jobs-backend
Open

feat: add a Ray Jobs API execution backend#1479
nicgupta-nvidia wants to merge 27 commits into
mainfrom
feat/ray-jobs-backend

Conversation

@nicgupta-nvidia

@nicgupta-nvidia nicgupta-nvidia commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator

What

Add an additive Ray Jobs API execution backend to NeMo-Skills. With backend.name: ray plus a dashboard_url (Jobs API) or endpoint (Ray Client), a pipeline submits its commands to a pre-provisioned Ray cluster via JobSubmissionClient instead of launching through Slurm. It coexists with the existing executors — no behavior change when unused. Supports Kubernetes (KubeRay) and precreated clusters, optional entrypoint label selectors / image-derived placement labels, env-var forwarding into each job's runtime_env, and a per-experiment Ray Jobs queue with dependency ordering.

Design

  • New ExecutionBackend hook interface and get_execution_backend(cluster_config) resolver in pipeline/utils/backends.py; the default backend preserves existing Slurm/local behavior. Wired into cli.py and generate.py.
  • RayBackend (pipeline/utils/ray_backend.py) submits, tracks, and cancels jobs via JobSubmissionClient. Commands are queued on the experiment (_ns_ray_jobs_queue) and submitted/tracked/cancelled centrally in start_experiment, with bounded concurrency and cross-experiment dependency ordering.
  • Cluster-configured env (required_env_vars, etc.) is forwarded into each job's runtime_env.env_vars, so endpoint/judge API keys and HF_TOKEN reach jobs regardless of the Ray head's environment.
  • Explicit entrypoint_label_selector keys take precedence over image-derived label selectors (static-wins, matching the sibling image_label_key path).
  • Backwards compatible: with no backend configured, behavior is unchanged; the legacy with_ray flag resolves to the Ray backend.
  • Includes a sample precreated-Ray-on-Kubernetes cluster config under cluster_configs/.

Testing

tests/test_backends.py (16 tests): backend resolution across config shapes, env-var forwarding, image-label-selector precedence, the dependency resolver, and poll-failure cleanup. ruff check/format clean; DCO signed on all commits. The Jobs API submission path has been exercised end-to-end against a real pre-provisioned Ray cluster (hello-world submission, data-prep, SFT, and parallel eval fan-out).

Review feedback

Recent commits address review (CodeRabbit) findings:

  • Forward external run_after dependencies into the Ray queue (declarative.py, exp.py) so cross-experiment ordering is preserved.
  • Tighten the dependency resolver to require a proven SUCCEEDED for in-batch dependencies (no premature start) while still treating truly cross-experiment dependencies as gated upstream — resolving the earlier false "dependency deadlock" without regressing it.
  • Stop submitted jobs best-effort on a Jobs-API polling failure instead of leaving them running.
  • Treat dashboard_url as a precreated-cluster signal; stop downgrading an explicit precreated_cluster: true.
  • Narrow the experiment-lookup fallback to FileNotFoundError.
  • Move the sample preflight block to top-level config scope (where it is actually read).

Summary by CodeRabbit

  • New Features

    • Added support for running pipelines on a pre-created Ray cluster with Kubernetes-backed scheduling.
    • Introduced broader backend selection so jobs can run through the appropriate execution path automatically.
    • Added a new sample configuration for SLURM-based runs with Ray support.
  • Bug Fixes

    • Improved handling of task dependencies across experiments and within Ray-backed runs.
    • Fixed dry-run and backend-routing behavior so task planning matches the selected execution mode.
    • Improved cleanup and status tracking when Ray jobs fail or are stopped.

Add an ExecutionBackend abstraction plus a RayBackend that submits
pipeline stages to a pre-provisioned Ray cluster over the HTTP Ray Jobs
API, selected via cluster_config (backend.name: ray + dashboard_url).

- backends.py (new): ExecutionBackend interface + get_execution_backend()
  resolver (default/none/ray + kubernetes-ray aliases + legacy with_ray
  compat), selector/metadata helpers.
- ray_backend.py (new): RayBackend - dashboard-url normalization,
  preflight checks, concurrent dependency-ordered Jobs-API submission,
  manifest.jsonl + per-job logs.
- exp.py: backend env-overrides + ray-precreated gate in get_executor;
  add_task stage_metadata + queue_ray_job_commands; populate
  command_images at each command site (fixes dead image-label selection);
  run_exp backend delegation.
- declarative.py: resolve the backend and queue Ray Jobs on the RL
  Pipeline path (used by the rollout/verify GRPO stages).
- __init__.py: export backend symbols.
- tests/test_backends.py (new): selector/metadata unit tests.
- cluster_configs/example-slurm-ray-k8s-precreated.yaml (new): reference.

Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
generate.py: detect backend.name=ray and thread with_ray into the
Pipeline; keep sequential mode off when running via the Ray backend.
cli.py: guard the optional nemo_evaluator import so non-evaluator
commands stay usable when its launcher dep is absent.

Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
…ence over image selectors

The image_label_selectors branch used dict.update(), which overwrote
keys that were explicitly set via entrypoint_label_selector with
image-derived values. Switched to setdefault so explicitly-set static
selector keys win over image-derived defaults, consistent with the
image_label_key branch.

Fixes test_image_label_selectors_prefer_static_selector_keys.

Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
When stages reuse one Experiment (exp.add(name="nemo-run")), a job's
dependency names refer to nemo-run task handles, but completion was
tracked only by task_name, so handle-named deps pointing at a
prior/reused experiment were never satisfied and the resolver raised a
false deadlock. Stamp each job's predicted nemo-run handle at queue time
and record completion under both keys; treat deps on an already-finished
prior experiment as satisfied (a genuinely failed dep still raises).

Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
…_env

Jobs were submitted with no runtime_env, so each inherited only the
cluster head's environment; endpoint credentials configured for the
cluster (NVIDIA_API_KEY / OPENAI_API_KEY for a judge, HF_TOKEN) never
reached the job unless present in the head's env. Resolve the cluster env
via get_env_variables() and pass it as runtime_env.env_vars on submit.

Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
@nicgupta-nvidia nicgupta-nvidia self-assigned this Jun 9, 2026
@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • ✅ Review completed - (🔄 Check again to review again)
📝 Walkthrough

Walkthrough

Adds a pluggable ExecutionBackend, implements a RayBackend using the Ray Jobs API (with preflight, dependency-aware submission, polling, and stop), wires backend-driven staging and Ray queueing into pipeline generation and execution, guards an optional CLI import, supplies a SLURM+precreated-Ray example config, and adds backend tests.

Changes

Ray and SLURM backend execution refactoring

Layer / File(s) Summary
Backend contract and base implementation
nemo_skills/pipeline/utils/backends.py, nemo_skills/pipeline/utils/__init__.py
Defines backend config normalization, BackendRunOptions, ExecutionBackend interface (stage_metadata, get_env_overrides, start_experiment, track_experiment, stop_experiment), and helper functions get_execution_backend, track_stage_tasks, and stop_stage_tasks.
RayBackend concurrent job submission and tracking
nemo_skills/pipeline/utils/ray_backend.py
Implements RayBackend with Ray Jobs API submission via ThreadPoolExecutor, dependency-aware ready-queue, per-job polling and log streaming, JSONL manifest logging, preflight validation, and track/stop implementations.
Pipeline generation and backend detection
nemo_skills/pipeline/generate.py
Detects backend from cluster_config (supports backend/execution_backend), sets with_ray_pipeline when backend is ray, passes it into Pipeline, and updates sequential logic accordingly.
Executor and task scheduling backend integration
nemo_skills/pipeline/utils/exp.py
Merges backend env overrides into executor env, validates Ray precreated-cluster mode, tracks per-command container images, stages metadata via backend, introduces queue_ray_job_commands() to queue per-command Ray job entries, and routes run_exp through the selected backend with BackendRunOptions.
Declarative pipeline metadata and Ray command queueing
nemo_skills/pipeline/utils/declarative.py
Initializes shared_packager, tracks a single metadata_container_image, replaces with_ray/slurm-only metadata selection with backend-driven staging via get_execution_backend(), and invokes queue_ray_job_commands() with collected commands/images/deps.
Experiment execution orchestration via backend
nemo_skills/pipeline/utils/exp.py (run_exp function)
Constructs backend and BackendRunOptions, calls backend.start_experiment() for SLURM and non-SLURM flows, supports dry-run early return, and preserves SLURM-only code-caching logic.
CLI optional dependency handling and example configuration
nemo_skills/pipeline/cli.py, cluster_configs/example-slurm-ray-k8s-precreated.yaml
Guards nemo_evaluator import with try/except to avoid hard failures; adds example YAML for SLURM executor running tasks on a pre-created Ray cluster on Kubernetes (containers, SLURM fields, mounts, env_vars, optional preflight/selector guidance).
Backend configuration and Ray environment tests
tests/test_backends.py
Adds tests for Kubernetes+Ray label selector translation and precedence, backend aliasing, image_label_selectors merging, env var forwarding into Ray runtime env, and runtime env normalization behavior.

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • NVIDIA-NeMo/Skills#1453: Modifies Ray-related add_task dependency/queueing logic that overlaps this PR's backend-driven Ray job command/dependency wiring.
  • NVIDIA-NeMo/Skills#1435: Related Ray execution wiring changes and prior work toward routing tasks through Ray-specific flows.
  • NVIDIA-NeMo/Skills#1381: Prior edits to Ray/sandbox task execution wiring in get_executor/add_task that intersect with this PR's changes.

Suggested reviewers

  • gwarmstrong
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 31.43% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main feature addition: a Ray Jobs API execution backend for the pipeline system.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/ray-jobs-backend

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_skills/pipeline/utils/exp.py (1)

245-268: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use the resolved backend for the multi-node Ray gate.

get_execution_backend() already infers Ray precreated-cluster mode from endpoint, ray_endpoint, Kubernetes aliases, and endpoint presence. This local re-parse only accepts name == "ray" plus an explicit precreated_cluster=true, so valid configs can still fail the non-SLURM multi-node check.

Suggested fix
-    env_vars = get_env_variables(cluster_config)
-    backend_env = get_execution_backend(cluster_config, with_ray=with_ray).get_env_overrides()
+    env_vars = get_env_variables(cluster_config)
+    backend = get_execution_backend(cluster_config, with_ray=with_ray)
+    backend_env = backend.get_env_overrides()
     if backend_env:
         env_vars.update(backend_env)
@@
-    backend_config = cluster_config.get("backend") or cluster_config.get("execution_backend") or {}
-    if isinstance(backend_config, str):
-        backend_config = {"name": backend_config}
-    backend_name = str(backend_config.get("name", "")).strip().lower()
-    is_ray_precreated = backend_name == "ray" and bool(backend_config.get("precreated_cluster", False))
-    if is_ray_precreated and not backend_config.get("endpoint"):
+    is_ray_precreated = getattr(backend, "name", "") == "ray" and bool(
+        getattr(backend, "precreated_cluster", False)
+    )
+    if is_ray_precreated and not getattr(backend, "endpoint", None):
         raise ValueError(
             "Invalid cluster_config: backend.precreated_cluster=true requires backend.endpoint to be set."
         )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/pipeline/utils/exp.py` around lines 245 - 268, The multi-node Ray
gate should use the resolved execution backend instead of re-parsing
cluster_config; capture the result of get_execution_backend(cluster_config,
with_ray=with_ray) (the same object used to get_env_overrides) and ask that
backend whether it's a Ray precreated cluster (e.g., use backend.name /
backend.is_precreated or backend.has_endpoint semantics provided by that object)
when computing is_ray_precreated, and remove the fragile local backend_config
string-parse; ensure the prior backend object is used for the
endpoint/precreated detection so valid configs inferred by get_execution_backend
won't be rejected.
🧹 Nitpick comments (1)
tests/test_backends.py (1)

149-160: ⚡ Quick win

Add a regression assertion for the dashboard-only backend path.

These tests already build the backend.name: ray + dashboard_url configuration, but they only inspect _build_runtime_env(). Please also assert that backend.stage_metadata() does not enable an embedded Ray cluster for this setup, so the pre-provisioned-cluster contract stays covered.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_backends.py` around lines 149 - 160, The test
test_ray_backend_forwards_required_env_vars_to_runtime_env only checks
runtime_env but must also assert that a dashboard-only Ray config does not
enable an embedded cluster; after creating backend =
get_execution_backend(cluster_config) and before/after calling
_build_runtime_env(), call backend.stage_metadata() and assert that it does not
request an embedded cluster (e.g., ensure stage_metadata() either lacks keys
like "create_embedded_cluster"/"provision_cluster" or returns metadata
indicating pre_provisioned/dashboard_only), so add an assertion referencing
backend.stage_metadata() to confirm no embedded Ray cluster is enabled for the
provided dashboard_url-only config.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cluster_configs/example-slurm-ray-k8s-precreated.yaml`:
- Around line 84-103: The preflight YAML block is placed under backend scope but
ray_backend.py expects cluster_config.preflight at the top level; move the
entire preflight block out from under the backend mapping to the top-level of
the cluster config (same indentation as backend), preserving keys (enabled,
require_ray_endpoint, strict_label_check, required_node_labels,
min_cluster_resources) so that ray_backend.py will find cluster_config.preflight
correctly. Ensure indentation and YAML structure remain valid and merge with any
existing top-level keys rather than nesting under backend.

In `@nemo_skills/pipeline/utils/backends.py`:
- Around line 199-259: The precreated-cluster signal currently only checks
endpoint and can get overwritten (e.g., in the control_plane=="kubernetes"
branch), so treat dashboard_url as a valid precreated-cluster indicator: update
the precreated_cluster computation to include bool(dashboard_url) (or
backend_config.get("precreated_cluster") flag) and use that precreated_cluster
variable in all RayBackend constructions (replace literal bool(endpoint) uses in
the control_plane=="kubernetes" and in the "kubernetes-ray"/"ray-kubernetes"
branch). Reference: the precreated_cluster variable at the top of the branch,
the control_plane=="kubernetes" RayBackend call, and the RayBackend return in
the ray-kubernetes branch.
- Around line 282-285: The current try/except around
run.Experiment.from_title(exp_or_name) is too broad and swallows real errors;
narrow it to only catch the “experiment not found” case (match the behavior in
get_exp_handles) by catching FileNotFoundError (or the specific not-found
exception raised by run.Experiment.from_title) and then calling
run.Experiment.from_id(exp_or_name); re-raise any other exceptions so unexpected
errors aren’t masked. Reference: the try block using run.Experiment.from_title
and the fallback to run.Experiment.from_id.

In `@nemo_skills/pipeline/utils/declarative.py`:
- Around line 925-933: The call to queue_ray_job_commands is only passing
internal_deps, dropping external_deps derived from run_after/external experiment
refs; update the invocation in declarative.py to include external_deps by
combining them with internal_deps (e.g., extend or create task_dependencies =
internal_deps + external_deps) and pass that combined list to the
task_dependencies parameter of queue_ray_job_commands so Ray Jobs API ordering
(via _ns_ray_jobs_queue) respects external pipeline dependencies as well.

In `@nemo_skills/pipeline/utils/exp.py`:
- Around line 915-925: The Ray queueing call to queue_ray_job_commands is not
receiving the resolved external experiment ordering (run_after/dependencies), so
queued jobs can start before prerequisite experiments finish; update the call
site around queue_ray_job_commands to forward the resolved external dependency
info (e.g. pass run_after or the resolved dependencies list) into the queued
task dependency parameter (task_dependencies) or add dependencies to
task_dependencies before calling queue_ray_job_commands so the Ray Jobs API
honors experiment ordering (refer to symbols queue_ray_job_commands, run_after,
dependencies, task_dependencies, and start_experiment).

In `@nemo_skills/pipeline/utils/ray_backend.py`:
- Around line 570-637: In _deps_satisfied, stop treating any dep not in
producible_dep_names as automatically satisfied; instead require explicit proof
of prior completion: for each dep in dep_task_names, return True only if
completed.get(d) == _SUCCESS_STATE, and if d is not in producible_dep_names but
is present in completed and equals _SUCCESS_STATE treat it as satisfied;
otherwise consider the dependency unresolved and raise a clear
missing-dependency error (e.g., ValueError or a dedicated
MissingDependencyError) including the job/task name and missing dep name so the
caller can fail the batch rather than silently proceed. Ensure references to
producible_dep_names, completed, _SUCCESS_STATE, and _deps_satisfied are used to
locate and implement this check.
- Around line 579-624: _poll_until_done currently calls client.get_job_status()
without exception handling; wrap the get_job_status() call in a try/except,
catch Exception as e, and on error call self._stop_jobs_best_effort(...) to
best-effort stop any in-flight jobs for the current submission/job (use
submission_id and job_id as available), then re-raise the exception so upstream
(_submit_jobs_concurrently / done_future.result()) fails consistently; ensure
you still convert the status via self._to_status_str when no exception occurs.

---

Outside diff comments:
In `@nemo_skills/pipeline/utils/exp.py`:
- Around line 245-268: The multi-node Ray gate should use the resolved execution
backend instead of re-parsing cluster_config; capture the result of
get_execution_backend(cluster_config, with_ray=with_ray) (the same object used
to get_env_overrides) and ask that backend whether it's a Ray precreated cluster
(e.g., use backend.name / backend.is_precreated or backend.has_endpoint
semantics provided by that object) when computing is_ray_precreated, and remove
the fragile local backend_config string-parse; ensure the prior backend object
is used for the endpoint/precreated detection so valid configs inferred by
get_execution_backend won't be rejected.

---

Nitpick comments:
In `@tests/test_backends.py`:
- Around line 149-160: The test
test_ray_backend_forwards_required_env_vars_to_runtime_env only checks
runtime_env but must also assert that a dashboard-only Ray config does not
enable an embedded cluster; after creating backend =
get_execution_backend(cluster_config) and before/after calling
_build_runtime_env(), call backend.stage_metadata() and assert that it does not
request an embedded cluster (e.g., ensure stage_metadata() either lacks keys
like "create_embedded_cluster"/"provision_cluster" or returns metadata
indicating pre_provisioned/dashboard_only), so add an assertion referencing
backend.stage_metadata() to confirm no embedded Ray cluster is enabled for the
provided dashboard_url-only config.
🪄 Autofix (Beta)

❌ Autofix failed (check again to retry)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0ced282f-6126-4320-8c35-2fa53987dbc1

📥 Commits

Reviewing files that changed from the base of the PR and between 24deb16 and 495836e.

📒 Files selected for processing (9)
  • cluster_configs/example-slurm-ray-k8s-precreated.yaml
  • nemo_skills/pipeline/cli.py
  • nemo_skills/pipeline/generate.py
  • nemo_skills/pipeline/utils/__init__.py
  • nemo_skills/pipeline/utils/backends.py
  • nemo_skills/pipeline/utils/declarative.py
  • nemo_skills/pipeline/utils/exp.py
  • nemo_skills/pipeline/utils/ray_backend.py
  • tests/test_backends.py

Comment thread cluster_configs/example-slurm-ray-k8s-precreated.yaml Outdated
Comment thread nemo_skills/pipeline/utils/backends.py Outdated
Comment thread nemo_skills/pipeline/utils/backends.py
Comment thread nemo_skills/pipeline/utils/declarative.py
Comment thread nemo_skills/pipeline/utils/exp.py
Comment thread nemo_skills/pipeline/utils/ray_backend.py Outdated
Comment thread nemo_skills/pipeline/utils/ray_backend.py
Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
Cross-experiment dependencies (resolved run_after handles) were dropped
when queuing Ray Jobs, so a job could submit before a prerequisite in
another experiment finished. Forward them alongside same-experiment
task_dependencies from both the add_task and declarative pipeline paths.

Tighten the queue's dependency resolver so a dependency produced by a
job in the current batch is satisfied only once that job reaches
SUCCEEDED (tracked by both task_name and nemo-run handle), preventing a
dependent from starting early. A dependency that matches no job in the
batch references a prior or cross-experiment job already gated by the
prerequisite experiment's blocking submission, so it stays treated as
satisfied and the resolver does not deadlock on a job it cannot observe.

Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
…okup

Resolve precreated_cluster as the OR of an explicit precreated_cluster
flag, a Ray Client endpoint, and a dashboard_url, and pass that into
every RayBackend construction (the ray branch, its kubernetes
sub-branch, and the kubernetes-ray alias). The dashboard-only path now
stages as a precreated cluster instead of requesting an embedded one,
and an explicit precreated_cluster: true is no longer downgraded.

Narrow the experiment-lookup fallback to catch only FileNotFoundError
from Experiment.from_title so unrelated errors propagate instead of
silently resolving by id, matching get_exp_handles.

Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
The commented preflight example was nested under backend:, but preflight
is read from the top level of the cluster config. Dedent it to a sibling
of backend: so uncommenting it takes effect, and note the scope inline.

Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
Extract the in-batch dependency-name set and the dependency predicate into
pure static methods (_compute_in_batch_dep_names, _deps_satisfied) and the
poll-failure cleanup into _handle_poll_failure, then call them from
_submit_jobs_concurrently. Behavior is unchanged; this only makes the logic
importable for unit tests.

Add tests covering: the dep-name set includes both handles and task_names
and ignores jobs missing them; an in-batch dependency blocks dependents
until it reaches SUCCEEDED (under task_name or handle); a dependency matching
no job in the batch is treated as satisfied; mixed in-batch/cross-batch deps
stay blocked until the in-batch one succeeds; empty/none deps are satisfied;
poll-failure cleanup stops submitted jobs once and chains the original error;
and best-effort stop keeps going when one stop call raises.

Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
… the multi-node gate

stage_metadata() and the non-Slurm multi-node guard treated a precreated cluster as reachable only via an endpoint, so a dashboard_url-only (Jobs API) config still requested an embedded Ray cluster and could fail the multi-node check. Use the resolved backend and accept dashboard_url alongside endpoint in both places. Add a regression test that a dashboard_url-only config resolves as precreated and does not request an embedded cluster.

Signed-off-by: Nick Gupta <nicgupta@nvidia.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_skills/pipeline/utils/ray_backend.py (1)

389-395: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve use_with_ray_cluster outside the precreated-cluster special case.

This branch now hardcodes embedded Ray metadata on for every non-precreated configuration, so callers that intentionally pass use_with_ray_cluster=False are ignored.

Suggested fix
-        if self.precreated_cluster and (self.endpoint or self.dashboard_url):
-            should_use_embedded_ray_cluster = False
-        else:
-            should_use_embedded_ray_cluster = True
+        if self.precreated_cluster and (self.endpoint or self.dashboard_url):
+            should_use_embedded_ray_cluster = False
+        else:
+            should_use_embedded_ray_cluster = use_with_ray_cluster

As per coding guidelines, "Avoid cases when user-passed parameters are unused!"

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/pipeline/utils/ray_backend.py` around lines 389 - 395, The code
currently overrides the caller's use_with_ray_cluster by unconditionally setting
should_use_embedded_ray_cluster=True for non-precreated clusters; instead
preserve the incoming use_with_ray_cluster unless the precreated-cluster special
case requires forcing it off. Update the logic around the call to
super().stage_metadata so that should_use_embedded_ray_cluster is initialized
from the method parameter use_with_ray_cluster and only set to False when
self.precreated_cluster and (self.endpoint or self.dashboard_url) are true, then
pass that variable into super().stage_metadata (symbols: use_with_ray_cluster
parameter, should_use_embedded_ray_cluster, super().stage_metadata,
self.precreated_cluster, self.endpoint, self.dashboard_url, metadata).

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@nemo_skills/pipeline/utils/ray_backend.py`:
- Around line 389-395: The code currently overrides the caller's
use_with_ray_cluster by unconditionally setting
should_use_embedded_ray_cluster=True for non-precreated clusters; instead
preserve the incoming use_with_ray_cluster unless the precreated-cluster special
case requires forcing it off. Update the logic around the call to
super().stage_metadata so that should_use_embedded_ray_cluster is initialized
from the method parameter use_with_ray_cluster and only set to False when
self.precreated_cluster and (self.endpoint or self.dashboard_url) are true, then
pass that variable into super().stage_metadata (symbols: use_with_ray_cluster
parameter, should_use_embedded_ray_cluster, super().stage_metadata,
self.precreated_cluster, self.endpoint, self.dashboard_url, metadata).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ff21699c-1283-47d5-8dc9-8a3225bb9d5c

📥 Commits

Reviewing files that changed from the base of the PR and between ec420c3 and 6a0dd18.

📒 Files selected for processing (3)
  • nemo_skills/pipeline/utils/exp.py
  • nemo_skills/pipeline/utils/ray_backend.py
  • tests/test_backends.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • nemo_skills/pipeline/utils/exp.py
  • tests/test_backends.py

… metadata

The non-precreated branch of stage_metadata hardcoded the embedded-cluster flag on, ignoring the caller's use_with_ray_cluster parameter. Initialize it from the parameter and only force it off for a precreated cluster reachable by endpoint or dashboard_url. Behavior is unchanged for the Ray callers (which pass True). Add a test that the flag is honored.

Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
Docstrings only; no behavior, signature, or import changes.

Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
A dashboard-only Jobs API cluster (name=ray + dashboard_url, no endpoint) is a valid precreated cluster, but preflight required an endpoint whenever precreated_cluster was set. Accept endpoint OR dashboard_url (matching stage_metadata and the multi-node gate); reachability checks default to whether an endpoint exists and are skipped when only dashboard_url is configured.

Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
A run_after naming another experiment whose tasks already finished resolves to empty handles and falls through as a bare experiment-name string in internal_deps (the _reuse_exp path). nemo-run's exp.add asserts every dependency is a job in the current experiment, raising "Dependency <name> not found" on the Mode-3 declarative eval path. Filter internal_deps against the experiment's own job ids (only when exp.jobs is a concrete list) before exp.add; cross-experiment ordering is still honored via external deps and the Ray queue resolver. Adds a regression test that simulates exp.jobs growth -- the prior dependency test mocked get_exp_handles to always return non-empty, hiding this path.

Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
_RayBackendShim was defined then immediately deleted and never referenced; RayBackend is already re-exported via the import and __all__. Remove the dead class and the misleading placeholder comment.

Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
…_ENV

A Ray job whose driver re-inits Ray with its own runtime_env (NeMo-RL GRPO/rollout via run_grpo_nemo_gym.py -> init_ray) fails with "Failed to merge the Job's runtime env ... because of a conflict" when a key (OPENAI_API_KEY, HF_HOME, ...) is present in both the job-level runtime_env we forward and the driver's ray.init runtime_env. Always set RAY_OVERRIDE_JOB_RUNTIME_ENV=1 in the submitted runtime_env so Ray merges (driver wins) instead of erroring. No effect on jobs that never re-init Ray.

Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
…ME_ENV

The runtime_env now always carries RAY_OVERRIDE_JOB_RUNTIME_ENV=1, so the normalize/filter test expects it in env_vars and the no-env case asserts the override dict instead of None.

Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
…shboard_url

A 'backend.name: ray' config with 'executor: none' targets a pre-provisioned Ray
Jobs cluster and needs a dashboard URL to submit to. When none resolved,
get_execution_backend silently treated it as a non-precreated (Ray-inside-Slurm)
backend, which then failed deep in nemo-run with the opaque
'use_with_ray_cluster is only supported for SlurmExecutor'. Raise an actionable
error up front pointing at backend.dashboard_url. Scoped to executor: none so the
valid Ray-inside-Slurm path (executor: slurm, no dashboard) is unaffected.

Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
…ckend

For executor='none', _rewrite_local_paths rewrote /nemo_run/code to the
driver's absolute nemo_skills install dir (e.g.
/usr/local/lib/python3.10/dist-packages). That is correct for true
run-on-this-host execution, but under the Ray Jobs API backend the command
is submitted to a remote cluster whose image and Python version need not
match the driver's, so that absolute path does not exist there -- the
eval-generation job fails with "cd: /usr/local/lib/python3.10/dist-packages:
No such file or directory".

When backend.name == "ray", rewrite to relative (cwd) paths instead,
mirroring the add_task path (which rewrites /nemo_run/code -> "./"). The job
then runs from its working directory and nemo_skills resolves via the baked
image's PYTHONPATH. Non-Ray behavior is unchanged.

Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
…pported

Ray's JobSubmissionClient.submit_job() gained entrypoint_label_selector in
Ray 2.55; older clients (e.g. Ray 2.54 in the nemo-rl image) raise TypeError
when it is passed, even as None. Build the submit_job kwargs incrementally and
include entrypoint_label_selector only when a selector is configured and the
installed Ray client's signature supports it (otherwise log a warning and
proceed). The common no-selector path now works against any Ray version the
cluster runs, while label routing is preserved on Ray >= 2.55.

Signed-off-by: Nick Gupta <nicgupta@nvidia.com>

@gwarmstrong gwarmstrong left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution. I focused this review specifically on whether the new Ray backend stays off the non-Ray code path (i.e., no behavior change for users who never opt in). Mostly it does — no hard runtime dependency on ray is added (the ray / JobSubmissionClient imports are all lazy), and the default-backend resolution faithfully reproduces old behavior in most spots. But a few ray-motivated refactors leaked onto the shared path.

Requesting changes — a few of which are blocking: they are genuine behavior changes on the non-Ray path. See the inline comments for specifics.

Comment thread nemo_skills/pipeline/utils/exp.py
Comment thread nemo_skills/pipeline/utils/declarative.py Outdated
Comment thread nemo_skills/pipeline/cli.py Outdated
Comment thread nemo_skills/pipeline/utils/declarative.py Outdated
Comment thread nemo_skills/pipeline/utils/exp.py Outdated
Comment thread nemo_skills/pipeline/utils/exp.py Outdated
Comment thread nemo_skills/pipeline/utils/backends.py Outdated
Addresses review feedback that a few Ray-motivated refactors changed behavior
for users who never opt into Ray. Each fix is gated so the default /
Slurm-without-Ray path matches origin/main.

- run_exp: honor --dry_run before the live mount check (it opens an SSH tunnel
  and stats sources), so a dry run stays offline-safe on every non-Ray caller.
- declarative: scope the finished-cross-experiment internal_deps drop to the Ray
  backend; the default backend keeps nemo-run's fail-loud "Dependency not found"
  instead of silently dropping a dependency.
- cli: drop the unrelated nemo_evaluator ModuleNotFoundError guard (reverts to
  the plain import; out of scope for this PR).
- add_task / declarative: gate use_with_ray_cluster on executor == "slurm" so
  legacy with_ray on a non-Slurm executor no longer requests an embedded Ray
  cluster (nemo-run asserts SlurmExecutor). Precreated Jobs-API clusters never
  embed -- RayBackend.stage_metadata already enforces that.
- _rewrite_local_paths / generate: parse the backend name through a shared
  get_backend_name() helper so the supported bare-string form (backend: ray) no
  longer raises AttributeError, and the duplicated parsing is removed.
- add_task: only resolve command_images when a Ray backend is active (the list
  is consumed solely by the Ray queue); avoids a docker inspect per command on
  the non-Ray path for dockerfile: specs.
- backends: import RayBackend lazily (only when a Ray backend is selected) via a
  local import + PEP 562 __getattr__, so the default path no longer imports the
  ~900-line ray_backend module. TYPE_CHECKING keeps the __all__ re-export valid.

Tests: add regression coverage for the dry-run guard, the ray-vs-default
cross-experiment dep handling, with_ray+non-Slurm metadata, Ray-only
command-image resolution, the backend-name helpers, and the lazy import; reframe
the existing cross-exp dep test as the Ray-backend case.

Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Note

Autofix is a beta feature. Expect some limitations and changes as we gather feedback and continue to improve it.

Autofix skipped. No unresolved CodeRabbit review comments with fix instructions found.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants