feat: add a Ray Jobs API execution backend#1479
Conversation
Add an ExecutionBackend abstraction plus a RayBackend that submits pipeline stages to a pre-provisioned Ray cluster over the HTTP Ray Jobs API, selected via cluster_config (backend.name: ray + dashboard_url). - backends.py (new): ExecutionBackend interface + get_execution_backend() resolver (default/none/ray + kubernetes-ray aliases + legacy with_ray compat), selector/metadata helpers. - ray_backend.py (new): RayBackend - dashboard-url normalization, preflight checks, concurrent dependency-ordered Jobs-API submission, manifest.jsonl + per-job logs. - exp.py: backend env-overrides + ray-precreated gate in get_executor; add_task stage_metadata + queue_ray_job_commands; populate command_images at each command site (fixes dead image-label selection); run_exp backend delegation. - declarative.py: resolve the backend and queue Ray Jobs on the RL Pipeline path (used by the rollout/verify GRPO stages). - __init__.py: export backend symbols. - tests/test_backends.py (new): selector/metadata unit tests. - cluster_configs/example-slurm-ray-k8s-precreated.yaml (new): reference. Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
generate.py: detect backend.name=ray and thread with_ray into the Pipeline; keep sequential mode off when running via the Ray backend. cli.py: guard the optional nemo_evaluator import so non-evaluator commands stay usable when its launcher dep is absent. Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
…ence over image selectors The image_label_selectors branch used dict.update(), which overwrote keys that were explicitly set via entrypoint_label_selector with image-derived values. Switched to setdefault so explicitly-set static selector keys win over image-derived defaults, consistent with the image_label_key branch. Fixes test_image_label_selectors_prefer_static_selector_keys. Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
When stages reuse one Experiment (exp.add(name="nemo-run")), a job's dependency names refer to nemo-run task handles, but completion was tracked only by task_name, so handle-named deps pointing at a prior/reused experiment were never satisfied and the resolver raised a false deadlock. Stamp each job's predicted nemo-run handle at queue time and record completion under both keys; treat deps on an already-finished prior experiment as satisfied (a genuinely failed dep still raises). Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
…_env Jobs were submitted with no runtime_env, so each inherited only the cluster head's environment; endpoint credentials configured for the cluster (NVIDIA_API_KEY / OPENAI_API_KEY for a judge, HF_TOKEN) never reached the job unless present in the head's env. Resolve the cluster env via get_env_variables() and pass it as runtime_env.env_vars on submit. Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a pluggable ExecutionBackend, implements a RayBackend using the Ray Jobs API (with preflight, dependency-aware submission, polling, and stop), wires backend-driven staging and Ray queueing into pipeline generation and execution, guards an optional CLI import, supplies a SLURM+precreated-Ray example config, and adds backend tests. ChangesRay and SLURM backend execution refactoring
🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_skills/pipeline/utils/exp.py (1)
245-268:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUse the resolved backend for the multi-node Ray gate.
get_execution_backend()already infers Ray precreated-cluster mode fromendpoint,ray_endpoint, Kubernetes aliases, and endpoint presence. This local re-parse only acceptsname == "ray"plus an explicitprecreated_cluster=true, so valid configs can still fail the non-SLURM multi-node check.Suggested fix
- env_vars = get_env_variables(cluster_config) - backend_env = get_execution_backend(cluster_config, with_ray=with_ray).get_env_overrides() + env_vars = get_env_variables(cluster_config) + backend = get_execution_backend(cluster_config, with_ray=with_ray) + backend_env = backend.get_env_overrides() if backend_env: env_vars.update(backend_env) @@ - backend_config = cluster_config.get("backend") or cluster_config.get("execution_backend") or {} - if isinstance(backend_config, str): - backend_config = {"name": backend_config} - backend_name = str(backend_config.get("name", "")).strip().lower() - is_ray_precreated = backend_name == "ray" and bool(backend_config.get("precreated_cluster", False)) - if is_ray_precreated and not backend_config.get("endpoint"): + is_ray_precreated = getattr(backend, "name", "") == "ray" and bool( + getattr(backend, "precreated_cluster", False) + ) + if is_ray_precreated and not getattr(backend, "endpoint", None): raise ValueError( "Invalid cluster_config: backend.precreated_cluster=true requires backend.endpoint to be set." )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nemo_skills/pipeline/utils/exp.py` around lines 245 - 268, The multi-node Ray gate should use the resolved execution backend instead of re-parsing cluster_config; capture the result of get_execution_backend(cluster_config, with_ray=with_ray) (the same object used to get_env_overrides) and ask that backend whether it's a Ray precreated cluster (e.g., use backend.name / backend.is_precreated or backend.has_endpoint semantics provided by that object) when computing is_ray_precreated, and remove the fragile local backend_config string-parse; ensure the prior backend object is used for the endpoint/precreated detection so valid configs inferred by get_execution_backend won't be rejected.
🧹 Nitpick comments (1)
tests/test_backends.py (1)
149-160: ⚡ Quick winAdd a regression assertion for the dashboard-only backend path.
These tests already build the
backend.name: ray+dashboard_urlconfiguration, but they only inspect_build_runtime_env(). Please also assert thatbackend.stage_metadata()does not enable an embedded Ray cluster for this setup, so the pre-provisioned-cluster contract stays covered.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_backends.py` around lines 149 - 160, The test test_ray_backend_forwards_required_env_vars_to_runtime_env only checks runtime_env but must also assert that a dashboard-only Ray config does not enable an embedded cluster; after creating backend = get_execution_backend(cluster_config) and before/after calling _build_runtime_env(), call backend.stage_metadata() and assert that it does not request an embedded cluster (e.g., ensure stage_metadata() either lacks keys like "create_embedded_cluster"/"provision_cluster" or returns metadata indicating pre_provisioned/dashboard_only), so add an assertion referencing backend.stage_metadata() to confirm no embedded Ray cluster is enabled for the provided dashboard_url-only config.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cluster_configs/example-slurm-ray-k8s-precreated.yaml`:
- Around line 84-103: The preflight YAML block is placed under backend scope but
ray_backend.py expects cluster_config.preflight at the top level; move the
entire preflight block out from under the backend mapping to the top-level of
the cluster config (same indentation as backend), preserving keys (enabled,
require_ray_endpoint, strict_label_check, required_node_labels,
min_cluster_resources) so that ray_backend.py will find cluster_config.preflight
correctly. Ensure indentation and YAML structure remain valid and merge with any
existing top-level keys rather than nesting under backend.
In `@nemo_skills/pipeline/utils/backends.py`:
- Around line 199-259: The precreated-cluster signal currently only checks
endpoint and can get overwritten (e.g., in the control_plane=="kubernetes"
branch), so treat dashboard_url as a valid precreated-cluster indicator: update
the precreated_cluster computation to include bool(dashboard_url) (or
backend_config.get("precreated_cluster") flag) and use that precreated_cluster
variable in all RayBackend constructions (replace literal bool(endpoint) uses in
the control_plane=="kubernetes" and in the "kubernetes-ray"/"ray-kubernetes"
branch). Reference: the precreated_cluster variable at the top of the branch,
the control_plane=="kubernetes" RayBackend call, and the RayBackend return in
the ray-kubernetes branch.
- Around line 282-285: The current try/except around
run.Experiment.from_title(exp_or_name) is too broad and swallows real errors;
narrow it to only catch the “experiment not found” case (match the behavior in
get_exp_handles) by catching FileNotFoundError (or the specific not-found
exception raised by run.Experiment.from_title) and then calling
run.Experiment.from_id(exp_or_name); re-raise any other exceptions so unexpected
errors aren’t masked. Reference: the try block using run.Experiment.from_title
and the fallback to run.Experiment.from_id.
In `@nemo_skills/pipeline/utils/declarative.py`:
- Around line 925-933: The call to queue_ray_job_commands is only passing
internal_deps, dropping external_deps derived from run_after/external experiment
refs; update the invocation in declarative.py to include external_deps by
combining them with internal_deps (e.g., extend or create task_dependencies =
internal_deps + external_deps) and pass that combined list to the
task_dependencies parameter of queue_ray_job_commands so Ray Jobs API ordering
(via _ns_ray_jobs_queue) respects external pipeline dependencies as well.
In `@nemo_skills/pipeline/utils/exp.py`:
- Around line 915-925: The Ray queueing call to queue_ray_job_commands is not
receiving the resolved external experiment ordering (run_after/dependencies), so
queued jobs can start before prerequisite experiments finish; update the call
site around queue_ray_job_commands to forward the resolved external dependency
info (e.g. pass run_after or the resolved dependencies list) into the queued
task dependency parameter (task_dependencies) or add dependencies to
task_dependencies before calling queue_ray_job_commands so the Ray Jobs API
honors experiment ordering (refer to symbols queue_ray_job_commands, run_after,
dependencies, task_dependencies, and start_experiment).
In `@nemo_skills/pipeline/utils/ray_backend.py`:
- Around line 570-637: In _deps_satisfied, stop treating any dep not in
producible_dep_names as automatically satisfied; instead require explicit proof
of prior completion: for each dep in dep_task_names, return True only if
completed.get(d) == _SUCCESS_STATE, and if d is not in producible_dep_names but
is present in completed and equals _SUCCESS_STATE treat it as satisfied;
otherwise consider the dependency unresolved and raise a clear
missing-dependency error (e.g., ValueError or a dedicated
MissingDependencyError) including the job/task name and missing dep name so the
caller can fail the batch rather than silently proceed. Ensure references to
producible_dep_names, completed, _SUCCESS_STATE, and _deps_satisfied are used to
locate and implement this check.
- Around line 579-624: _poll_until_done currently calls client.get_job_status()
without exception handling; wrap the get_job_status() call in a try/except,
catch Exception as e, and on error call self._stop_jobs_best_effort(...) to
best-effort stop any in-flight jobs for the current submission/job (use
submission_id and job_id as available), then re-raise the exception so upstream
(_submit_jobs_concurrently / done_future.result()) fails consistently; ensure
you still convert the status via self._to_status_str when no exception occurs.
---
Outside diff comments:
In `@nemo_skills/pipeline/utils/exp.py`:
- Around line 245-268: The multi-node Ray gate should use the resolved execution
backend instead of re-parsing cluster_config; capture the result of
get_execution_backend(cluster_config, with_ray=with_ray) (the same object used
to get_env_overrides) and ask that backend whether it's a Ray precreated cluster
(e.g., use backend.name / backend.is_precreated or backend.has_endpoint
semantics provided by that object) when computing is_ray_precreated, and remove
the fragile local backend_config string-parse; ensure the prior backend object
is used for the endpoint/precreated detection so valid configs inferred by
get_execution_backend won't be rejected.
---
Nitpick comments:
In `@tests/test_backends.py`:
- Around line 149-160: The test
test_ray_backend_forwards_required_env_vars_to_runtime_env only checks
runtime_env but must also assert that a dashboard-only Ray config does not
enable an embedded cluster; after creating backend =
get_execution_backend(cluster_config) and before/after calling
_build_runtime_env(), call backend.stage_metadata() and assert that it does not
request an embedded cluster (e.g., ensure stage_metadata() either lacks keys
like "create_embedded_cluster"/"provision_cluster" or returns metadata
indicating pre_provisioned/dashboard_only), so add an assertion referencing
backend.stage_metadata() to confirm no embedded Ray cluster is enabled for the
provided dashboard_url-only config.
🪄 Autofix (Beta)
❌ Autofix failed (check again to retry)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 0ced282f-6126-4320-8c35-2fa53987dbc1
📒 Files selected for processing (9)
cluster_configs/example-slurm-ray-k8s-precreated.yamlnemo_skills/pipeline/cli.pynemo_skills/pipeline/generate.pynemo_skills/pipeline/utils/__init__.pynemo_skills/pipeline/utils/backends.pynemo_skills/pipeline/utils/declarative.pynemo_skills/pipeline/utils/exp.pynemo_skills/pipeline/utils/ray_backend.pytests/test_backends.py
Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
Cross-experiment dependencies (resolved run_after handles) were dropped when queuing Ray Jobs, so a job could submit before a prerequisite in another experiment finished. Forward them alongside same-experiment task_dependencies from both the add_task and declarative pipeline paths. Tighten the queue's dependency resolver so a dependency produced by a job in the current batch is satisfied only once that job reaches SUCCEEDED (tracked by both task_name and nemo-run handle), preventing a dependent from starting early. A dependency that matches no job in the batch references a prior or cross-experiment job already gated by the prerequisite experiment's blocking submission, so it stays treated as satisfied and the resolver does not deadlock on a job it cannot observe. Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
…okup Resolve precreated_cluster as the OR of an explicit precreated_cluster flag, a Ray Client endpoint, and a dashboard_url, and pass that into every RayBackend construction (the ray branch, its kubernetes sub-branch, and the kubernetes-ray alias). The dashboard-only path now stages as a precreated cluster instead of requesting an embedded one, and an explicit precreated_cluster: true is no longer downgraded. Narrow the experiment-lookup fallback to catch only FileNotFoundError from Experiment.from_title so unrelated errors propagate instead of silently resolving by id, matching get_exp_handles. Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
The commented preflight example was nested under backend:, but preflight is read from the top level of the cluster config. Dedent it to a sibling of backend: so uncommenting it takes effect, and note the scope inline. Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
Extract the in-batch dependency-name set and the dependency predicate into pure static methods (_compute_in_batch_dep_names, _deps_satisfied) and the poll-failure cleanup into _handle_poll_failure, then call them from _submit_jobs_concurrently. Behavior is unchanged; this only makes the logic importable for unit tests. Add tests covering: the dep-name set includes both handles and task_names and ignores jobs missing them; an in-batch dependency blocks dependents until it reaches SUCCEEDED (under task_name or handle); a dependency matching no job in the batch is treated as satisfied; mixed in-batch/cross-batch deps stay blocked until the in-batch one succeeds; empty/none deps are satisfied; poll-failure cleanup stops submitted jobs once and chains the original error; and best-effort stop keeps going when one stop call raises. Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
… the multi-node gate stage_metadata() and the non-Slurm multi-node guard treated a precreated cluster as reachable only via an endpoint, so a dashboard_url-only (Jobs API) config still requested an embedded Ray cluster and could fail the multi-node check. Use the resolved backend and accept dashboard_url alongside endpoint in both places. Add a regression test that a dashboard_url-only config resolves as precreated and does not request an embedded cluster. Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_skills/pipeline/utils/ray_backend.py (1)
389-395:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPreserve
use_with_ray_clusteroutside the precreated-cluster special case.This branch now hardcodes embedded Ray metadata on for every non-precreated configuration, so callers that intentionally pass
use_with_ray_cluster=Falseare ignored.Suggested fix
- if self.precreated_cluster and (self.endpoint or self.dashboard_url): - should_use_embedded_ray_cluster = False - else: - should_use_embedded_ray_cluster = True + if self.precreated_cluster and (self.endpoint or self.dashboard_url): + should_use_embedded_ray_cluster = False + else: + should_use_embedded_ray_cluster = use_with_ray_clusterAs per coding guidelines, "Avoid cases when user-passed parameters are unused!"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nemo_skills/pipeline/utils/ray_backend.py` around lines 389 - 395, The code currently overrides the caller's use_with_ray_cluster by unconditionally setting should_use_embedded_ray_cluster=True for non-precreated clusters; instead preserve the incoming use_with_ray_cluster unless the precreated-cluster special case requires forcing it off. Update the logic around the call to super().stage_metadata so that should_use_embedded_ray_cluster is initialized from the method parameter use_with_ray_cluster and only set to False when self.precreated_cluster and (self.endpoint or self.dashboard_url) are true, then pass that variable into super().stage_metadata (symbols: use_with_ray_cluster parameter, should_use_embedded_ray_cluster, super().stage_metadata, self.precreated_cluster, self.endpoint, self.dashboard_url, metadata).Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@nemo_skills/pipeline/utils/ray_backend.py`:
- Around line 389-395: The code currently overrides the caller's
use_with_ray_cluster by unconditionally setting
should_use_embedded_ray_cluster=True for non-precreated clusters; instead
preserve the incoming use_with_ray_cluster unless the precreated-cluster special
case requires forcing it off. Update the logic around the call to
super().stage_metadata so that should_use_embedded_ray_cluster is initialized
from the method parameter use_with_ray_cluster and only set to False when
self.precreated_cluster and (self.endpoint or self.dashboard_url) are true, then
pass that variable into super().stage_metadata (symbols: use_with_ray_cluster
parameter, should_use_embedded_ray_cluster, super().stage_metadata,
self.precreated_cluster, self.endpoint, self.dashboard_url, metadata).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: ff21699c-1283-47d5-8dc9-8a3225bb9d5c
📒 Files selected for processing (3)
nemo_skills/pipeline/utils/exp.pynemo_skills/pipeline/utils/ray_backend.pytests/test_backends.py
🚧 Files skipped from review as they are similar to previous changes (2)
- nemo_skills/pipeline/utils/exp.py
- tests/test_backends.py
… metadata The non-precreated branch of stage_metadata hardcoded the embedded-cluster flag on, ignoring the caller's use_with_ray_cluster parameter. Initialize it from the parameter and only force it off for a precreated cluster reachable by endpoint or dashboard_url. Behavior is unchanged for the Ray callers (which pass True). Add a test that the flag is honored. Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
Docstrings only; no behavior, signature, or import changes. Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
A dashboard-only Jobs API cluster (name=ray + dashboard_url, no endpoint) is a valid precreated cluster, but preflight required an endpoint whenever precreated_cluster was set. Accept endpoint OR dashboard_url (matching stage_metadata and the multi-node gate); reachability checks default to whether an endpoint exists and are skipped when only dashboard_url is configured. Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
A run_after naming another experiment whose tasks already finished resolves to empty handles and falls through as a bare experiment-name string in internal_deps (the _reuse_exp path). nemo-run's exp.add asserts every dependency is a job in the current experiment, raising "Dependency <name> not found" on the Mode-3 declarative eval path. Filter internal_deps against the experiment's own job ids (only when exp.jobs is a concrete list) before exp.add; cross-experiment ordering is still honored via external deps and the Ray queue resolver. Adds a regression test that simulates exp.jobs growth -- the prior dependency test mocked get_exp_handles to always return non-empty, hiding this path. Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
_RayBackendShim was defined then immediately deleted and never referenced; RayBackend is already re-exported via the import and __all__. Remove the dead class and the misleading placeholder comment. Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
…_ENV A Ray job whose driver re-inits Ray with its own runtime_env (NeMo-RL GRPO/rollout via run_grpo_nemo_gym.py -> init_ray) fails with "Failed to merge the Job's runtime env ... because of a conflict" when a key (OPENAI_API_KEY, HF_HOME, ...) is present in both the job-level runtime_env we forward and the driver's ray.init runtime_env. Always set RAY_OVERRIDE_JOB_RUNTIME_ENV=1 in the submitted runtime_env so Ray merges (driver wins) instead of erroring. No effect on jobs that never re-init Ray. Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
…ME_ENV The runtime_env now always carries RAY_OVERRIDE_JOB_RUNTIME_ENV=1, so the normalize/filter test expects it in env_vars and the no-env case asserts the override dict instead of None. Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
…shboard_url A 'backend.name: ray' config with 'executor: none' targets a pre-provisioned Ray Jobs cluster and needs a dashboard URL to submit to. When none resolved, get_execution_backend silently treated it as a non-precreated (Ray-inside-Slurm) backend, which then failed deep in nemo-run with the opaque 'use_with_ray_cluster is only supported for SlurmExecutor'. Raise an actionable error up front pointing at backend.dashboard_url. Scoped to executor: none so the valid Ray-inside-Slurm path (executor: slurm, no dashboard) is unaffected. Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
…ckend For executor='none', _rewrite_local_paths rewrote /nemo_run/code to the driver's absolute nemo_skills install dir (e.g. /usr/local/lib/python3.10/dist-packages). That is correct for true run-on-this-host execution, but under the Ray Jobs API backend the command is submitted to a remote cluster whose image and Python version need not match the driver's, so that absolute path does not exist there -- the eval-generation job fails with "cd: /usr/local/lib/python3.10/dist-packages: No such file or directory". When backend.name == "ray", rewrite to relative (cwd) paths instead, mirroring the add_task path (which rewrites /nemo_run/code -> "./"). The job then runs from its working directory and nemo_skills resolves via the baked image's PYTHONPATH. Non-Ray behavior is unchanged. Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
…pported Ray's JobSubmissionClient.submit_job() gained entrypoint_label_selector in Ray 2.55; older clients (e.g. Ray 2.54 in the nemo-rl image) raise TypeError when it is passed, even as None. Build the submit_job kwargs incrementally and include entrypoint_label_selector only when a selector is configured and the installed Ray client's signature supports it (otherwise log a warning and proceed). The common no-selector path now works against any Ray version the cluster runs, while label routing is preserved on Ray >= 2.55. Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
gwarmstrong
left a comment
There was a problem hiding this comment.
Thanks for the contribution. I focused this review specifically on whether the new Ray backend stays off the non-Ray code path (i.e., no behavior change for users who never opt in). Mostly it does — no hard runtime dependency on ray is added (the ray / JobSubmissionClient imports are all lazy), and the default-backend resolution faithfully reproduces old behavior in most spots. But a few ray-motivated refactors leaked onto the shared path.
Requesting changes — a few of which are blocking: they are genuine behavior changes on the non-Ray path. See the inline comments for specifics.
Addresses review feedback that a few Ray-motivated refactors changed behavior for users who never opt into Ray. Each fix is gated so the default / Slurm-without-Ray path matches origin/main. - run_exp: honor --dry_run before the live mount check (it opens an SSH tunnel and stats sources), so a dry run stays offline-safe on every non-Ray caller. - declarative: scope the finished-cross-experiment internal_deps drop to the Ray backend; the default backend keeps nemo-run's fail-loud "Dependency not found" instead of silently dropping a dependency. - cli: drop the unrelated nemo_evaluator ModuleNotFoundError guard (reverts to the plain import; out of scope for this PR). - add_task / declarative: gate use_with_ray_cluster on executor == "slurm" so legacy with_ray on a non-Slurm executor no longer requests an embedded Ray cluster (nemo-run asserts SlurmExecutor). Precreated Jobs-API clusters never embed -- RayBackend.stage_metadata already enforces that. - _rewrite_local_paths / generate: parse the backend name through a shared get_backend_name() helper so the supported bare-string form (backend: ray) no longer raises AttributeError, and the duplicated parsing is removed. - add_task: only resolve command_images when a Ray backend is active (the list is consumed solely by the Ray queue); avoids a docker inspect per command on the non-Ray path for dockerfile: specs. - backends: import RayBackend lazily (only when a Ray backend is selected) via a local import + PEP 562 __getattr__, so the default path no longer imports the ~900-line ray_backend module. TYPE_CHECKING keeps the __all__ re-export valid. Tests: add regression coverage for the dry-run guard, the ray-vs-default cross-experiment dep handling, with_ray+non-Slurm metadata, Ray-only command-image resolution, the backend-name helpers, and the lazy import; reframe the existing cross-exp dep test as the Ray-backend case. Signed-off-by: Nick Gupta <nicgupta@nvidia.com>
|
Note Autofix is a beta feature. Expect some limitations and changes as we gather feedback and continue to improve it. Autofix skipped. No unresolved CodeRabbit review comments with fix instructions found. |
What
Add an additive Ray Jobs API execution backend to NeMo-Skills. With
backend.name: rayplus adashboard_url(Jobs API) orendpoint(Ray Client), a pipeline submits its commands to a pre-provisioned Ray cluster viaJobSubmissionClientinstead of launching through Slurm. It coexists with the existing executors — no behavior change when unused. Supports Kubernetes (KubeRay) and precreated clusters, optional entrypoint label selectors / image-derived placement labels, env-var forwarding into each job'sruntime_env, and a per-experiment Ray Jobs queue with dependency ordering.Design
ExecutionBackendhook interface andget_execution_backend(cluster_config)resolver inpipeline/utils/backends.py; the default backend preserves existing Slurm/local behavior. Wired intocli.pyandgenerate.py.RayBackend(pipeline/utils/ray_backend.py) submits, tracks, and cancels jobs viaJobSubmissionClient. Commands are queued on the experiment (_ns_ray_jobs_queue) and submitted/tracked/cancelled centrally instart_experiment, with bounded concurrency and cross-experiment dependency ordering.required_env_vars, etc.) is forwarded into each job'sruntime_env.env_vars, so endpoint/judge API keys andHF_TOKENreach jobs regardless of the Ray head's environment.entrypoint_label_selectorkeys take precedence over image-derived label selectors (static-wins, matching the siblingimage_label_keypath).with_rayflag resolves to the Ray backend.cluster_configs/.Testing
tests/test_backends.py(16 tests): backend resolution across config shapes, env-var forwarding, image-label-selector precedence, the dependency resolver, and poll-failure cleanup.ruff check/formatclean; DCO signed on all commits. The Jobs API submission path has been exercised end-to-end against a real pre-provisioned Ray cluster (hello-world submission, data-prep, SFT, and parallel eval fan-out).Review feedback
Recent commits address review (CodeRabbit) findings:
run_afterdependencies into the Ray queue (declarative.py,exp.py) so cross-experiment ordering is preserved.SUCCEEDEDfor in-batch dependencies (no premature start) while still treating truly cross-experiment dependencies as gated upstream — resolving the earlier false "dependency deadlock" without regressing it.dashboard_urlas a precreated-cluster signal; stop downgrading an explicitprecreated_cluster: true.FileNotFoundError.preflightblock to top-level config scope (where it is actually read).Summary by CodeRabbit
New Features
Bug Fixes