fix(runtimes): guard against nil NumNodes dereference in Flux plugin#3668

Open
AdeshDeshmukh wants to merge 1 commit into kubeflow:master from AdeshDeshmukh:fix-flux-nil-numnodes-dereference

@AdeshDeshmukh AdeshDeshmukh commented Jun 30, 2026

What this PR does / why we need it:

Adds early validation in the Flux plugin's EnforceMLPolicy and Build entry points to prevent a nil pointer dereference panic when a user submits a TrainJob without spec.trainer or spec.trainer.numNodes.

The Flux plugin was the only runtime plugin missing nil checks on trainJob.Spec.Trainer before dereferencing its fields. All other runtime plugins (Torch, JAX, MPI, XGBoost, PlainML) guard with: trainJob.Spec.Trainer != nil && trainJob.Spec.Trainer.NumNodes != nil.

Previously, the two unprotected dereference sites at pkg/runtime/framework/plugins/flux/flux.go:414 and :452 would cause the controller to panic and crash-loop if a user omitted numNodes. Additionally, EnforceMLPolicy directly assigned trainJob.Spec.Trainer.Command and trainJob.Spec.Trainer.Args without a nil check.

This fix returns a clear error message instead of panicking, so the controller marks the TrainJob as Failed, which is standard Kubernetes controller behavior. Two regression tests verify both scenarios (nil Trainer and nil NumNodes).
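
For readers without the diff open, the guard described above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the plugin's actual code: the `Trainer`, `TrainJobSpec`, and `TrainJob` structs are simplified stand-ins for the real API types, and `validateTrainer` and `numNodes` are hypothetical helper names. Only the error message and the nil condition are taken from this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-ins for the TrainJob API types (assumption: the real
// Kubeflow structs carry many more fields and live in the trainer API package).
type Trainer struct {
	NumNodes *int32
	Command  []string
	Args     []string
}

type TrainJobSpec struct {
	Trainer *Trainer
}

type TrainJob struct {
	Spec TrainJobSpec
}

// Error text copied from the check quoted in this PR.
var errMissingNumNodes = errors.New(
	"the Flux runtime requires spec.trainer and spec.trainer.numNodes to be set")

// validateTrainer is a hypothetical helper: it reports an error instead of
// letting a later dereference panic.
func validateTrainer(trainJob *TrainJob) error {
	if trainJob.Spec.Trainer == nil || trainJob.Spec.Trainer.NumNodes == nil {
		return errMissingNumNodes
	}
	return nil
}

// numNodes (hypothetical) dereferences NumNodes only after validation passed.
func numNodes(trainJob *TrainJob) (int32, error) {
	if err := validateTrainer(trainJob); err != nil {
		return 0, err
	}
	return *trainJob.Spec.Trainer.NumNodes, nil
}

func main() {
	n := int32(4)
	jobs := []*TrainJob{
		{}, // spec.trainer omitted
		{Spec: TrainJobSpec{Trainer: &Trainer{}}},             // numNodes omitted
		{Spec: TrainJobSpec{Trainer: &Trainer{NumNodes: &n}}}, // fully specified
	}
	for _, job := range jobs {
		nodes, err := numNodes(job)
		fmt.Println(nodes, err)
	}
}
```

The first two jobs mirror the two regression scenarios named above (nil Trainer and nil NumNodes); in both, the caller receives an error it can surface on the TrainJob rather than a crashed controller.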

Which issue(s) this PR fixes:
Fixes #3667

Checklist:

  • Docs included if any changes are user facing

Copilot AI review requested due to automatic review settings June 30, 2026 06:01

Copilot AI left a comment

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tenzen-y for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: Adesh Deshmukh <adeshkd123@gmail.com>
@AdeshDeshmukh AdeshDeshmukh force-pushed the fix-flux-nil-numnodes-dereference branch from 208e696 to 0f26c67 Compare June 30, 2026 06:03
Comment on lines +136 to +138
if trainJob.Spec.Trainer == nil || trainJob.Spec.Trainer.NumNodes == nil {
	return fmt.Errorf("the Flux runtime requires spec.trainer and spec.trainer.numNodes to be set")
}

Member

This is a great catch, @AdeshDeshmukh. We should actually set the internal PodSet count based on the Runtime or TrainJob value. See the XGBoost plugin: https://github.com/AdeshDeshmukh/trainer/blob/0f26c67fea36c65480ef1478a6ee0a334f6645ed/pkg/runtime/framework/plugins/xgboost/xgboost.go#L87-L90
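
One possible reading of this suggestion, sketched under assumptions: keep whatever count the Runtime already provides on the PodSet, and override it only when the TrainJob sets `numNodes`, so a missing value is no longer an error. The linked XGBoost lines are not reproduced here; the `PodSet`, `Trainer`, and `TrainJob` types below are simplified, hypothetical stand-ins, and `applyNumNodes` is an illustrative name rather than a function from the repository.

```go
package main

import "fmt"

// Hypothetical, flattened stand-ins for the real API and runtime types.
type Trainer struct {
	NumNodes *int32
}

type TrainJob struct {
	Trainer *Trainer
}

type PodSet struct {
	Name  string
	Count *int32
}

// applyNumNodes overrides the PodSet count with the TrainJob value when one
// is set; otherwise the count inherited from the Runtime is left untouched.
func applyNumNodes(ps *PodSet, trainJob *TrainJob) {
	if ps == nil || trainJob == nil {
		return
	}
	if trainJob.Trainer != nil && trainJob.Trainer.NumNodes != nil {
		n := *trainJob.Trainer.NumNodes // copy so the PodSet does not alias the TrainJob field
		ps.Count = &n
	}
}

func main() {
	runtimeCount := int32(1)
	ps := &PodSet{Name: "node", Count: &runtimeCount}

	// No trainer section: the Runtime's count stays in effect.
	applyNumNodes(ps, &TrainJob{})
	fmt.Println(*ps.Count)

	// numNodes set on the TrainJob: it takes precedence.
	n := int32(4)
	applyNumNodes(ps, &TrainJob{Trainer: &Trainer{NumNodes: &n}})
	fmt.Println(*ps.Count)
}
```

Compared with returning an error, this design lets a TrainJob that omits `numNodes` run with the Runtime's default, which appears to be what the other plugins do.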

Member

cc @vsoch

@andreyvelich

Member

/ok-to-test
/retest

Development

Successfully merging this pull request may close these issues.

Flux plugin panics on nil NumNodes dereference

3 participants