fix(runtimes): guard against nil NumNodes dereference in Flux plugin#3668
fix(runtimes): guard against nil NumNodes dereference in Flux plugin#3668AdeshDeshmukh wants to merge 1 commit into
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Signed-off-by: Adesh Deshmukh <adeshkd123@gmail.com>
208e696 to
0f26c67
Compare
| if trainJob.Spec.Trainer == nil || trainJob.Spec.Trainer.NumNodes == nil { | ||
| return fmt.Errorf("the Flux runtime requires spec.trainer and spec.trainer.numNodes to be set") | ||
| } |
There was a problem hiding this comment.
This is great catch @AdeshDeshmukh. We should actually set internal PodSet count based on Runtime or TrainJob value. See XGBoost plugin: https://github.com/AdeshDeshmukh/trainer/blob/0f26c67fea36c65480ef1478a6ee0a334f6645ed/pkg/runtime/framework/plugins/xgboost/xgboost.go#L87-L90
|
/ok-to-test |
What this PR does / why we need it:
Adds early validation in the Flux plugin's
EnforceMLPolicyandBuildentry points to prevent a nil pointer dereference panic when a user submits a TrainJob withoutspec.trainerorspec.trainer.numNodes.The Flux plugin was the only runtime plugin missing nil checks on
trainJob.Spec.Trainerbefore dereferencing its fields. All other runtime plugins (Torch, JAX, MPI, XGBoost, PlainML) guard with:trainJob.Spec.Trainer != nil && trainJob.Spec.Trainer.NumNodes != nil.Previously, the two unprotected dereference sites at
pkg/runtime/framework/plugins/flux/flux.go:414and:452would cause the controller to panic and crash-loop if a user omittednumNodes. Additionally,EnforceMLPolicydirectly assignedtrainJob.Spec.Trainer.CommandandtrainJob.Spec.Trainer.Argswithout a nil check.This fix returns a clear error message instead of panicking, so the controller marks the TrainJob as Failed — standard Kubernetes controller behavior. Two regression tests verify both scenarios (nil
Trainerand nilNumNodes).Which issue(s) this PR fixes:
Fixes #3667
Checklist: