Skip to content

fix(operator): surface runtime resource reconciliation errors in TrainJob status#3672

Open
AdeshDeshmukh wants to merge 1 commit into
kubeflow:masterfrom
AdeshDeshmukh:fix-resource-reconciliation-status
Open

fix(operator): surface runtime resource reconciliation errors in TrainJob status#3672
AdeshDeshmukh wants to merge 1 commit into
kubeflow:masterfrom
AdeshDeshmukh:fix-resource-reconciliation-status

Conversation

@AdeshDeshmukh

Copy link
Copy Markdown

What this PR does / why we need it:

When reconcileObjects() fails during TrainJob reconciliation (either during runtime.NewObjects() or the SSA client.Apply()), the error was previously only emitted as a Kubernetes Event via r.recorder.Eventf(). The TrainJob status was not updated to reflect the failure, leaving users unable to diagnose the issue via kubectl get trainjob — the job would appear stuck in a non-terminal state (e.g., "Pending") with no indication of what went wrong.

Changes made:

  1. New condition reason constant (pkg/apis/trainer/v1alpha1/trainjob_types.go): Added TrainJobResourcesCreationFailedReason = "ResourcesCreationFailed" following the existing convention for condition reason constants.

  2. Surface error in TrainJob status (pkg/controller/trainjob_controller.go): Call the existing setFailedCondition() helper when reconcileObjects() fails, mirroring the established pattern used for unsupported runtimes (TrainingRuntimeNotSupported) and deadline exceeded (DeadlineExceeded). The Kubernetes Event is preserved for real-time notification.

  3. Fix latent overwrite bug in setTrainJobStatus (pkg/controller/trainjob_controller.go): Updated setTrainJobStatus() to generically preserve any pre-existing Failed condition instead of using a fragile reason-specific whitelist (which only preserved DeadlineExceeded). In a partial SSA apply failure, the runtime could return a non-nil status that would silently overwrite and erase the Failed condition. This fix ensures the condition survives regardless of the reason.

  4. Unit test (pkg/controller/trainjob_controller_test.go): Added test coverage for setTrainJobStatus() verifying that Failed conditions with ResourcesCreationFailed and DeadlineExceeded reasons are preserved with both nil and non-nil runtime status returns.

Resolves the TODO left by astefanutti at trainjob_controller.go:125.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, ... format, will close the issue(s) when PR gets merged):

Fixes #3671

Checklist:

  • Docs included if any changes are user facing

…nJob status

Signed-off-by: Adesh Deshmukh <adeshkd123@gmail.com>
Copilot AI review requested due to automatic review settings June 30, 2026 09:30
@google-oss-prow

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign terrytangyuan for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TrainJob status not updated when runtime resource reconciliation fails

2 participants