feat: auto-create PodDisruptionBudget for gang-scheduled TrainJobs by nikolauspschuetz · Pull Request #3657 · kubeflow/trainer

nikolauspschuetz · 2026-06-29T01:47:45Z

What this PR does

Adds a PodDisruptionBudget ComponentBuilderPlugin that protects gang-scheduled TrainJobs from voluntary disruptions (node drains, cluster-autoscaler scale-downs). For a TrainJob with a PodGroupPolicy and more than one trainer replica, it builds a policy/v1 PodDisruptionBudget with minAvailable = trainer-replica count, a selector on the JobSet's pods (jobset.sigs.k8s.io/jobset-name), and a controller OwnerReference to the TrainJob. The plugin also watches PDBs so the controller reconciles drift. Plugin-only — no API/CRD change.
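The shape of the object described above can be sketched as follows. This is a minimal illustration, not the PR's code: the struct types are simplified stand-ins for the real `k8s.io/api/policy/v1` and `metav1` types so the example runs with the standard library only, `buildPDB` is a hypothetical helper name, and the choices that the PDB and JobSet share the TrainJob's name and that the owner's API version is `trainer.kubeflow.org/v1alpha1` are assumptions.

```go
package main

import "fmt"

// Simplified stand-ins for metav1.OwnerReference and policy/v1
// PodDisruptionBudget. The real plugin would use the upstream API types.
type OwnerReference struct {
	APIVersion string
	Kind       string
	Name       string
	Controller bool
}

type PodDisruptionBudget struct {
	Name            string
	Namespace       string
	MinAvailable    int32
	MatchLabels     map[string]string
	OwnerReferences []OwnerReference
}

// jobSetNameLabel is the selector key named in the PR description.
const jobSetNameLabel = "jobset.sigs.k8s.io/jobset-name"

// buildPDB is a hypothetical helper. It assumes the JobSet and the PDB are
// both named after the TrainJob.
func buildPDB(trainJobName, namespace string, trainerReplicas int32) PodDisruptionBudget {
	return PodDisruptionBudget{
		Name:      trainJobName,
		Namespace: namespace,
		// Every trainer replica must stay available.
		MinAvailable: trainerReplicas,
		// Select the pods that belong to the TrainJob's JobSet.
		MatchLabels: map[string]string{jobSetNameLabel: trainJobName},
		// A controller owner reference lets garbage collection remove the
		// PDB together with the TrainJob.
		OwnerReferences: []OwnerReference{{
			APIVersion: "trainer.kubeflow.org/v1alpha1", // assumed API version
			Kind:       "TrainJob",
			Name:       trainJobName,
			Controller: true,
		}},
	}
}

func main() {
	pdb := buildPDB("llm-finetune", "ml", 4)
	fmt.Printf("%+v\n", pdb)
}
```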

Fixes #3304

Why

Distributed training is tightly coupled: every rank must stay available for the job to make progress. A single voluntary eviction stalls the whole gang until it restarts from the last checkpoint, and no PDB was created anywhere in the project.
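To make the effect of `minAvailable` concrete, here is a small sketch of the budget arithmetic for a `minAvailable` PDB. It is a simplification of what the Kubernetes disruption controller computes (it ignores unhealthy-pod policies and percentage values), and `disruptionsAllowed` is a hypothetical function name used only for illustration.

```go
package main

import "fmt"

// disruptionsAllowed returns how many voluntary evictions a minAvailable
// budget permits: the number of healthy pods above the floor, never negative.
func disruptionsAllowed(currentHealthy, minAvailable int32) int32 {
	if d := currentHealthy - minAvailable; d > 0 {
		return d
	}
	return 0
}

func main() {
	// Four healthy trainer pods, minAvailable = 4: no eviction is allowed,
	// so a node drain has to wait instead of breaking the gang.
	fmt.Println(disruptionsAllowed(4, 4))
	// With minAvailable = 3 the same job would tolerate one eviction.
	fmt.Println(disruptionsAllowed(4, 3))
}
```

With `minAvailable` equal to the trainer replica count, the budget is zero whenever all ranks are healthy, which is the protection this PR aims for.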

Design notes / open questions for reviewers

- Trigger: gated on PodGroupPolicy != nil (gang-scheduled jobs), symmetric with the coscheduling plugin. The issue text says "any distributed TrainJob"; I went narrower/conservative on purpose and am happy to broaden to all multi-pod jobs if preferred.
- minAvailable = trainer replicas only (initializers excluded: they run to completion and stop being Ready, which would otherwise make the PDB permanently unsatisfiable). This matches the issue's "total number of training replicas."
- @tariq-hasan's question on the issue (plugin vs. TrainJob/TrainingRuntime API change) is still open; this PR takes the plugin-only path. Glad to add a config gate instead if maintainers prefer opt-in.
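The trigger and replica-counting rules in the notes above can be sketched as below. The `PodSet` type, its `Ancestor` field, the `"trainer"` value, and the function names are illustrative assumptions, not the plugin's actual types; the point is only the decision logic: no PodGroupPolicy means no PDB, initializer pods are not counted, and a single trainer replica does not get a PDB.

```go
package main

import "fmt"

// PodSet is a hypothetical, simplified view of one group of pods in a
// TrainJob. Ancestor marks the role the group plays.
type PodSet struct {
	Name     string
	Ancestor string // assumed values: "trainer", "dataset-initializer", ...
	Count    int32
}

// trainerReplicas sums only trainer pods. Initializers are skipped because
// they run to completion and would never count as Ready afterwards.
func trainerReplicas(podSets []PodSet) int32 {
	var n int32
	for _, ps := range podSets {
		if ps.Ancestor == "trainer" {
			n += ps.Count
		}
	}
	return n
}

// pdbMinAvailable reports the minAvailable value and whether a PDB should
// be built at all.
func pdbMinAvailable(hasPodGroupPolicy bool, podSets []PodSet) (int32, bool) {
	if !hasPodGroupPolicy {
		return 0, false // not gang-scheduled: the plugin does nothing
	}
	n := trainerReplicas(podSets)
	return n, n > 1 // a single replica gets no PDB
}

func main() {
	podSets := []PodSet{
		{Name: "dataset-initializer", Ancestor: "dataset-initializer", Count: 1},
		{Name: "node", Ancestor: "trainer", Count: 4},
	}
	minAvailable, ok := pdbMinAvailable(true, podSets)
	fmt.Println(minAvailable, ok)
}
```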

Testing

- Table-driven unit tests for the plugin (pdb_test.go); updated framework/runtime build tests; integration assertion in trainjob_controller_test.go.
- Locally: go build ./..., go vet, gofmt, and all pkg/+cmd/ unit tests pass. Integration/e2e run in CI (envtest not bootstrapped locally).
- RBAC regenerated via controller-gen (manifests/base/rbac/role.yaml + chart clusterrole.yaml).

🤖 Assisted by Claude Code (Anthropic); authored, reviewed, and verified by me.

Distributed training is tightly coupled: every rank must stay available for the job to make progress. A single voluntary eviction (node drain, cluster-autoscaler scale-down) stalls the whole gang until it restarts from the last checkpoint, and no PodDisruptionBudget was created anywhere in the project.

Add a ComponentBuilderPlugin that, for gang-scheduled TrainJobs (PodGroupPolicy set) with more than one trainer replica, builds a policy/v1 PodDisruptionBudget with:

- minAvailable equal to the number of trainer replicas. Initializer pods are excluded because they run to completion and stop being Ready, which would otherwise make the PDB permanently unsatisfiable.
- a selector matching the TrainJob's JobSet pods (jobset.sigs.k8s.io/jobset-name).
- a controller OwnerReference to the TrainJob for garbage collection.

The plugin also watches PodDisruptionBudgets so the controller reconciles drift.

This is plugin-only; no API or CRD change.

Fixes kubeflow#3304

Assisted-by: Claude Code (Anthropic)
Signed-off-by: Nikolaus Schuetz <nikolauspschuetz@gmail.com>

google-oss-prow · 2026-06-29T01:47:51Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign andreyvelich for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

github-actions · 2026-06-29T01:47:52Z

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Join the community:

Slack: Join our #kubeflow-trainer Slack channel.
Meetings: Attend the Kubeflow AutoML and Training Working Group bi-weekly meetings.

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

Copilot

Pull request overview

This PR adds a new runtime framework plugin that automatically creates and reconciles a Kubernetes policy/v1 PodDisruptionBudget (PDB) for gang-scheduled (PodGroupPolicy-enabled) multi-replica TrainJobs, preventing voluntary disruptions from evicting training pods mid-run.

Changes:

Introduces a PodDisruptionBudget ComponentBuilderPlugin + WatchExtensionPlugin that builds a PDB with minAvailable = trainer replica count and a selector keyed to the JobSet’s pods.
Adds unit and integration test coverage plus shared testing wrappers for constructing/validating PDB objects.
Updates generated RBAC (manifests + Helm chart) to allow the controller to manage PDB resources.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated no comments.

Show a summary per file

| File | Description |
| --- | --- |
| test/integration/controller/trainjob_controller_test.go | Adds an integration assertion that a PDB is created for the gang-scheduled TrainJob. |
| pkg/util/testing/wrapper.go | Adds a PDB test wrapper to construct expected `policy/v1` PDB objects in tests. |
| pkg/runtime/framework/plugins/registry.go | Registers the new `pdb` plugin in the framework plugin registry. |
| pkg/runtime/framework/plugins/pdb/pdb.go | Implements the PDB builder + watch plugin for gang-scheduled multi-replica TrainJobs. |
| pkg/runtime/framework/plugins/pdb/pdb_test.go | Adds table-driven unit tests validating PDB creation and replica counting behavior. |
| pkg/runtime/framework/core/framework_test.go | Updates framework tests to include the new plugin and its expected built objects. |
| pkg/runtime/core/trainingruntime_test.go | Updates runtime object-building tests to expect the new PDB output. |
| pkg/runtime/core/clustertrainingruntime_test.go | Updates cluster runtime object-building tests to expect the new PDB output. |
| manifests/base/rbac/role.yaml | Adds controller RBAC permissions for `policy/poddisruptionbudgets` (generated). |
| charts/kubeflow-trainer/templates/rbac/clusterrole.yaml | Adds Helm chart RBAC permissions for `policy/poddisruptionbudgets` (generated). |

Copilot AI review requested due to automatic review settings June 29, 2026 01:47

google-oss-prow Bot requested review from akshaychitneni and kuizhiqing June 29, 2026 01:47

google-oss-prow Bot added the size/L label Jun 29, 2026

Copilot started reviewing on behalf of nikolauspschuetz June 29, 2026 01:48 View session

Copilot AI reviewed Jun 29, 2026


nikolauspschuetz mentioned this pull request Jun 29, 2026

Automatic PodDisruptionBudget for distributed TrainJobs #3304

Open

nikolauspschuetz wants to merge 1 commit into kubeflow:master from nikolauspschuetz:feat/auto-pdb-gang-trainjob


2 participants
