feat: auto-create PodDisruptionBudget for gang-scheduled TrainJobs#3657
feat: auto-create PodDisruptionBudget for gang-scheduled TrainJobs#3657nikolauspschuetz wants to merge 1 commit into
Conversation
Distributed training is tightly coupled: every rank must stay available
for the job to make progress. A single voluntary eviction (node drain,
cluster-autoscaler scale-down) stalls the whole gang until it restarts
from the last checkpoint, and no PodDisruptionBudget was created anywhere
in the project.
Add a ComponentBuilderPlugin that, for gang-scheduled TrainJobs
(PodGroupPolicy set) with more than one trainer replica, builds a
policy/v1 PodDisruptionBudget with:
- minAvailable equal to the number of trainer replicas. Initializer
pods are excluded because they run to completion and stop being
Ready, which would otherwise make the PDB permanently unsatisfiable.
- a selector matching the TrainJob's JobSet pods
(jobset.sigs.k8s.io/jobset-name).
- a controller OwnerReference to the TrainJob for garbage collection.
The plugin also watches PodDisruptionBudgets so the controller reconciles
drift. This is plugin-only; no API or CRD change.
Fixes kubeflow#3304
Assisted-by: Claude Code (Anthropic)
Signed-off-by: Nikolaus Schuetz <nikolauspschuetz@gmail.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
🎉 Welcome to the Kubeflow Trainer! 🎉 Thanks for opening your first PR! We're happy to have you as part of our community 🚀 Here's what happens next:
Join the community:
Feel free to ask questions in the comments if you need any help or clarification! |
There was a problem hiding this comment.
Pull request overview
This PR adds a new runtime framework plugin that automatically creates and reconciles a Kubernetes policy/v1 PodDisruptionBudget (PDB) for gang-scheduled (PodGroupPolicy-enabled) multi-replica TrainJobs, preventing voluntary disruptions from evicting training pods mid-run.
Changes:
- Introduces a
PodDisruptionBudgetComponentBuilderPlugin+WatchExtensionPluginthat builds a PDB withminAvailable = trainer replica countand a selector keyed to the JobSet’s pods. - Adds unit and integration test coverage plus shared testing wrappers for constructing/validating PDB objects.
- Updates generated RBAC (manifests + Helm chart) to allow the controller to manage PDB resources.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| test/integration/controller/trainjob_controller_test.go | Adds an integration assertion that a PDB is created for the gang-scheduled TrainJob. |
| pkg/util/testing/wrapper.go | Adds a PDB test wrapper to construct expected policy/v1 PDB objects in tests. |
| pkg/runtime/framework/plugins/registry.go | Registers the new pdb plugin in the framework plugin registry. |
| pkg/runtime/framework/plugins/pdb/pdb.go | Implements the PDB builder + watch plugin for gang-scheduled multi-replica TrainJobs. |
| pkg/runtime/framework/plugins/pdb/pdb_test.go | Adds table-driven unit tests validating PDB creation and replica counting behavior. |
| pkg/runtime/framework/core/framework_test.go | Updates framework tests to include the new plugin and its expected built objects. |
| pkg/runtime/core/trainingruntime_test.go | Updates runtime object-building tests to expect the new PDB output. |
| pkg/runtime/core/clustertrainingruntime_test.go | Updates cluster runtime object-building tests to expect the new PDB output. |
| manifests/base/rbac/role.yaml | Adds controller RBAC permissions for policy/poddisruptionbudgets (generated). |
| charts/kubeflow-trainer/templates/rbac/clusterrole.yaml | Adds Helm chart RBAC permissions for policy/poddisruptionbudgets (generated). |
What this PR does
Adds a
PodDisruptionBudgetComponentBuilderPluginthat protects gang-scheduled TrainJobs from voluntary disruptions (node drains, cluster-autoscaler scale-downs). For a TrainJob with aPodGroupPolicyand more than one trainer replica, it builds apolicy/v1PodDisruptionBudget withminAvailable= trainer-replica count, a selector on the JobSet's pods (jobset.sigs.k8s.io/jobset-name), and a controllerOwnerReferenceto the TrainJob. The plugin also watches PDBs so the controller reconciles drift. Plugin-only — no API/CRD change.Fixes #3304
Why
Distributed training is tightly coupled: every rank must stay available for the job to make progress. A single voluntary eviction stalls the whole gang until it restarts from the last checkpoint, and no PDB was created anywhere in the project.
Design notes / open questions for reviewers
PodGroupPolicy != nil(gang-scheduled jobs), symmetric with the coscheduling plugin. The issue text says "any distributed TrainJob" — I went narrower/conservative on purpose; happy to broaden to all multi-pod jobs if preferred.minAvailable= trainer replicas only (initializers excluded — they run to completion and stop being Ready, which would otherwise make the PDB permanently unsatisfiable). This matches the issue's "total number of training replicas."TrainJob/TrainingRuntimeAPI change) is still open — this PR takes the plugin-only path. Glad to add a config gate instead if maintainers prefer opt-in.Testing
pdb_test.go); updated framework/runtime build tests; integration assertion intrainjob_controller_test.go.go build ./...,go vet,gofmt, and allpkg/+cmd/unit tests pass. Integration/e2e run in CI (envtest not bootstrapped locally).controller-gen(manifests/base/rbac/role.yaml+ chartclusterrole.yaml).🤖 Assisted by Claude Code (Anthropic); authored, reviewed, and verified by me.