Skip to content

feat: auto-create PodDisruptionBudget for gang-scheduled TrainJobs#3657

Open
nikolauspschuetz wants to merge 1 commit into
kubeflow:masterfrom
nikolauspschuetz:feat/auto-pdb-gang-trainjob
Open

feat: auto-create PodDisruptionBudget for gang-scheduled TrainJobs#3657
nikolauspschuetz wants to merge 1 commit into
kubeflow:masterfrom
nikolauspschuetz:feat/auto-pdb-gang-trainjob

Conversation

@nikolauspschuetz

Copy link
Copy Markdown

What this PR does

Adds a PodDisruptionBudget ComponentBuilderPlugin that protects gang-scheduled TrainJobs from voluntary disruptions (node drains, cluster-autoscaler scale-downs). For a TrainJob with a PodGroupPolicy and more than one trainer replica, it builds a policy/v1 PodDisruptionBudget with minAvailable = trainer-replica count, a selector on the JobSet's pods (jobset.sigs.k8s.io/jobset-name), and a controller OwnerReference to the TrainJob. The plugin also watches PDBs so the controller reconciles drift. Plugin-only — no API/CRD change.

Fixes #3304

Why

Distributed training is tightly coupled: every rank must stay available for the job to make progress. A single voluntary eviction stalls the whole gang until it restarts from the last checkpoint, and no PDB was created anywhere in the project.

Design notes / open questions for reviewers

  • Trigger: gated on PodGroupPolicy != nil (gang-scheduled jobs), symmetric with the coscheduling plugin. The issue text says "any distributed TrainJob" — I went narrower/conservative on purpose; happy to broaden to all multi-pod jobs if preferred.
  • minAvailable = trainer replicas only (initializers excluded — they run to completion and stop being Ready, which would otherwise make the PDB permanently unsatisfiable). This matches the issue's "total number of training replicas."
  • @tariq-hasan's question on the issue (plugin vs. TrainJob/TrainingRuntime API change) is still open — this PR takes the plugin-only path. Glad to add a config gate instead if maintainers prefer opt-in.

Testing

  • Table-driven unit tests for the plugin (pdb_test.go); updated framework/runtime build tests; integration assertion in trainjob_controller_test.go.
  • Locally: go build ./..., go vet, gofmt, and all pkg/+cmd/ unit tests pass. Integration/e2e run in CI (envtest not bootstrapped locally).
  • RBAC regenerated via controller-gen (manifests/base/rbac/role.yaml + chart clusterrole.yaml).

🤖 Assisted by Claude Code (Anthropic); authored, reviewed, and verified by me.

Distributed training is tightly coupled: every rank must stay available
for the job to make progress. A single voluntary eviction (node drain,
cluster-autoscaler scale-down) stalls the whole gang until it restarts
from the last checkpoint, and no PodDisruptionBudget was created anywhere
in the project.

Add a ComponentBuilderPlugin that, for gang-scheduled TrainJobs
(PodGroupPolicy set) with more than one trainer replica, builds a
policy/v1 PodDisruptionBudget with:
  - minAvailable equal to the number of trainer replicas. Initializer
    pods are excluded because they run to completion and stop being
    Ready, which would otherwise make the PDB permanently unsatisfiable.
  - a selector matching the TrainJob's JobSet pods
    (jobset.sigs.k8s.io/jobset-name).
  - a controller OwnerReference to the TrainJob for garbage collection.

The plugin also watches PodDisruptionBudgets so the controller reconciles
drift. This is plugin-only; no API or CRD change.

Fixes kubeflow#3304

Assisted-by: Claude Code (Anthropic)
Signed-off-by: Nikolaus Schuetz <nikolauspschuetz@gmail.com>
Copilot AI review requested due to automatic review settings June 29, 2026 01:47
@google-oss-prow

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign andreyvelich for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions

Copy link
Copy Markdown

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
  • Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a new runtime framework plugin that automatically creates and reconciles a Kubernetes policy/v1 PodDisruptionBudget (PDB) for gang-scheduled (PodGroupPolicy-enabled) multi-replica TrainJobs, preventing voluntary disruptions from evicting training pods mid-run.

Changes:

  • Introduces a PodDisruptionBudget ComponentBuilderPlugin + WatchExtensionPlugin that builds a PDB with minAvailable = trainer replica count and a selector keyed to the JobSet’s pods.
  • Adds unit and integration test coverage plus shared testing wrappers for constructing/validating PDB objects.
  • Updates generated RBAC (manifests + Helm chart) to allow the controller to manage PDB resources.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated no comments.

Show a summary per file
File Description
test/integration/controller/trainjob_controller_test.go Adds an integration assertion that a PDB is created for the gang-scheduled TrainJob.
pkg/util/testing/wrapper.go Adds a PDB test wrapper to construct expected policy/v1 PDB objects in tests.
pkg/runtime/framework/plugins/registry.go Registers the new pdb plugin in the framework plugin registry.
pkg/runtime/framework/plugins/pdb/pdb.go Implements the PDB builder + watch plugin for gang-scheduled multi-replica TrainJobs.
pkg/runtime/framework/plugins/pdb/pdb_test.go Adds table-driven unit tests validating PDB creation and replica counting behavior.
pkg/runtime/framework/core/framework_test.go Updates framework tests to include the new plugin and its expected built objects.
pkg/runtime/core/trainingruntime_test.go Updates runtime object-building tests to expect the new PDB output.
pkg/runtime/core/clustertrainingruntime_test.go Updates cluster runtime object-building tests to expect the new PDB output.
manifests/base/rbac/role.yaml Adds controller RBAC permissions for policy/poddisruptionbudgets (generated).
charts/kubeflow-trainer/templates/rbac/clusterrole.yaml Adds Helm chart RBAC permissions for policy/poddisruptionbudgets (generated).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Automatic PodDisruptionBudget for distributed TrainJobs

2 participants