feat: add Flux integration and e2e tests#3561
Conversation
Signed-off-by: Amir380-A <62997533+Amir380-A@users.noreply.github.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
🎉 Welcome to the Kubeflow Trainer! 🎉 Thanks for opening your first PR! We're happy to have you as part of our community 🚀 Here's what happens next:
Join the community:
Feel free to ask questions in the comments if you need any help or clarification! |
There was a problem hiding this comment.
Nice work filling in the Flux test coverage gap — tests are passing and the structure looks solid. Had a few questions while reading through:
1. Integration test assertion style — any reason not to use MakeJobSetWrapper?
The MPI integration test builds an expected JobSet with MakeJobSetWrapper and compares the whole object via BeComparableTo. The Flux test fetches the JobSet and checks individual fields with manual for i := range loops. Both work, but the manual approach won't catch unexpected extra fields/containers sneaking in.
Was there a reason the wrapper approach didn't fit here, or would it be worth aligning with the MPI pattern?
2. What are the 3 replicated jobs?
g.Expect(jobSet.Spec.ReplicatedJobs).Should(gomega.HaveLen(3))Only the node job is inspected after this. Could you add a quick comment noting what the other 2 are? Helps future readers understand what's expected.
3. Suspend/resume flow — intentionally skipped?
The MPI Complete/Failed condition tests verify the Suspended=True → Suspended=False transition before checking completion. The Flux condition tests go straight from unsuspend to completion without checking the intermediate suspended state. Was that a deliberate simplification, or worth adding for parity?
4. PR title: feat: → test:?
Since this only adds tests and a test helper with no production code changes, should the title be test: add Flux integration and e2e tests per Conventional Commits?
Overall this looks good — just wanted to flag these before it goes further. Thanks for picking up #3256!
andreyvelich
left a comment
There was a problem hiding this comment.
/ok-to-test
/assign @vsoch
|
@andreyvelich: GitHub didn't allow me to assign the following users: vsoch. Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
andreyvelich
left a comment
There was a problem hiding this comment.
Thanks for adding test coverage for the Flux runtime, @Amir380-A! 🙌 The integration tests are thorough and Flux-specific. Left a few inline suggestions below.
Note: this review was performed with the help of AI tools.
| }) | ||
|
|
||
| ginkgo.When("Creating TrainJob to perform Flux workload", func() { | ||
| ginkgo.It("should create TrainJob with Flux runtime reference", func() { |
There was a problem hiding this comment.
The other runtime e2e tests (jax/xgboost/deepspeed) reference a pre-installed *-distributed ClusterTrainingRuntime from manifests/base/runtimes/, but there is no flux_distributed.yaml, so this test builds the runtime inline. Consider adding a flux_distributed.yaml manifest and referencing it by name for consistency with the suite (and so users get a shippable Flux runtime).
There was a problem hiding this comment.
I started adding manifests/base/runtimes/flux_distributed.yaml, but unlike JAX/XGBoost/DeepSpeed, Flux doesn't have a dedicated runtime image. What image would you expect the manifest to use?
There was a problem hiding this comment.
Flux is installed automatically via the flux-installer init container (and associated container image), so your main container just needs the application it intends to run, and compiled with MPI if that is needed. If you have issues with compatibility, you can set an envar to choose a different base image. See:
trainer/pkg/runtime/framework/plugins/flux/flux.go
Lines 63 to 74 in 558798f
| }) | ||
|
|
||
| ginkgo.When("Creating TrainJob to perform Flux workload", func() { | ||
| ginkgo.It("should create TrainJob with Flux runtime reference", func() { |
There was a problem hiding this comment.
The node runs ubuntu:22.04 with bash -c true, which exits 0 regardless of whether it ran under flux. To make this a meaningful Flux check, consider asserting on a flux marker in the pod logs (or using a workload whose success depends on flux run) so a regression running the command outside flux would actually fail.
| ginkgo.By("Create Flux ClusterTrainingRuntime") | ||
| fluxRuntimeName := fluxRuntime + "-" + ns.Name | ||
| fluxClusterRuntimeWrapper := testingutil.MakeClusterTrainingRuntimeWrapper(fluxRuntimeName) | ||
| for _, rJob := range fluxClusterRuntimeWrapper.Spec.Template.Spec.ReplicatedJobs { |
There was a problem hiding this comment.
nit: prefer index-based iteration for i := range ... here to match the convention used elsewhere in this file and the codebase (the value-copy form works only because the slice is replaced right after). Or construct the single Node ReplicatedJob directly.
| Obj() | ||
| } | ||
|
|
||
| ginkgo.It("Should succeed to create TrainJob with Flux TrainingRuntime", func() { |
There was a problem hiding this comment.
The Flux Validate rules (NumProcPerNode >= 1 and rejecting a user-supplied reserved flux-installer init container) aren't covered anywhere. Could you add rejection cases? test/integration/webhooks/trainjob_test.go has the pattern via BeForbiddenError()/BeInvalidError().
Signed-off-by: Amir380-A <62997533+Amir380-A@users.noreply.github.com>
26812f3 to
4f0d458
Compare
Signed-off-by: Amir380-A <62997533+Amir380-A@users.noreply.github.com>
Signed-off-by: Amir380-A <62997533+Amir380-A@users.noreply.github.com>
Signed-off-by: Amir380-A <62997533+Amir380-A@users.noreply.github.com>
Signed-off-by: Amir380-A <62997533+Amir380-A@users.noreply.github.com>
e9e6097 to
c1b9388
Compare
| @@ -0,0 +1,44 @@ | |||
| {{- /* | |||
| Copyright 2026 The Kubeflow authors. | |||
There was a problem hiding this comment.
@tariq-hasan @Amir380-A Please could you explore why our boilerplate checker doesn't return error on this file?
https://github.com/kubeflow/trainer/blob/master/Makefile#L200-L201
Our CI should validate that all new files don't have YEAR in their header.
Feel free to submit followup PR to fix it.
There was a problem hiding this comment.
YAML files are not included in the check. I am raising a PR for the fix.
There was a problem hiding this comment.
Also copyright header check is skipped if a year is detected. Should we enforce detection of year in all new files as part of the automated check?
There was a problem hiding this comment.
I guess, for new files we should ensure that YEAR is not present, right?
There was a problem hiding this comment.
Okay. I will make it part of the automated check.
There was a problem hiding this comment.
Hi @andreyvelich! I have raised #3665 that adds support for yaml files and enables auto-detection of new files violating the year-less requirement.
Since the second feature requires an update to the boilerplate logic I wanted to get your feedback on the proposed decision logic in the PR description before I finalized the implementation for review.
I am happy to iterate based on feedback.
| - jax_distributed.yaml | ||
| - xgboost_distributed.yaml | ||
| - torchtune | ||
| - flux_distributed.yaml |
There was a problem hiding this comment.
I am not sure if we want to include Flux runtime to default runtimes.
Could we only install it in E2Es for now?
What this PR does / why we need it:
Added Integration and E2E tests for Flux
Which issue(s) this PR fixes
Fixes #3256