Skip to content

feat: add Flux integration and e2e tests#3561

Open
Amir380-A wants to merge 9 commits into
kubeflow:masterfrom
Amir380-A:test/flux-integration-e2e
Open

feat: add Flux integration and e2e tests#3561
Amir380-A wants to merge 9 commits into
kubeflow:masterfrom
Amir380-A:test/flux-integration-e2e

Conversation

@Amir380-A

Copy link
Copy Markdown
Contributor

What this PR does / why we need it:
Added Integration and E2E tests for Flux

Which issue(s) this PR fixes
Fixes #3256

Copilot AI review requested due to automatic review settings May 29, 2026 11:11
@google-oss-prow

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign electronic-waste for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions

Copy link
Copy Markdown

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
  • Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

@Amir380-A Amir380-A changed the title Add flux integration and e2e tests add flux integration and e2e tests May 29, 2026
@Amir380-A Amir380-A changed the title add flux integration and e2e tests fix add flux integration and e2e tests May 29, 2026
@Amir380-A Amir380-A changed the title fix add flux integration and e2e tests feat: add Flux integration and e2e tests May 29, 2026

@Sridhar1030 Sridhar1030 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work filling in the Flux test coverage gap — tests are passing and the structure looks solid. Had a few questions while reading through:


1. Integration test assertion style — any reason not to use MakeJobSetWrapper?

The MPI integration test builds an expected JobSet with MakeJobSetWrapper and compares the whole object via BeComparableTo. The Flux test fetches the JobSet and checks individual fields with manual for i := range loops. Both work, but the manual approach won't catch unexpected extra fields/containers sneaking in.

Was there a reason the wrapper approach didn't fit here, or would it be worth aligning with the MPI pattern?


2. What are the 3 replicated jobs?

g.Expect(jobSet.Spec.ReplicatedJobs).Should(gomega.HaveLen(3))

Only the node job is inspected after this. Could you add a quick comment noting what the other 2 are? Helps future readers understand what's expected.


3. Suspend/resume flow — intentionally skipped?

The MPI Complete/Failed condition tests verify the Suspended=TrueSuspended=False transition before checking completion. The Flux condition tests go straight from unsuspend to completion without checking the intermediate suspended state. Was that a deliberate simplification, or worth adding for parity?


4. PR title: feat:test:?

Since this only adds tests and a test helper with no production code changes, should the title be test: add Flux integration and e2e tests per Conventional Commits?


Overall this looks good — just wanted to flag these before it goes further. Thanks for picking up #3256!

@andreyvelich andreyvelich left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/ok-to-test
/assign @vsoch

@google-oss-prow

Copy link
Copy Markdown

@andreyvelich: GitHub didn't allow me to assign the following users: vsoch.

Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

Details

In response to this:

/ok-to-test
/assign @vsoch

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@andreyvelich andreyvelich left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding test coverage for the Flux runtime, @Amir380-A! 🙌 The integration tests are thorough and Flux-specific. Left a few inline suggestions below.

Note: this review was performed with the help of AI tools.

Comment thread test/e2e/e2e_test.go
})

ginkgo.When("Creating TrainJob to perform Flux workload", func() {
ginkgo.It("should create TrainJob with Flux runtime reference", func() {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other runtime e2e tests (jax/xgboost/deepspeed) reference a pre-installed *-distributed ClusterTrainingRuntime from manifests/base/runtimes/, but there is no flux_distributed.yaml, so this test builds the runtime inline. Consider adding a flux_distributed.yaml manifest and referencing it by name for consistency with the suite (and so users get a shippable Flux runtime).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started adding manifests/base/runtimes/flux_distributed.yaml, but unlike JAX/XGBoost/DeepSpeed, Flux doesn't have a dedicated runtime image. What image would you expect the manifest to use?

@vsoch vsoch Jun 27, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flux is installed automatically via the flux-installer init container (and associated container image), so your main container just needs the application it intends to run, and compiled with MPI if that is needed. If you have issues with compatibility, you can set an envar to choose a different base image. See:

// the flux view image is the base OS / version for the view to install flux
// ghcr.io/converged-computing/flux-view-rocky:arm-9
// ghcr.io/converged-computing/flux-view-rocky:arm-8
// ghcr.io/converged-computing/flux-view-rocky:tag-9
// ghcr.io/converged-computing/flux-view-rocky:tag-8
// ghcr.io/converged-computing/flux-view-ubuntu:tag-noble
// ghcr.io/converged-computing/flux-view-ubuntu:tag-jammy
// ghcr.io/converged-computing/flux-view-ubuntu:tag-focal
// ghcr.io/converged-computing/flux-view-ubuntu:arm-jammy
// ghcr.io/converged-computing/flux-view-ubuntu:arm-focal
// We use an ubuntu (more recent) default since it is common
"FLUX_VIEW_IMAGE": constants.FluxInstallerImage,

Comment thread test/e2e/e2e_test.go
})

ginkgo.When("Creating TrainJob to perform Flux workload", func() {
ginkgo.It("should create TrainJob with Flux runtime reference", func() {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The node runs ubuntu:22.04 with bash -c true, which exits 0 regardless of whether it ran under flux. To make this a meaningful Flux check, consider asserting on a flux marker in the pod logs (or using a workload whose success depends on flux run) so a regression running the command outside flux would actually fail.

Comment thread test/e2e/e2e_test.go Outdated
ginkgo.By("Create Flux ClusterTrainingRuntime")
fluxRuntimeName := fluxRuntime + "-" + ns.Name
fluxClusterRuntimeWrapper := testingutil.MakeClusterTrainingRuntimeWrapper(fluxRuntimeName)
for _, rJob := range fluxClusterRuntimeWrapper.Spec.Template.Spec.ReplicatedJobs {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: prefer index-based iteration for i := range ... here to match the convention used elsewhere in this file and the codebase (the value-copy form works only because the slice is replaced right after). Or construct the single Node ReplicatedJob directly.

Obj()
}

ginkgo.It("Should succeed to create TrainJob with Flux TrainingRuntime", func() {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Flux Validate rules (NumProcPerNode >= 1 and rejecting a user-supplied reserved flux-installer init container) aren't covered anywhere. Could you add rejection cases? test/integration/webhooks/trainjob_test.go has the pattern via BeForbiddenError()/BeInvalidError().

Signed-off-by: Amir380-A <62997533+Amir380-A@users.noreply.github.com>
@Amir380-A Amir380-A force-pushed the test/flux-integration-e2e branch from 26812f3 to 4f0d458 Compare June 27, 2026 09:52
Signed-off-by: Amir380-A <62997533+Amir380-A@users.noreply.github.com>
Signed-off-by: Amir380-A <62997533+Amir380-A@users.noreply.github.com>
@google-oss-prow google-oss-prow Bot added size/XL and removed size/L labels Jun 27, 2026
Signed-off-by: Amir380-A <62997533+Amir380-A@users.noreply.github.com>
Signed-off-by: Amir380-A <62997533+Amir380-A@users.noreply.github.com>
@Amir380-A Amir380-A force-pushed the test/flux-integration-e2e branch from e9e6097 to c1b9388 Compare June 27, 2026 12:29
@@ -0,0 +1,44 @@
{{- /*
Copyright 2026 The Kubeflow authors.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tariq-hasan @Amir380-A Please could you explore why our boilerplate checker doesn't return error on this file?
https://github.com/kubeflow/trainer/blob/master/Makefile#L200-L201
Our CI should validate that all new files don't have YEAR in their header.

Feel free to submit followup PR to fix it.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

YAML files are not included in the check. I am raising a PR for the fix.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also copyright header check is skipped if a year is detected. Should we enforce detection of year in all new files as part of the automated check?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess, for new files we should ensure that YEAR is not present, right?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. I will make it part of the automated check.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @andreyvelich! I have raised #3665 that adds support for yaml files and enables auto-detection of new files violating the year-less requirement.

Since the second feature requires an update to the boilerplate logic I wanted to get your feedback on the proposed decision logic in the PR description before I finalized the implementation for review.

I am happy to iterate based on feedback.

- jax_distributed.yaml
- xgboost_distributed.yaml
- torchtune
- flux_distributed.yaml

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if we want to include Flux runtime to default runtimes.
Could we only install it in E2Es for now?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Testing] Integration and E2E tests for Flux Framework

5 participants