
Add SLES support for AMD gpu-operator#365

Merged
sajmera-pensando merged 4 commits into
ROCm:mainfrom
Priyankasaggu11929:enable-sles-support
May 19, 2026


@Priyankasaggu11929
Contributor

@Priyankasaggu11929 Priyankasaggu11929 commented Oct 23, 2025

Motivation

This PR adds support for SUSE Linux Enterprise Server (SLES) 15 SP7+ to the AMD GPU operator.

Technical Details

  • abbdaa8 - add support for detecting SLES nodes and automatically selecting appropriate AMD GPU driver versions

    • add new slesCMNameMapper to parse SLES version strings like 'SUSE Linux Enterprise Server 15 SP7' to 'sles-15.7'
    • add SLESDefaultDriverVersionsMapper to select driver versions
    • register both 'sles' and 'suse' identifiers in mappers
  • 669b888 - add a SLES Dockerfile template (DockerfileTemplate.sles) that uses the prebuilt SUSE AMD GPU driver image. I have skipped the GIM Dockerfile template for SLES for now and will tackle it once this goes through.

    • also embed the template via go:embed and add SLES case logic
  • c2dce44 - docs: update example/deviceconfig_example.yaml <- dropped

  • 017fcb5 - use "registry.suse.com" as the default base image registry if OS == "sles"

    • a user-specified BaseImageRegistry still takes precedence
    • also extend tests in internal/kmmmodule/kmmmodule_test.go to cover the above changes in the resolveDockerfile func
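
The version-string mapping described for slesCMNameMapper could be sketched roughly as below. This is only an illustration: `parseSLESName`, the regular expression, and the choice to map a base release without a service pack to ".0" are assumptions, not the code from this PR.

```go
package main

import (
	"fmt"
	"regexp"
)

// slesVersionRe matches strings such as "SUSE Linux Enterprise Server 15 SP7",
// "SUSE Linux Enterprise Server 15-SP7", or "SUSE Linux Enterprise Server 15".
var slesVersionRe = regexp.MustCompile(`(?i)SUSE Linux Enterprise Server\s+(\d+)(?:[\s-]+SP(\d+))?`)

// parseSLESName is a hypothetical helper (not the operator's slesCMNameMapper).
// It converts an OS image string into the form "sles-<major>.<servicepack>".
func parseSLESName(osImage string) (string, error) {
	m := slesVersionRe.FindStringSubmatch(osImage)
	if m == nil {
		return "", fmt.Errorf("unrecognized SLES version string: %q", osImage)
	}
	major, sp := m[1], m[2]
	if sp == "" {
		// Assumption: a base release without a service pack maps to ".0".
		sp = "0"
	}
	return fmt.Sprintf("sles-%s.%s", major, sp), nil
}

func main() {
	inputs := []string{
		"SUSE Linux Enterprise Server 15 SP7",
		"SUSE Linux Enterprise Server 15",
		"Ubuntu 22.04.3 LTS",
	}
	for _, in := range inputs {
		name, err := parseSLESName(in)
		fmt.Printf("%q -> %q (err: %v)\n", in, name, err)
	}
}
```

Returning an error for strings that do not match keeps non-SLES nodes from being silently mapped to a SLES driver version.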

Test Plan

  • 13248e7 - tests: update internal/utils_test.go for added support for SLES 15 SP*

Test Result

  • truncated output of make unit-test after adding the new tests
    > make unit-test
    ...
    ...
    === RUN   TestSLESDefaultDriverVersionsMapper
    === RUN   TestSLESDefaultDriverVersionsMapper/SLES_15_SP6
    === RUN   TestSLESDefaultDriverVersionsMapper/SLES_15_SP7
    === RUN   TestSLESDefaultDriverVersionsMapper/SLES_15_SP5
    === RUN   TestSLESDefaultDriverVersionsMapper/SLES_15_SP4
    === RUN   TestSLESDefaultDriverVersionsMapper/SLES_15_base
    === RUN   TestSLESDefaultDriverVersionsMapper/SLES_15_with_dash_format
    --- PASS: TestSLESDefaultDriverVersionsMapper (0.00s)
        --- PASS: TestSLESDefaultDriverVersionsMapper/SLES_15_SP6 (0.00s)
        --- PASS: TestSLESDefaultDriverVersionsMapper/SLES_15_SP7 (0.00s)
        --- PASS: TestSLESDefaultDriverVersionsMapper/SLES_15_SP5 (0.00s)
        --- PASS: TestSLESDefaultDriverVersionsMapper/SLES_15_SP4 (0.00s)
        --- PASS: TestSLESDefaultDriverVersionsMapper/SLES_15_base (0.00s)
        --- PASS: TestSLESDefaultDriverVersionsMapper/SLES_15_with_dash_format (0.00s)
    PASS
    coverage: 48.6% of statements
    ok  	github.com/ROCm/gpu-operator/internal	0.019s	coverage: 48.6% of statements
    === RUN   TestAPIs
    Running Suite: Controller Suite - /home/psaggu/work-suse/amd-gpu-operator-work/oct-23-pr/gpu-operator/internal/controllers
    ==========================================================================================================================
    Random Seed: 1761223798
    
    Will run 15 of 15 specs
    •••••••••••••••
    
    Ran 15 of 15 Specs in 0.008 seconds
    SUCCESS! -- 15 Passed | 0 Failed | 0 Pending | 0 Skipped
    --- PASS: TestAPIs (0.01s)
    PASS
    coverage: 7.9% of statements
    ok  	github.com/ROCm/gpu-operator/internal/controllers	(cached)	coverage: 7.9% of statements
    === RUN   TestAPIs
    Running Suite: KMMModule Suite - /home/psaggu/work-suse/amd-gpu-operator-work/oct-23-pr/gpu-operator/internal/kmmmodule
    =======================================================================================================================
    Random Seed: 1761223798
    
    Will run 5 of 5 specs
    testing multiple valid homogeneous nodes
    testing multiple valid heterogeneous nodes
    testing multiple valid heterogeneous nodes + one unsupported node
    testing multiple unsupported nodes
    testing empty node list
    •<moduleName>
    <amdgpu>
    •<moduleName>
    <amdgpu>
    •••
    
    Ran 5 of 5 Specs in 0.005 seconds
    SUCCESS! -- 5 Passed | 0 Failed | 0 Pending | 0 Skipped
    --- PASS: TestAPIs (0.01s)
    PASS
    coverage: 32.3% of statements
    ok  	github.com/ROCm/gpu-operator/internal/kmmmodule	(cached)	coverage: 32.3% of statements
    
    •••••••••••••••
    
    Ran 15 of 15 Specs in 0.008 seconds
    SUCCESS! -- 15 Passed | 0 Failed | 0 Pending | 0 Skipped
    

  • output from tests added

    ❯ go test ./internal/kmmmodule/... -v -ginkgo.focus="resolveDockerfile" -ginkgo.v
    === RUN   TestAPIs
    Running Suite: KMMModule Suite - /home/psaggu/work-suse/amd-gpu-operator-work/oct-23-pr/gpu-operator/internal/kmmmodule
    =======================================================================================================================
    Random Seed: 1761548380
    
    Will run 3 of 8 specs
    SSSS
    ------------------------------
    resolveDockerfile should use correct default registry when not specified by user
    /home/psaggu/work-suse/amd-gpu-operator-work/oct-23-pr/gpu-operator/internal/kmmmodule/kmmmodule_test.go:683
    • [0.000 seconds]
    ------------------------------
    resolveDockerfile should respect user-specified BaseImageRegistry for all OS types
    /home/psaggu/work-suse/amd-gpu-operator-work/oct-23-pr/gpu-operator/internal/kmmmodule/kmmmodule_test.go:702
    • [0.000 seconds]
    ------------------------------
    resolveDockerfile should return error for unsupported OS
    /home/psaggu/work-suse/amd-gpu-operator-work/oct-23-pr/gpu-operator/internal/kmmmodule/kmmmodule_test.go:727
    • [0.000 seconds]
    ------------------------------
    S
    
    Ran 3 of 8 Specs in 0.000 seconds
    SUCCESS! -- 3 Passed | 0 Failed | 0 Pending | 5 Skipped
    --- PASS: TestAPIs (0.00s)
    PASS
    ok  	github.com/ROCm/gpu-operator/internal/kmmmodule	0.022s
    

Submission Checklist

@Priyankasaggu11929
Contributor Author

Hello @yansun1996, I’ve opened this PR to get early feedback on the approach for adding support for SLES 15 SP6/SP7.
Please review and let me know if/where any changes are needed.

Also, please note that I haven’t tested these changes yet on a SLES 15 host with an AMD GPU. That is in the works!

@yansun1996
Member

yansun1996 commented Oct 23, 2025

Hi @Priyankasaggu11929, thanks for raising the PR. We will review it.

Please also let us know once you have done some verification on a real AMD GPU hardware-based cluster. Thanks!

@Priyankasaggu11929
Contributor Author

Yes, I'll keep posting updates. Thank you!

Comment thread example/deviceconfig_example.yaml Outdated
Member

@yansun1996 yansun1996 left a comment

One minor suggestion; the rest of the PR looks good.
Let us know when you have finished the verification on hardware.

Member

@yansun1996 yansun1996 left a comment

Thanks @Priyankasaggu11929, good job. Please open an identical PR against the staging branch; we manage PRs in this order: staging ---> main ---> release-vx.x.x

Once you have confirmed that verification on an AMD GPU setup is done, we can discuss further details of a release plan with SLES support with the product team.

@Priyankasaggu11929
Contributor Author

Created a PR against the staging branch: #371

Thank you so much!

Regarding "the verification on AMD GPU setup" - I'm still in discussion for getting the required lab infra access, so there are no updates as of now on this, but I will post updates as soon as I am able to run some tests.

leslie-qiwa pushed a commit to leslie-qiwa/gpu-operator that referenced this pull request Feb 6, 2026
* Automate the helm README build in sanity

* address comments
@Priyankasaggu11929 Priyankasaggu11929 force-pushed the enable-sles-support branch 2 times, most recently from 666be77 to 7dfec5e Compare May 4, 2026 07:32
@Priyankasaggu11929
Contributor Author

Hello @yansun1996, I have updated the PR today with the latest changes.

We have tested the PR changes (with the SUSE-built amdgpu driver container image for the latest version, v7.0.3) on a machine with an AMD Radeon Pro V520 (7362) GPU. We can confirm that the AMD gpu-operator detects SLES nodes and publishes the available GPU devices to workloads requesting GPUs (across all kernel versions of the SLES 15 SP7 codestream).

Requesting your review again on the PR changes.

(Also, please let me know if the staging branch is still the way to submit these changes. If so, I'll refresh the other PR too: #371)


I used the following patch so that gpu-operator detects the Radeon Pro V520 (7362) device.

> cat v520-device-support.patch 
diff --git a/hack/k8s-patch/template-patch/gpu-nfd-default-rule.yaml b/hack/k8s-patch/template-patch/gpu-nfd-default-rule.yaml
index 7d0269cd..531bc06b 100644
--- a/hack/k8s-patch/template-patch/gpu-nfd-default-rule.yaml
+++ b/hack/k8s-patch/template-patch/gpu-nfd-default-rule.yaml
@@ -64,6 +64,11 @@ spec:
             matchExpressions:
                 vendor: {op: In, value: ["1002"]}
                 device: {op: In, value: ["73ae"]} # Radeon Pro V620 MxGPU
+      - matchFeatures:
+          - feature: pci.device
+            matchExpressions:
+                vendor: {op: In, value: ["1002"]}
+                device: {op: In, value: ["7362"]} # Radeon Pro V520
   - name: amd-gpu
     labels:
       feature.node.kubernetes.io/amd-gpu: "true"
@@ -185,6 +190,11 @@ spec:
             matchExpressions:
               vendor: {op: In, value: ["1002"]}
               device: {op: In, value: ["73a1"]} # V620
+      - matchFeatures:
+          - feature: pci.device
+            matchExpressions:
+              vendor: {op: In, value: ["1002"]}
+              device: {op: In, value: ["7362"]} # V520
       - matchFeatures:
           - feature: pci.device
             matchExpressions:
diff --git a/helm-charts-k8s/templates/gpu-nfd-default-rule.yaml b/helm-charts-k8s/templates/gpu-nfd-default-rule.yaml
index 7d0269cd..531bc06b 100644
--- a/helm-charts-k8s/templates/gpu-nfd-default-rule.yaml
+++ b/helm-charts-k8s/templates/gpu-nfd-default-rule.yaml
@@ -64,6 +64,11 @@ spec:
             matchExpressions:
                 vendor: {op: In, value: ["1002"]}
                 device: {op: In, value: ["73ae"]} # Radeon Pro V620 MxGPU
+      - matchFeatures:
+          - feature: pci.device
+            matchExpressions:
+                vendor: {op: In, value: ["1002"]}
+                device: {op: In, value: ["7362"]} # Radeon Pro V520
   - name: amd-gpu
     labels:
       feature.node.kubernetes.io/amd-gpu: "true"
@@ -185,6 +190,11 @@ spec:
             matchExpressions:
               vendor: {op: In, value: ["1002"]}
               device: {op: In, value: ["73a1"]} # V620
+      - matchFeatures:
+          - feature: pci.device
+            matchExpressions:
+              vendor: {op: In, value: ["1002"]}
+              device: {op: In, value: ["7362"]} # V520
       - matchFeatures:
           - feature: pci.device
             matchExpressions:

@yansun1996
Member


Hi @Priyankasaggu11929, thanks for the update and verification. Let me discuss this PR with the team, and I will get back to you.

@Priyankasaggu11929
Contributor Author

Thank you.

@Priyankasaggu11929
Contributor Author

Hello @yansun1996, could you help with some information on the following:

We were waiting for a new amdgpu driver modules release that supports SLES 16.0, and just noticed that the modules are available under tag v31.20: https://repo.radeon.com/amdgpu/31.20/sle/16.0/main/x86_64/

Previously, we had been using the v7.0.3 release as the latest release: https://repo.radeon.com/amdgpu/7.0.3/sle/
and were monitoring the latest branch here: https://repo.radeon.com/amdgpu/latest/sle/

What is the difference between the 7.x.x and 31.x.x versioning schemes? And which release stream is the recommended one to use?

@yansun1996
Member

Hi @Priyankasaggu11929,

I will cherry-pick your commits for internal CI (not publicly available yet).

As for your question about v7.x.x and v30.x.x: the driver versioning changed recently.

Previously, for driver versions <= v7.0.x, each amdgpu release followed the ROCm release version.

In recent amdgpu driver releases, the release version has diverged from the ROCm release and now uses 30.x.x.

So as of now, the amdgpu driver has its own release version and the ROCm runtime libs have theirs.

For more information, please check https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/user-kernel-space-compat-matrix.html#user-and-amd-gpu-driver-amdgpu-support-matrix
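
To make the two schemes concrete, here is a rough sketch that classifies a driver version string by its major number. The helper name `driverVersionScheme`, the returned labels, and the cutoff of 30 are assumptions inferred from the explanation above, not an official rule; the linked compatibility matrix remains the authoritative reference.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// driverVersionScheme is a hypothetical helper. It reports whether an amdgpu
// driver version appears to follow the older ROCm-aligned numbering
// (for example 7.0.3) or the newer independent numbering (for example 31.20).
func driverVersionScheme(version string) (string, error) {
	v := strings.TrimPrefix(version, "v")
	majorStr, _, _ := strings.Cut(v, ".")
	major, err := strconv.Atoi(majorStr)
	if err != nil {
		return "", fmt.Errorf("invalid driver version %q: %w", version, err)
	}
	// Assumption: majors of 30 and above belong to the independent scheme.
	if major >= 30 {
		return "amdgpu-independent", nil
	}
	return "rocm-aligned", nil
}

func main() {
	for _, v := range []string{"7.0.3", "v31.20", "latest"} {
		scheme, err := driverVersionScheme(v)
		fmt.Printf("%s -> %q (err: %v)\n", v, scheme, err)
	}
}
```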

Comment on lines +100 to +104
var slesCSDPrebuiltDriverImages = map[string]map[string]string{
	"15.7": {
		"7.0.3": "registry.suse.com/third-party/amd/amdgpu-driver:sles-15.7-7.0.3",
	},
}
Member

Bug — data tables are inconsistent. This prebuilt-image table only registers 15.7 → 7.0.3, but SLESDefaultDriverVersionsMapper (utils.go) defaults to 7.0.2 for SP6 and 6.2.2 for SP5/base. With those defaults, the lookup in getKM (around line 572) silently misses and SUSE_PREBUILT_DRIVER_IMG is never injected. The new Dockerfile template has no fallback — FROM ${SUSE_PREBUILT_DRIVER_IMG} AS driver-source resolves to FROM AS driver-source, which fails the build.

Net effect: with the defaults this PR ships, SP5 and SP6 nodes (which the PR claims to support) cannot build. Either add prebuilt entries for the SP5/SP6 default driver versions, narrow SLESDefaultDriverVersionsMapper to only return versions present here, or surface an explicit error on lookup miss.

Contributor Author

@Priyankasaggu11929 Priyankasaggu11929 May 11, 2026

Thanks for the pointer.

SLESDefaultDriverVersionsMapper in utils.go was supposed to be cleaned up as part of my previous commit refreshes.

I have narrowed the supported versions to SLES 15 SP7 only and kept just the 15.7 -> 7.0.3 entry in the table, since we are going to build the prebuilt driver container images starting with SLES 15 SP7.

(I'll either send a new PR or update this one to add SLES 16.0 to the table as soon as the respective prebuilt container image is ready.)

I have also addressed the silent-skip behavior: for now, any version other than SLES 15.7 returns an error message.

Also, please note that I will update the driver version to the new amdgpu driver version, v31.20, once the respective prebuilt container image is ready and published. Thanks!
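
The "explicit error on lookup miss" option discussed in this thread could look roughly like the sketch below. The table shape and its single entry mirror the snippet quoted in the review; `lookupPrebuiltImage` and the error wording are hypothetical and may differ from what was actually merged.

```go
package main

import "fmt"

// prebuiltImages mirrors the shape of the table quoted in the review:
// SLES version -> driver version -> prebuilt image reference.
var prebuiltImages = map[string]map[string]string{
	"15.7": {
		"7.0.3": "registry.suse.com/third-party/amd/amdgpu-driver:sles-15.7-7.0.3",
	},
}

// lookupPrebuiltImage is a hypothetical helper. Instead of silently returning
// an empty string on a miss (which would leave the Dockerfile build argument
// unset), it returns a descriptive error for either missing key.
func lookupPrebuiltImage(osVersion, driverVersion string) (string, error) {
	byDriver, ok := prebuiltImages[osVersion]
	if !ok {
		return "", fmt.Errorf("no prebuilt driver images for SLES %s", osVersion)
	}
	img, ok := byDriver[driverVersion]
	if !ok {
		return "", fmt.Errorf("no prebuilt driver image for SLES %s with driver %s", osVersion, driverVersion)
	}
	return img, nil
}

func main() {
	fmt.Println(lookupPrebuiltImage("15.7", "7.0.3"))
	fmt.Println(lookupPrebuiltImage("15.6", "7.0.2"))
	fmt.Println(lookupPrebuiltImage("15.7", "6.2.2"))
}
```

Failing at lookup time surfaces the mismatch between the default driver version and the image table before any image build is attempted.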

Comment thread internal/kmmmodule/kmmmodule.go
Comment thread internal/utils.go Outdated
Comment thread internal/kmmmodule/kmmmodule.go Outdated
Member

@yansun1996 yansun1996 left a comment

PTAL at the comments, @Priyankasaggu11929. Thanks!

@Priyankasaggu11929
Contributor Author

Thank you. Then I will update the driver version to use v31.20 adhering to the new release version scheme.

Priyankasaggu11929 and others added 4 commits May 11, 2026 11:05
…opriate AMD GPU driver versions

* add new `slesCMNameMapper` to parse SLES version strings like 'SUSE Linux Enterprise Server 15 SP7' to 'sles-15.7'
* add `SLESDefaultDriverVersionsMapper` to select driver versions
* register both 'sles' and 'suse' identifiers in mappers

Co-authored-by: alex-isv <alex.zacharow@suse.com>
…sles"

* although, use-specified `BaseImageRegistry` still takes precedence

* also extend tests in `internal/kmmodule/kmmodule_test.go` to test above changes in `resolveDockerfile` func
@yansun1996
Member

Hi @Priyankasaggu11929, are all the changes done here? If yes, I will trigger the internal CI pipeline.

@Priyankasaggu11929
Contributor Author

Yes, all changes for the SLES 15 SP7 + amdgpu driver modules v7.0.3 configuration are complete.

We're working on creating container images for newer amdgpu driver modules releases. But that is non-blocking for this PR and I'll add them in follow up PRs.

@sajmera-pensando sajmera-pensando merged commit 75c9cdd into ROCm:main May 19, 2026
3 checks passed
@yansun1996
Member

Hi @Priyankasaggu11929, we have merged this PR. If you have any other pull requests, please let us know. Thanks for your contribution!

@Priyankasaggu11929
Contributor Author

I'll create another PR for SLES 16.0 (as soon as the respective driver container images are published on registry.suse.com) and will ping you on that. Thank you for your help so far, @yansun1996.
