Add SLES support for AMD gpu-operator#365
Conversation
|
Hello @yansun1996, I’ve opened this PR to get early feedback on the approach for adding support for SLES 15 SP6/SP7. Also please note - I haven’t tested these changes yet on a SLES 15 host with an AMD GPU. That is in works! |
Hi @Priyankasaggu11929 thanks for raising the PR, we will review this PR. Please also let us know when you did some verification on the real AMD GPU hardware based cluster. thanks ! |
Yes, I'll keep posting updates. Thank you! |
yansun1996
left a comment
There was a problem hiding this comment.
one minor suggestion, the rest of the PR looks good
Let us know when you finished the verification with hardware
d46ce29 to
4da60d3
Compare
yansun1996
left a comment
There was a problem hiding this comment.
thanks @Priyankasaggu11929 good job, please open another same PR against the staging branch, we're managing PR in this way staging ---> main ---> release-vx.x.x
once you confirmed the verification on AMD GPU setup is done, we can discuss with product team about further details for a release plan with SLES support
Created PR for staging branch - #371
Thank you so much! Regarding "the verification on AMD GPU setup" - I'm still in discussion for getting the required lab infra access, so there are no updates as of now on this, but I will post updates as soon as I am able to run some tests. |
* Automate the helm README build in sanity * address comments
666be77 to
7dfec5e
Compare
|
Hello @yansun1996, I have updated the PR today with latest changes. We have tested the PR changes (with the SUSE built amdgpu driver container image for latest version v7.0.3) on a machine with Requesting your review again on the PR changes. (Also, please let me know if staging branch is still the way to submit these changes, in that case, I'll refresh the other PR too - #371) I used the following patch for gpu-operator to detect |
Hi @Priyankasaggu11929 thanks for the update and verification, let me discuss with the team about this PR, will get back to you. |
Thank you. |
|
Hello @yansun1996, could you help with some information on the following: We were waiting for a new amdgpu driver modules release that supports Previously, we had been using the What is the difference between the |
Hi @Priyankasaggu11929 , I will cherry-pick your commits for internal CI (not public available yet) as for your question: v7.x.x and v30.x.x there is a recent driver version change happened. Previously in driver version <= v7.0.x you will see that each amdgpu release has followed the ROCm release version However in recent amdgpu driver releases, the release version has diverged from ROCm release, start to use 30.x.x So as of now, amdgpu driver has its own release and ROCm runtime libs has their release version for more information please check https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/user-kernel-space-compat-matrix.html#user-and-amd-gpu-driver-amdgpu-support-matrix |
| var slesCSDPrebuiltDriverImages = map[string]map[string]string{ | ||
| "15.7": { | ||
| "7.0.3": "registry.suse.com/third-party/amd/amdgpu-driver:sles-15.7-7.0.3", | ||
| }, | ||
| } |
There was a problem hiding this comment.
Bug — data tables are inconsistent. This prebuilt-image table only registers 15.7 → 7.0.3, but SLESDefaultDriverVersionsMapper (utils.go) defaults to 7.0.2 for SP6 and 6.2.2 for SP5/base. With those defaults, the lookup in getKM (around line 572) silently misses and SUSE_PREBUILT_DRIVER_IMG is never injected. The new Dockerfile template has no fallback — FROM ${SUSE_PREBUILT_DRIVER_IMG} AS driver-source resolves to FROM AS driver-source, which fails the build.
Net effect: with the defaults this PR ships, SP5 and SP6 nodes (which the PR claims to support) cannot build. Either add prebuilt entries for the SP5/SP6 default driver versions, narrow SLESDefaultDriverVersionsMapper to only return versions present here, or surface an explicit error on lookup miss.
There was a problem hiding this comment.
Thanks for the pointer.
SLESDefaultDriverVersionsMapper in utils.go was supposed to be cleaned up as part of my previous commit refreshes.
I have narrowed the supported versions to SLES 15 SP7 only and keeping just the 15.7 -> 7.0.3 entry in the table. Since, we are now going to build the prebuilt driver container images starting SLES 15 SP7.
(I'll either send a new PR or update this one to add SLES 16.0 as well to the table, as soon as the respective prebuilt container image is ready).
Also addressed the silent-skip behavior to now properly return error messages on any version outside SLES 15.7 for now.
Also, please note - I will update the driver version to use the new amdgpu driver version - v31.20 once the respective prebuilt container image is ready and published. Thanks!
yansun1996
left a comment
There was a problem hiding this comment.
PTAL the comments @Priyankasaggu11929 , thanks
7dfec5e to
017fcb5
Compare
Thank you. Then I will update the driver version to use |
…opriate AMD GPU driver versions * add new `slesCMNameMapper` to parse SLES version strings like 'SUSE Linux Enterprise Server 15 SP7' to 'sles-15.7' * add `SLESDefaultDriverVersionsMapper` to select driver versions * register both 'sles' and 'suse' identifiers in mappers Co-authored-by: alex-isv <alex.zacharow@suse.com>
… SUSE AMD GPU driver image
…sles" * although, use-specified `BaseImageRegistry` still takes precedence * also extend tests in `internal/kmmodule/kmmodule_test.go` to test above changes in `resolveDockerfile` func
017fcb5 to
5a0c824
Compare
|
Hi @Priyankasaggu11929 , are all the changed done here ? If yes i will trigger internal CI pipeline |
yes, all changes for configuration - We're working on creating container images for newer amdgpu driver modules releases. But that is non-blocking for this PR and I'll add them in follow up PRs. |
|
Hi @Priyankasaggu11929 we have merged this PR, if you have any other pull request pls let us know, thanks for your contribution ! |
I'll create another PR SLES 16.0 (as soon as the respective driver container images are published on registry.suse.com) - will ping you on that. Thank you for your help so far, @yansun1996. |
Motivation
This PR aim at adding support for SUSE Linux Enterprise Server (SLES) 15 SP7+ to the AMD GPU operator.
Technical Details
abbdaa8 - add support for detecting SLES nodes and automatically selecting appropriate AMD GPU driver versions
slesCMNameMapperto parse SLES version strings like 'SUSE Linux Enterprise Server 15 SP7' to 'sles-15.7'SLESDefaultDriverVersionsMapperto select driver versions669b888 - add SLES Dockerfile template (
DockerfileTemplate.sles) using prebuilt SUSE AMD GPU Driver image (currently, I've skipped adding the GIM Dockerfile template for SLES, will tackle it once this goes through).c2dce44 - docs: update example/deviceconfig_example.yaml<- dropped017fcb5- use "registry.suse.com" as the default base image registry if OS == "sles"
BaseImageRegistrystill takes precedenceinternal/kmmodule/kmmodule_test.goto test above changes inresolveDockerfilefuncTest Plan
Test Result
make unit-testafter new added testsoutput from tests added
Submission Checklist