Docker: CUDA 12.4->12.9, enable NCCL, support CCv7 #1468

Open
sempervictus wants to merge 3 commits into EricLBuehler:master from sempervictus:feature/cuda_docker_update

Conversation

sempervictus commented Jun 13, 2025
Contributor

Update CUDA (base image containers) to 12.9.

Drop minimum compute capability requirement to v7 - mistral-rs is great on older devices which do not support flash attention (in the same hardware facilities as v8+).

Enable NCCL feature for the CUDA build target.

Notes:
Depends on EricLBuehler/candle#83 or
equivalent change to support 12.9 (max supported right now is 12.8)

This should help to get better mileage for H/B series users
especially on Open drivers

Summary by CodeRabbit

  • Chores
    • Updated CUDA environment to version 12.9.0 for improved compatibility.
    • Adjusted GPU compute capability target for broader hardware support.
    • Enhanced build features by enabling NCCL support.
    • Added additional runtime dependencies for improved performance and communication.

coderabbitai Bot commented Jun 13, 2025

Walkthrough

The Dockerfile for the CUDA build environment was updated to use CUDA version 12.9.0 instead of 12.4.1, with a lowered compute capability argument from 80 to 70. The build features were expanded to include nccl alongside cuda and cudnn. The runtime stage now installs openmpi-bin and creates a symbolic link for libnccl.so in addition to existing dependencies.

Changes

File(s) | Change Summary
Dockerfile.cuda-all | Updated CUDA base images to 12.9.0; set compute capability to 70; added nccl to build features; added openmpi-bin to runtime stage; created libnccl.so symlink and updated LD_LIBRARY_PATH.

Poem

A Docker hop with CUDA anew,
From 12.4 to 12.9 we flew!
Compute drops down, but features grow—
Now with NCCL in tow.
OpenMPI joins the run,
Building faster, having fun!
🐇✨


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7315e27 and b83c1d8.

📒 Files selected for processing (1)
  • Dockerfile.cuda-all (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • Dockerfile.cuda-all
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Docs
  • GitHub Check: Check (macOS-latest, stable)
  • GitHub Check: Check (windows-latest, stable)
  • GitHub Check: Check (ubuntu-latest, stable)
  • GitHub Check: Clippy
  • GitHub Check: Test Suite (macOS-latest, stable)
  • GitHub Check: Test Suite (ubuntu-latest, stable)
  • GitHub Check: Test Suite (windows-latest, stable)

github-actions Bot commented Jun 13, 2025

Code Metrics Report
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C Header                3           63           54            0            9
 CSS                     1          473          408           14           51
 Dockerfile              1           39           22            8            9
 HTML                    1           78           64            5            9
 JavaScript              7         1397         1068          180          149
 JSON                   22          410          407            0            3
 Makefile                1            6            5            0            1
 Python                102         5660         4631          298          731
 Shell                   1           63           26           18           19
 Plain Text              3         3723            0         2413         1310
 TOML                   23          866          797           11           58
 YAML                    2           21           19            2            0
-------------------------------------------------------------------------------
 Jupyter Notebooks       3            0            0            0            0
 |- Markdown             2           77           32           31           14
 |- Python               2          205          178            1           26
 (Total)                            282          210           32           40
-------------------------------------------------------------------------------
 Markdown               74         6981            0         5227         1754
 |- BASH                19          299          260           24           15
 |- JSON                11          523          523            0            0
 |- Python              14          521          434           35           52
 |- Rust                32         1320         1108           36          176
 |- TOML                 2           75           63            0           12
 (Total)                           9719         2388         5322         2009
-------------------------------------------------------------------------------
 Rust                  421       156096       137867         3983        14246
 |- Markdown           200         4347          285         3497          565
 (Total)                         160443       138152         7480        14811
===============================================================================
 Total                 665       175876       145368        12159        18349
===============================================================================

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
Dockerfile.cuda-all (1)

29-29: Lower minimum compute capability to 70
Dropping the CC requirement from 80 to 70 broadens device support. Please update any documentation (e.g., README, CI matrices) to reflect CC ≥ 7.0 as the new minimum.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f3b1afa and ae4949e.

📒 Files selected for processing (1)
  • Dockerfile.cuda-all (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (9)
  • GitHub Check: Docs
  • GitHub Check: Clippy
  • GitHub Check: Check (macOS-latest, stable)
  • GitHub Check: Check (ubuntu-latest, stable)
  • GitHub Check: Test Suite (ubuntu-latest, stable)
  • GitHub Check: Check (windows-latest, stable)
  • GitHub Check: Test Suite (windows-latest, stable)
  • GitHub Check: Test Suite (macOS-latest, stable)
  • GitHub Check: comment
🔇 Additional comments (2)
Dockerfile.cuda-all (2)

3-3: Upgrade CUDA builder image to 12.9.0
The base image has been bumped to nvidia/cuda:12.9.0-cudnn-devel-ubuntu22.04. Ensure that the dependent candle repo PR (#83) or an equivalent update has been merged so that CUDA 12.9 is fully supported across your toolchain.


36-36: Upgrade CUDA runtime image to 12.9.0
The runtime image was updated to nvidia/cuda:12.9.0-cudnn-runtime-ubuntu22.04. Confirm it exists and provides the necessary CUDA/CUDNN (and NCCL, if used at runtime) libraries.

Comment thread Dockerfile.cuda-all Outdated
@sempervictus

Contributor Author

Tested and working on V100s w/ NCCL distribution. It seems a bit VRAM hungry.

@polarathene

Contributor

This should help to get better mileage for H/B series users especially on Open drivers

Could you please clarify?

Forward compatibility

By default PTX of an earlier CC should still be present to provide forward compatibility support of newer GPUs. Depending on the GPU, CUDA 12.8 or 12.9 libs should only matter when linking to libs to use API calls that have either as a minimum requirement to build. Otherwise AFAIK when using older virtual arch via PTX, the CUDA version should only matter on the runtime side?

CUDA version should be unrelated to Compute Capability, other than linked libs containing ELF/PTX kernels for the newer GPU archs, but without static linking none of that should matter at build time (the user can use a newer runtime to get their sm_100 / sm_120 optimizations).

So I'm not sure if updating cudarc via the referenced PR actually matters when it comes to running on newer GPUs? (other than the reasonable amount of fixes/changes over the past 6 months since 0.13.3 unrelated to newer CUDA release support, probably doesn't help that the usage is via a Nov 2024 0.8.0 fork of candle?)

The other scenario is with bindgen_cuda (NOTE: mistral.rs presently uses a fork), and using nvcc with build support for the newer GPU archs / CC when the project compiles its own CUDA kernels at build for a target CC (like mistral.rs does for some crates).

That won't be relevant either, however, with the default images built and published; unless CI is also updated accordingly, users would need to perform custom builds themselves with ARGs.

Have you considered adding additional ARGs for the build/runtime base images instead? For custom builds that'd make bumping the CUDA version for either quite easy.
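
As a rough sketch of what parameterised base images could look like from the command line: the helper below assembles the image tags quoted in this thread from a version and a flavour, and prints a matching build command. `CUDA_VERSION` is a hypothetical build arg, not something the Dockerfile is known to define; only `CUDA_COMPUTE_CAP` and `Dockerfile.cuda-all` come from the PR.

```shell
#!/bin/sh
# Hypothetical: the Dockerfile in this PR hard-codes its base images, so the
# CUDA_VERSION build arg below only has an effect if a matching ARG is added.

# Print the base image tag for a CUDA version and flavour (devel or runtime).
cuda_image() {
    printf 'nvidia/cuda:%s-cudnn-%s-ubuntu22.04\n' "$1" "$2"
}

# Print a docker build command for a CUDA version and a dotless compute cap.
build_cmd() {
    printf 'docker build -f Dockerfile.cuda-all --build-arg CUDA_VERSION=%s --build-arg CUDA_COMPUTE_CAP=%s .\n' "$1" "$2"
}

cuda_image 12.9.0 devel
cuda_image 12.9.0 runtime
build_cmd 12.9.0 80
```

With something along these lines, a CI matrix could vary the CUDA version and compute capability per published image without touching the defaults.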

CUDA_COMPUTE_CAP should only need to be set if you want to target a higher CC, but AFAIK those still typically require explicit conditional compilation logic somewhere. If you just want the optimized kernels from the common CUDA libs, bumping the runtime image alone should be sufficient when no API calls are changing.

I don't have access to the newer GPU hardware to try, but I'd love to see a performance comparison of the CC version making a difference for a CUDA kernel that doesn't explicitly leverage newer features or data types.


Reference

For libcublas.so or libcublas_static.a, for example, the main benefit is sm_120:

$ docker run --rm -it nvcr.io/nvidia/cuda:12.9.0-cudnn-devel-ubuntu22.04

# CC 12.0 PTX:
$ cuobjdump --list-ptx /usr/local/cuda/lib64/libcublas_static.a | grep -oE '\.sm_.*\.' | sort -u
.sm_120.

# Real arch cubins embedded:
$ cuobjdump --list-elf /usr/local/cuda/lib64/libcublas_static.a | grep -oE '\.sm_.*\.' | sort -u
.sm_100.
.sm_120.
.sm_50.
.sm_50a.
.sm_60.
.sm_60a.
.sm_61.
.sm_61a.
.sm_70.
.sm_75.
.sm_80.
.sm_86.
.sm_90.

These targets are already available in CUDA 12.8. CUDA 12.9 is important for avoiding some 12.8 issues with the newer sm_100 / sm_120 GPUs; it shouldn't matter at build time for earlier PTX such as compute_80, although the runtime may require CUDA 12.8 / 12.9:

$ docker run --rm -it fedora:41
$ dnf config-manager addrepo --from-repofile https://developer.download.nvidia.com/compute/cuda/repos/fedora41/x86_64/cuda-fedora41.repo

# NOTE: If building for CC 100/120+ prefer CUDA 12.9
$ dnf install cuda-nvcc-12-8

$ /usr/local/cuda-12.8/bin/nvcc --list-gpu-arch
compute_50
compute_52
compute_53
compute_60
compute_61
compute_62
compute_70
compute_72
compute_75
compute_80
compute_86
compute_87
compute_89
compute_90
compute_100
compute_101
compute_120

@polarathene polarathene left a comment

Contributor

  • Revert changing the default/fallback CUDA_COMPUTE_CAP
  • If OpenMPI is complementary in the runtime image, clarify that via comment. Documenting an example for usage with the NCCL feature may be beneficial to users.
  • Add better contextual comment regarding symlink requirement only applying to runtime image. Use a relative symlink like devel.
  • Drop LD_LIBRARY_PATH addition, unless it's provably required, then provide context as to what requires it.
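
For illustration, a relative symlink of the kind requested above can be tried in a scratch directory. This is only a sketch: the scratch directory stands in for the real library directory in the runtime image, whose path is not shown in this thread, and `libnccl.so.2` is assumed to be the versioned name NCCL installs.

```shell
#!/bin/sh
set -eu

# Scratch directory standing in for the runtime image's library directory.
libdir="$(mktemp -d)"
touch "$libdir/libnccl.so.2"   # placeholder for the real shared object

# Relative link: the target is resolved against the link's own directory,
# so it stays valid if the directory is copied or mounted elsewhere.
ln -s libnccl.so.2 "$libdir/libnccl.so"

# Show what the link points at.
readlink "$libdir/libnccl.so"
```

Whether the dynamic loader then finds the library without an `LD_LIBRARY_PATH` change depends on the directory already being on the loader's search path, which would need checking in the actual image.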

Comment thread Dockerfile.cuda-all
  # Rayon threads are limited to minimize memory requirements in CI, avoiding OOM
  # Rust threads are increased with a nightly feature for faster compilation (single-threaded by default)
- ARG CUDA_COMPUTE_CAP=80
+ ARG CUDA_COMPUTE_CAP=70

Contributor

👎


This is just the default ARG, what is the motivation to default to a lower CC by default?

GPUs below CC 8.0 lack support for BF16, for example. Defaults should strike a balance that suits a wider audience without lowering optimizations to support hardware that is approx a decade old (just like defaulting to hardware that is too new sacrifices compatibility for performance).

The default should not be lowered here IMO since a custom build can be done (and in this case you can have it built with CI to publish the image), so keep it at 80.

@polarathene polarathene Jun 15, 2025

Contributor

I recently responded to a prior discussion you were in, where I referenced another project that documents Turing (CC 7.5) as the min supported Compute Capability for building.

The CUDA release notes also document a deprecation notice: Maxwell to Volta (CC 5.0 to 7.2) will be dropped from CUDA 13 onwards.

Supporting CC 7.0 at that point will require additional maintenance/ARGs, or holding the image back from migrating to CUDA 13 when it's suitable to upgrade to. Using ARGs for the base image can help, where CI matrix can adjust support for different base images.

Contributor Author

This is still CUDA 12, so 7 is supported. When making all of the changes for 13, this should be bumped. In the meantime, there are more systems supporting 7, as it's a subset of 10 (current). This does not enable any features unsupported by modern hardware that would cause incompatibility. Trying to broaden adoption: big world, lots of folks with older fare, and academia isn't exactly brimming with cash to buy B300s.

Contributor

Sorry? Are you not understanding my feedback, or how a Dockerfile works?

There is absolutely no need to change the ARG here. It will not magically produce official images with that capability, the only convenience you get is when manually building the image you can skip --build-arg CUDA_COMPUTE_CAP=70.

There is absolutely no need for that minor convenience for users on hardware that's approx a decade old. Not when it opts out of performance benefits from higher CC for the wider audience of users (as a default).

A default is for the broader audience and what suits them best, not the oldest hardware possible to support at the expense of the majority of users.

Leave it at 80 and change it for your own personal builds. If you want to justify 70 as a default, show how many references on this project have users expressing frustration from not knowing how to build the image with their CC.

If instead you just want official builds for CC 7.0, then instead adjust the CI workflow that builds the variants and publishes to GHCR. Your change here is the wrong place for such.

If you need an easy example of CC support, data types are a common one. Take FP16: if you build with CC 5.3 or newer you can use it. Anything older and the build would fail unless there's an explicit fallback in the kernel.

As a default you are intentionally dismissing the more optimal features for the bulk of cuda devices supported to avoid providing an extra arg when extra compatibility is needed. That is not how defaults should be chosen.

Contributor Author

I think i'm missing the part where the minimum compute capability defined limits use of facilities available in newer hardware... This is intended to broaden the range of hardware on which the container will run correctly if built with defaults. If the parameter prevents the resulting code from leveraging newer hardware correctly, then it makes sense to pull the change and probably wire-in a resolver for the users' environment to detect that at build-time if not provided.
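
One way such a resolver might look, as a sketch run on the host before `docker build` (a build stage normally has no GPU access). It assumes `nvidia-smi` is on `PATH` and recent enough to support the `compute_cap` query field; the function names are made up for this example.

```shell
#!/bin/sh
# Convert a dotted compute capability ("8.6") to the dotless form that
# CUDA_COMPUTE_CAP expects ("86").
cc_to_arg() {
    printf '%s\n' "$1" | tr -d '.'
}

# Ask the driver for the first GPU's compute capability.
# Assumption: the installed nvidia-smi supports --query-gpu=compute_cap.
detect_cc() {
    nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1
}

if command -v nvidia-smi >/dev/null 2>&1; then
    cap="$(cc_to_arg "$(detect_cc)")"
    # Print the command rather than running it, so it can be reviewed first.
    echo "docker build -f Dockerfile.cuda-all --build-arg CUDA_COMPUTE_CAP=$cap ."
else
    echo "nvidia-smi not found; pass --build-arg CUDA_COMPUTE_CAP=<value> by hand" >&2
fi
```

This keeps the Dockerfile default untouched and moves the hardware-specific choice to the person building the image.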

Comment thread Dockerfile.cuda-all Outdated
Comment thread Dockerfile.cuda-all Outdated
sempervictus commented Jun 15, 2025
Contributor Author

This should help to get better mileage for H/B series users especially on Open drivers

Could you please clarify?

Forward compatibility

By default PTX of an earlier CC should still be present to provide forward compatibility support of newer GPUs. Depending on the GPU, CUDA 12.8 or 12.9 libs should only matter when linking to libs to use API calls that have either as a minimum requirement to build. Otherwise AFAIK when using older virtual arch via PTX, the CUDA version should only matter on the runtime side?

CUDA version should be unrelated to Compute Capability, other than linked libs containing ELF/PTX kernels for the newer GPU archs, but without static linking none of that should matter at build time (the user can use a newer runtime to get their sm_100 / sm_120 optimizations).

So I'm not sure if updating cudarc via the referenced PR actually matters when it comes to running on newer GPUs? (other than the reasonable amount of fixes/changes over the past 6 months since 0.13.3 unrelated to newer CUDA release support, probably doesn't help that the usage is via a Nov 2024 0.8.0 fork of candle?)

The other scenario is with bindgen_cuda (NOTE: mistral.rs presently uses a fork), and using nvcc with build support for the newer GPU archs / CC when the project compiles its own CUDA kernels at build for a target CC (like mistral.rs does for some crates).

That won't be relevant either, however, with the default images built and published; unless CI is also updated accordingly, users would need to perform custom builds themselves with ARGs.

Have you considered adding additional ARGs for the build/runtime base images instead? For custom builds that'd make bumping the CUDA version for either quite easy.

CUDA_COMPUTE_CAP should only need to be set if you want to target a higher CC, but AFAIK those still typically require explicit conditional compilation logic somewhere. If you just want the optimized kernels from the common CUDA libs, bumping the runtime image alone should be sufficient when no API calls are changing.

I don't have access to the newer GPU hardware to try, but I'd love to see a performance comparison of the CC version making a difference for a CUDA kernel that doesn't explicitly leverage newer features or data types.

Reference

For libcublas.so or libcublas_static.a, for example, the main benefit is sm_120:

$ docker run --rm -it nvcr.io/nvidia/cuda:12.9.0-cudnn-devel-ubuntu22.04

# CC 12.0 PTX:
$ cuobjdump --list-ptx /usr/local/cuda/lib64/libcublas_static.a | grep -oE '\.sm_.*\.' | sort -u
.sm_120.

# Real arch cubins embedded:
$ cuobjdump --list-elf /usr/local/cuda/lib64/libcublas_static.a | grep -oE '\.sm_.*\.' | sort -u
.sm_100.
.sm_120.
.sm_50.
.sm_50a.
.sm_60.
.sm_60a.
.sm_61.
.sm_61a.
.sm_70.
.sm_75.
.sm_80.
.sm_86.
.sm_90.

These targets are already available in CUDA 12.8. CUDA 12.9 is important for avoiding some 12.8 issues with the newer sm_100 / sm_120 GPUs; it shouldn't matter at build time for earlier PTX such as compute_80, although the runtime may require CUDA 12.8 / 12.9:

$ docker run --rm -it fedora:41
$ dnf config-manager addrepo --from-repofile https://developer.download.nvidia.com/compute/cuda/repos/fedora41/x86_64/cuda-fedora41.repo

# NOTE: If building for CC 100/120+ prefer CUDA 12.9
$ dnf install cuda-nvcc-12-8

$ /usr/local/cuda-12.8/bin/nvcc --list-gpu-arch
compute_50
compute_52
compute_53
compute_60
compute_61
compute_62
compute_70
compute_72
compute_75
compute_80
compute_86
compute_87
compute_89
compute_90
compute_100
compute_101
compute_120

https://developer.nvidia.com/blog/nvidia-blackwell-and-nvidia-cuda-12-9-introduce-family-specific-architecture-features/ for one - nvidia is stabilizing drivers and runtimes around the blackwell gear on the fly. The mezzanine change alone from a PCI complex to an IB interface ((MT4129 - 692-9X760-00SE-S00) Nvidia ConnectX-7 mezz internal for Nvidia Umbriel system - x4) set is significant and will likely impact the direction of CUDA and driver development moving forward.

Re nvcc - LLVM16 and 20 largely support the same x86 architectures as well but the binary product of compilation to those targets is not equivalent between them.

sempervictus commented Jun 15, 2025
Contributor Author

@polarathene thanks for the commits.
I'm a bit confused by the rest of the commentary however - is there some concern about stability or performance for which tests do not exist? This PR is attempting to broaden the range of compatible equipment by modernizing the build and runtime stacks, enabling NCCL with MPI during execution, and dropping the minimum viable CUDA compute target to 7.0 (it's not eliminating 8+). There is a wall of text above and i am very time constrained, so could you please summarize your concerns in a succinct manner regarding the proposed new defaults?

polarathene commented Jun 16, 2025
Contributor

There is a wall of text above and i am very time constrained so could you please summarize your concerns in a succinct manner regarding the proposed new defaults?

I don't know your experience in this area (Docker builds or CUDA builds), but here's the gist of it:

  • ARG does not need to change its default for CC 7.0 to be supported. You're approaching this wrong.
  • Hopper (sm_90) (already publishing via CI) / Blackwell GPU archs (sm_10x / sm_12x) should already be supported with PTX (minus arch optimizations).

Your runtime container should provide a CUDA driver (libcuda.so.1) that is compatible with your kernel driver. That is unrelated to the version bump you're adding here.

  • For the runtime image this would embed additional cubins/PTX for CUDA libs (CuBLAS/CuDNN/etc).
  • For the builder image/stage, this would be relevant if building with CC 10.0 / 12.0, which affects building additional CUDA kernels from this project or its deps (eg: FlashAttention). Otherwise the PTX provides forward compatibility.

If you need more clarity on either of those points, please go over the full response.

Original reply follows

Was the TLDR checklist not clear?

That should be fairly clear, but I will try to clarify the compute cap concern for you better...

CUDA_COMPUTE_CAP=70 does nothing meaningful as a default:

  • Reduces optimizations for newer GPUs (CC 8.0 - May 2020+) to expand the compatibility by 3 years.
    • Only when you're building the image yourself and relying on the default ARG.
    • Note that CC 7.5 (Sep 2018) is already supported/published. CC 7.0 (Dec 2017) can do the same, no need to make it a default officially (more maintenance burden from user expectations which would be better communicated if they mention reliance upon CC 7.0 as opt-in).
  • Shifts the expectation of using --build-arg CUDA_COMPUTE_CAP=<value> to anyone not using CC 7.0 (instead of the current CC 8.0), despite the fact that more users building mistral.rs will have Ampere/Ada GPUs (3xxx/4xxx) than Volta (V100) / Turing (2xxx).

Honestly, I'm not wanting to repeat myself because my feedback is too vague, or too verbose. If you want to push for making a change you don't understand, then please find the time to understand it.


I'm a bit confused by the rest of the commentary however - is there some concern about stability or performance for which tests do not exist?

I am not familiar with this project's test suite. The bulk of my feedback is discouraging your change to the default CC, defaults should be tailored for the majority.


This PR is attempting to broaden the range of compatible equipment by modernizing the build and runtime stacks

  • OLDER CC - Lowering the ARG in a Dockerfile is redundant. You want to have that handled in CI published images if you care about support being built/pushed to GHCR for you to pull, unless you want to do a local build instead.
  • NEWER CC - Raising the version of CUDA is ok, but without custom builds specifically targeting sm_100 / sm_120 cubins or the equivalent PTX CC, it should only be relevant for runtime improvements/support (notably for libs like CuBLAS with optimized kernels present).

Regarding the CUDA version bump:

  • I'm not against this change, just noting that it's not necessary at build time.
  • Without changing the CC at build-time, you can still benefit from CUDA 12.9 libs at runtime (eg: CuBLAS / CuDNN).
  • Runtime with a container should also be compatible with lower CUDA libs, as your host CUDA driver (libcuda.so.1) is mounted into the container (nvidia-smi should output 12.9 there). The runtime container image provides additional CUDA libraries that will have compatible PTX (or with the CUDA 12.9 bump, cubin kernels specialized for those newer archs).

@polarathene

Contributor

nvidia is stabilizing drivers and runtimes around the blackwell gear on the fly.

As mentioned in previous reply. Those updates should occur on your container host which has the kernel driver and CUDA driver (user-space, libcuda.so.1) AFAIK.

During compilation in the builder image, you need to be using newer APIs from the driver/runtime API for any difference there, which should call into the mentioned container host deps during runtime.

The only other relevant difference is the CC target, which for Blackwell would be leveraging PTX of a lower CC currently. Bumping to a higher version of CUDA (and thus nvcc) which can support building for Blackwell archs specifically is when those specific CC instructions would actually be leveraged.


So as far as stabilization is going, this bump shouldn't be necessary AFAIK.

You should get that implicitly (image agnostic). Only when you're running with kernels specialized to Blackwell instructions (CC adjusted for PTX/cubin at build) should you have build / runtime concerns where it's relevant.

@sempervictus

Contributor Author

nvidia is stabilizing drivers and runtimes around the blackwell gear on the fly.

As mentioned in previous reply. Those updates should occur on your container host which has the kernel driver and CUDA driver (user-space, libcuda.so.1) AFAIK.

During compilation in the builder image, you need to be using newer APIs from the driver/runtime API for any difference there, which should call into the mentioned container host deps during runtime.

The only other relevant difference is the CC target, which for Blackwell would be leveraging PTX of a lower CC currently. Bumping to a higher version of CUDA (and thus nvcc) which can support building for Blackwell archs specifically is when those specific CC instructions would actually be leveraged.

So as far as stabilization is going, this bump shouldn't be necessary AFAIK.

You should get that implicitly (image agnostic). Only when you're running with kernels specialized to Blackwell instructions (CC adjusted for PTX/cubin at build) should you have build / runtime concerns where it's relevant.

Thanks for the clarification.

The driver updates are of course on the container host as they're in ring0 past the boundary of a namespaced runtime; the commensurate CUDA updates however are what i am hoping to expose by presenting the 12.9 build and runtime environments.

To the point about bumping nvcc targets when specifically required - that's how i understand the ARG to be intended: not to prevent use by pre-cc8 systems but to enable explicit optimization for cc10 or the like as desired. I believe that's roughly the point you're making in your last sentence as well but i'm not sure how that would work if we didn't link against the relevant 12.9 bits at compile-time (symbols for newer functions, altered prototypes, etc wouldn't be known to be resolved dynamically at runtime without some dispatch magic).

I suppose my concern here is "do you see this PR introducing any problems for current use-cases or users who would get these defaults?"

@polarathene

Contributor

the commensurate CUDA updates however are what i am hoping to expose by presenting the 12.9 build and runtime environments.

Sure, but like I said:

  • Build-time:
    • CC is relevant when building kernels (.cu) into PTX (.ptx) / Cubin (.cubin), which are often embedded into the built binary. Compilation will use nvcc + gcc which must support the CC target.
    • CUDA version should only matter here mostly for the API calls, but if newer ones are not called by the program (or dependent crates), then there shouldn't be any notable difference.
    • Newer nvcc / PTX compiler may result in more improvements/fixes, but the ones regarding Blackwell support with 12.9 addressing bugs from 12.8 are AFAIK only when targeting CC 10+.
  • Run-time:
    • PTX is built for specific CC virtual arch. It can build cubin at runtime for newer GPU archs like Blackwell, runtime libs just need to provide the CUDA support for the PTX CC version (eg: compute_90) and the target CC arch (sm_100/sm_120), but if the PTX CC is lower, it will stick to compiling Cubin with the reduced CC instruction set.
    • Linked libs when compatible ABI can provide support here for newer Cubins/PTX. Completely unrelated to build-time (unless you used CudaRC static-linking mode to inline libs into the binary at build-time).

To the point about bumping nvcc targets when specifically required - that's how i understand the ARG to be intended: not to prevent use by pre-cc8 systems but to enable explicit optimization for cc10 or the like as desired.

ARG default is for when you do docker build without providing your preferred CC such as docker build --build-arg CUDA_COMPUTE_CAP=70. Since that doesn't prevent you from building with a lower CC, there is no need to lower this default to the lowest possible CC.

CC 8.0 can build sm_80 (Ampere / A100) and be compatible with sm_86 (Ampere RTX 3xxx) + sm_89 (Ada RTX 4xxx) GPUs. The major generation is forward compatible with its minors, so by default that provides the wider support for the more common GPUs that users will have, with the benefit of supporting BF16 (when used in a kernel / weights) which requires CC 8.0 to enable.

  • If you lower to CC 7.0 / 7.5, these users will not get the CC 8.0 capabilities at runtime from the PTX, instead they would need to be the ones to add --build-arg CUDA_COMPUTE_CAP=80.
  • sm_86 / sm_89 have their own optimizations that are only available when the CC is adjusted for the specific archs, such as for Ada over 2x FP32 throughput is implicit IIRC.
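
To make the compatibility rules above concrete, here is a minimal sketch. It is not taken from mistral.rs and the function names are hypothetical. It assumes CC values written the way CUDA_COMPUTE_CAP takes them (80, 86, 120, ...) and it ignores arch-specific variants such as sm_90a, which are not forward compatible.

```shell
# CC values are integers such as 80, 86, 120 (last digit = minor version).

# cubin_runs_on BUILD_CC GPU_CC
# A cubin built for sm_XY runs on GPUs with the same major version
# and a minor version >= Y (e.g. sm_80 runs on 8.6 and 8.9, not on 9.0).
cubin_runs_on() {
  [ $(( $1 / 10 )) -eq $(( $2 / 10 )) ] && [ $(( $2 % 10 )) -ge $(( $1 % 10 )) ]
}

# ptx_runs_on BUILD_CC GPU_CC
# PTX for compute_XY can be JIT-compiled for any GPU with CC >= XY,
# but only with the instruction set of the lower CC.
ptx_runs_on() {
  [ "$2" -ge "$1" ]
}
```

With these rules, `cubin_runs_on 80 89` succeeds while `cubin_runs_on 80 90` fails, so a CC 8.0 build only reaches Hopper or Blackwell through its embedded PTX (`ptx_runs_on 80 120` succeeds), and a CC 8.0 build never runs on a CC 7.5 GPU.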

This is why we have the default set at CC 8.0, but CI will build mistral.rs releases for multiple CC when publishing the images. They contain the PTX for that CC and the Cubin for that specific real arch (so for that GPU there is no need for JIT compilation of PTX => Cubin at runtime).

All I'm saying is that leaving the ARG at the default CC 8.0 makes more sense. Supporting lower CC should remain opt-in, and if you'd like official images pre-built for Volta / CC 7.0, then the correct change for that is to update the CI instead to build for that CC as well.

You will likely also want to have CI build for Blackwell CC 10.0 / CC 12.0. These CC also have some additional variations to be aware of if you're wanting to tune for performance or compatibility.


I believe that's roughly the point you're making in your last sentence as well but i'm not sure how that would work if we didn't link against the relevant 12.9 bits at compile-time (symbols for newer functions, altered prototypes, etc wouldn't be known to be resolved dynamically at runtime without some dispatch magic).

As stated, when it comes to the CUDA API, there should be no concerns here requiring CUDA 12.9 unless using newer API calls that require it. CUDA 12 series is forward compatible, the shared lib linked should look like .so.12 or similar.

The only other difference is the kernels embedded in libs like CuBLAS. Where newer PTX or Cubins for a GPU arch like Blackwell should result in improved performance. CUDA 12.8 had some bugs when building PTX for Blackwell, but PTX for earlier CC should have been fine as it wasn't using newer instructions.

I could be mistaken here, but without CUDA 12.8+ you could not build cubins for Blackwell, so the earlier PTX CC built would be reliant upon runtime having CUDA 12.8+ to JIT compile that PTX to cubin for Blackwell, and potentially with CUDA 12.8 that was buggy too. If that's the case, then yes CUDA 12.9 at runtime would make a difference when no pre-built CUDA 12.9 sm_100/sm_120 cubin was available at runtime (which presently would be the case with kernels mistral.rs builds PTX for, while CUDA 12.9 libs in the runtime image provide the rest with cubins).

Sorry for all the repetition, I feel CUDA compatibility is a bit of a messy topic to explain well when it comes to nuances like these 😓

@polarathene

polarathene commented Jun 17, 2025

Contributor

I suppose my concern here is "do you see this PR introducing any problems for current use-cases or users who would get these defaults?"

No, for most users I think it'll be fine. Anyone doing local builds relying on defaults may experience a perf regression.

I see no issues for anyone using the images the project publishes to GHCR. Those are published with the CC as the tag. I just think you should leave the CC as-is, as I don't think anyone is complaining about the default CC for builds (specific to image builds).

AFAIK, raising the CUDA version for build and runtime stages of the Dockerfile should also be fine as the major version is the same. Even if using the binary outside of the container, I think CUDA 12 is still fairly backwards compatible, but for the image published, the only thing that really changes there is libcuda.so.1.

I could do a build to verify on my 4060 with CUDA 12.4 if that'd help?


UPDATE: There may be some risk as described here: chelsea0x3b/cudarc#253 (comment)

I will need to make time to try reproduce.

@sempervictus

Contributor Author

I suppose my concern here is "do you see this PR introducing any problems for current use-cases or users who would get these defaults?"

No, for most users I think it'll be fine. Anyone doing local builds relying on defaults may experience a perf regression.

I see no issues for anyone using the images the project publishes to GHCR. Those are published with the CC as the tag. I just think you should leave the CC as-is, as I don't think anyone is complaining about the default CC for builds (specific to image builds).

AFAIK, raising the CUDA version for build and runtime stages of the Dockerfile should also be fine as the major version is the same. Even if using the binary outside of the container, I think CUDA 12 is still fairly backwards compatible, but for the image published, the only thing that really changes there is libcuda.so.1.

I could do a build to verify on my 4060 with CUDA 12.4 if that'd help?

UPDATE: There may be some risk as described here: coreylowman/cudarc#253 (comment)

I will need to make time to try reproduce.

Thanks for the clarification - performance regressions/improvements are definitely something i'd love to test but i think we would need some sort of opt-in/explicit submission performance registry/scoreboard for the project to make that idea viable (i have access to far more hardware than most but i can't really be running benchmarks on client kit without written authorization/them writing off the GPU hours involved).

In terms of the default published images, that's really what i'm shooting for here - broaden adoption. There is a huge amount of pre-bf16 metal out there and most people spinning up a container to test a runtime will just move on if it doesnt work out the gate. I think we'll get more adoption by having a broadly accepted image with clear instructions for how to rebuild for local kit (some detection along the lines of native architecture optimization/detection in conventional compilers should ease that process) than if users turn away at first glance.

In regards to chelsea0x3b/cudarc#253 (comment) - mixing and matching build/runtime should be possible with stable ABI but we've seen an immense amount of issues over the years with just source-built GPGPU software stacks (OFED, NV, NCCL, UCX, GDRCOPY, all that fun stuff) having access to each others source files at build time to say "dynamic linking works 100% correctly across the CPU/GPU ABIs." The move to DOCA and the IB-based interconnects we're seeing on the B series do not give me the warm and fuzzy for this concern in the near term :-(. I see mention of WSL there as well - my win-fu is a bit out of date past figuring out how to bypass EDR and get whatever flag's the objective, but the OS-specific driver code is more or less a framing shim for the actual driver in the device firmware and "unless all data types and structures line up perfectly between windows and linux implementations" (they do not), there will be ABI mismatch like signedness, precision, or casting outcomes which messes things up even more. I dont mean to come off as "debbie downer" on this stuff - i think the dynamic linking approach is great and the way software should work in a perfect world; i've just been dealing with this video card driver binary->code->back stuff since the 90s and "quality improvement has been incremental while complexity growth exponential."

@polarathene

Contributor

performance regressions/improvements are definitely something i'd love to test but i think we would need some sort of opt-in/explicit submission performance registry/scoreboard for the project to make that idea viable

Yeah it might be a bit more work if mistral-bench isn't sufficient. The last time I was involved in this project it was only focused on LLMs (just prior to LLMs with vision support), there are quite a few permutations to go over for that 😓

It might be easier with Burn instead of mistral.rs for such testing. They've got a more thorough benchmark suite setup with submissions and leader-board/results webpage covering various individual operations. However Burn takes a different approach to Candle AFAIK, doing JIT compilation at runtime so it may not help 😅

Easiest test we can do is the impact of CC version on CUDA kernel compilation from builds, and additionally compare runtime CUDA libs linked (CuBLAS, CuDNN, etc) in the same way ideally. mistral-bench might be sufficient for that and CC 7.0 vs CC 12.0 is perhaps wide enough of a gap that you might get a notable performance difference, but in both cases Blackwell should be able to run the two builds.


In terms of the default published images, that's really what i'm shooting for here - broaden adoption.

Oh, okay well that's quite simple but like I mentioned adjusting the default for ARG CUDA_COMPUTE_CAP in the Dockerfile has no bearing on that. We would just need to update CI here:

jobs:
  build-and-push-image:
    strategy:
      matrix:
        compute_capability: [75, 80, 86, 89, 90]

I'm presently still getting familiar with building/optimizing images for CUDA and the related CI publishing workflow. This project is one of a few I'll be contributing PRs to once I've become confident enough in that area, but from what I've noticed so far in my research, some upstream crates might need PRs raised first, at least for more optimal builds.

I do want to look into this concern about bumping CUDA to 12.9 however, I should have an answer for that in the next few days when time allows 👍


There is a huge amount of pre-bf16 metal out there and most people spinning up a container to test a runtime will just move on if it doesnt work out the gate.

I'm not disregarding that. But like I've cited previously, some CUDA kernels may not have fallback support.

AFAIK that will result in a compile time failure, the other compatibility issue can be with CUDA toolkit support (If you go as far back as CC 5.0 / Maxwell, that is CUDA 12.0+). sm_50 is the lowest NVCC will compile for with CUDA 12.x, I assume that's sufficient though? That GPU arch lacks FP16 support and VRAM is often low IIRC, so I don't know how viable it is but as long as it compiles it could be published.

Personally when I publish builds, I try to not be wasteful publishing support for builds that have no (or minimal) demand. That's just a waste of resources/energy (even if it doesn't cost me directly). I tend to defer users to documentation for custom builds until enough actual user demand is expressed to warrant broader official builds.

Presently, the bindgen_cuda crate (which compiles CUDA kernels) seems to be limited to single CC target builds. One project is building several binaries as a result with a startup script to detect which one has the highest compatible CC. Ideally this crate would allow configuring multiple PTX/Cubin kernel builds. Until then this is how that project distributes a CUDA image that provides multiple optimized builds.

I'm not sure how best to approach the UX of selecting a published CUDA image. While I think most users can select a CUDA versioned tag, what actually matters is that the image was built with support for their CC, or at least provides PTX for a compatible CC.

If we had a default :cuda "latest" tag, your concern would be better addressed there perhaps with the lower CC? (but that may also give the impression of mistral.rs as poor in performance compared to alternatives if it regresses too much)

For the most part though, they just need to check their CC and choose a CC tag that matches their GPU CC, or the next lowest if that's not available.

Should be easy to document (verified unaffected by using CUDA compat libs):

$ nvidia-smi --query-gpu=compute_cap --format=csv,noheader
8.9
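
As a rough sketch of the "matching tag, or the next lowest" rule, something like the following could go in the docs. The function name is hypothetical and the tag list is just the CI matrix quoted above; falling back to a lower tag assumes the image carries PTX that can be JIT-compiled for the newer GPU.

```shell
# pick_cc_tag GPU_CC TAG...
# GPU_CC is the nvidia-smi value (e.g. "8.9"); TAGs are published CC tags
# (e.g. 75 80 86 89 90). Prints the highest tag that does not exceed the
# GPU's CC, or fails if every tag is too new for the GPU.
pick_cc_tag() {
  gpu=$(printf '%s' "$1" | tr -d '.')   # "8.9" -> "89", "12.0" -> "120"
  shift
  best=""
  for tag in "$@"; do
    if [ "$tag" -le "$gpu" ] && { [ -z "$best" ] || [ "$tag" -gt "$best" ]; }; then
      best=$tag
    fi
  done
  [ -n "$best" ] && printf '%s\n' "$best"
}
```

For example, `pick_cc_tag "$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n1)" 75 80 86 89 90` would select 89 on the GPU above, 90 on a CC 12.0 Blackwell card, and nothing on a CC 7.0 Volta card.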

I think we'll get more adoption by having a broadly accepted image with clear instructions for how to rebuild for local kit (some detection along the lines of native architecture optimization/detection in conventional compilers should ease that process) than if users turn away at first glance.

I haven't looked at this for context outside of Docker, but I do know the Docker related docs are lacking and would like to see that improved too.

When building outside of a container, there is already some detection (similar to the nvidia-smi example above) when not configuring the related ENV being used in the container (at build time nvidia-smi is not available, nor should it be really...).

As you've stated about broader compatibility, containers often are built without native CPU instruction optimizations since that can result in incompatibility on other systems. With CC version though it seems it may have a worthwhile impact to ensure the image is built with an appropriate CC, so that should be documented (could potentially take that nvidia-smi output as a build arg when viable).
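
A minimal sketch of what "take that nvidia-smi output as a build arg" could look like. The helper name is hypothetical; CUDA_COMPUTE_CAP and Dockerfile.cuda-all are the names used in this PR. The query has to run on the host before `docker build`, since nvidia-smi is not available inside the build.

```shell
# cc_build_arg COMPUTE_CAP
# Turns the nvidia-smi value (e.g. "8.9") into a docker build argument.
# Rejects anything that does not look like MAJOR.MINOR, e.g. the empty
# string you get when nvidia-smi is missing.
cc_build_arg() {
  case $1 in
    [0-9].[0-9] | [0-9][0-9].[0-9]) ;;
    *) echo "unexpected compute_cap: '$1'" >&2; return 1 ;;
  esac
  printf '%s=%s\n' "--build-arg CUDA_COMPUTE_CAP" "$(printf '%s' "$1" | tr -d '.')"
}
```

Usage on a host with the driver installed would be roughly `docker build -f Dockerfile.cuda-all $(cc_build_arg "$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n1)") .` (the substitution is left unquoted on purpose so it splits into two arguments).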


In regards to chelsea0x3b/cudarc#253 (comment) - mixing and matching build/runtime should be possible with stable ABI but we've seen an immense amount of issues over the years with just source-built GPGPU software stacks

I can't comment much on that.

I am aware that compatibility was improved from CUDA 12, and that this will be further addressed with CUDA 13 (there's a blog post detailing it and how some changes might be breaking regarding defaults adjusted with nvcc flags).

Some issues I've come across are misconfiguration at user/developer ends from inexperience, which is fair enough (just look at the discussion we're having on the topic ha). I've learned quite a bit myself in recent weeks despite the 20 years I've had as a dev 😅


I see mention of WSL there as well - my win-fu is a bit out of date past figuring out how to bypass EDR and get whatever flag's the objective, but the OS-specific driver code is more or less a framing shim for the actual driver in the device firmware and "unless all data types and structures line up perfectly between windows and linux implementations" (they do not), there will be ABI mismatch like signedness, precision, or casting outcomes which messes things up even more.

I need reproducible examples for such personally. I'm not running into any WSL2 related issues for linux compatibility or with the CUDA driver thus far.

The way a container is provided access to the GPU was something I was more concerned about, and how that played with CUDA version of runtime images, but I think I have a handle on that now.

@polarathene

polarathene commented Jun 19, 2025

Contributor

I dont mean to come off as "debbie downer" on this stuff - i think the dynamic linking approach is great and the way software should work in a perfect world; i've just been dealing with this video card driver binary->code->back stuff since the 90s and "quality improvement has been incremental while complexity growth exponential."

I am not sure what you're trying to say here. cudarc crate provides us 3 options:

  • Dynamic Loading:
    Links to CUDA libs are not explicit, thus they're not required at runtime but the lack of visibility makes me uncomfortable with this approach.
  • Dynamic Linking:
    You'll get a clear failure if a required library is missing at runtime, preventing init of your program.
  • Static Linking:
    Unlike a full static build, this is only specifically for embedding CUDA libs while still enforcing a symlink to libcuda.so.1 (CUDA runtime) - which AFAIK prevents static binaries, some of the CUDA libraries also rely on calls like dlopen() anyway it seems so glibc shouldn't be statically linked into the binary.
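
The visibility difference can be checked from the shell. This is only a sketch with a hypothetical helper name: with dynamic linking the CUDA libs show up in `ldd` output and a missing one is reported as "not found", whereas libs that are only opened via dlopen() (dynamic loading) never appear there.

```shell
# missing_libs
# Reads `ldd` output on stdin and prints the sonames that the dynamic
# linker could not resolve (lines of the form "libfoo.so.N => not found").
missing_libs() {
  awk '/=> not found/ { print $1 }'
}
```

For example `ldd ./mistralrs-server | missing_libs` (binary path assumed) would print something like libcublas.so.12 when that lib is absent from the runtime image, and nothing when all linked libs resolve.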

Additional pros/cons of each linking method

Dynamic Loading:

  • Only seems relevant if intending to optionally support CUDA at runtime, such as having ROCm/CPU only support instead of separate builds.
  • There's also some flexibility in how it resolves libraries, but that may be more prone to mistakes too from what I'm seeing 🤔
  • The other perk is the build environment doesn't require any CUDA deps installed. So long as you control the runtime environment that can be convenient. This is strictly for cudarc mind you, when there are CUDA kernels to build (bindgen_cuda crate) you're going to need NVCC, otherwise defer to runtime with NVRTC.

Dynamic Linking:

  • Can avoid the build environment requiring the weighty CUDA libs, as nvidia does provide link-time stubs, but generally getting those stub files in the first place involves downloading the devel package for each lib 😓 So for many users that may seem redundant.
  • Some libraries are being linked unnecessarily depending on features, but this could be resolved with a PR to cudarc.
  • Far more pleasant as an end-user to ensure compatible libs in the environment than with dynamic loading.

Static Linking:

  • Since the project distributes such builds by publishing a container image, I'm not sure there is much benefit from static linking approach as all required libs will be present.
  • Beyond convenience for runtime not needing to acquire the libs (better UX for some users outside of a container), size can be optimized further by pruning away redundant CC kernels PTX/ELF(cubin) from these static libs prior to linking. This however has some caveats when the target CC is not a major version (minor zero), and I've already seen projects make this mistake.
  • Additionally requires libstdc++.a from GCC, but is linked by rustc call, so you cannot use LIBRARY_PATH ENV to find the lib, nor is it a default system one. Requires RUSTFLAGS='-L path/to/libstdc++.a' which varies by build host, adding additional friction.
  • As mentioned earlier, CUDA prevents fully static builds and is reliant upon glibc symbols, no musl compatibility. This sometimes complicates distribution / portability, but the biggest portability issue beyond CUDA concerns itself is probably min glibc version from the build host, which can be lowered if building with help of Zig instead (can target lower glibc).
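
To reduce the libstdc++.a friction, the path can be asked from GCC itself rather than hard-coded. A sketch under assumptions: the helper name is hypothetical, and it passes the directory containing the archive to -L, which is what rustc expects as a library search path.

```shell
# static_libstdcxx_flag PATH
# PATH is what `gcc -print-file-name=libstdc++.a` printed. GCC prints an
# absolute path when it finds the archive and just echoes the bare file
# name when it does not, so anything non-absolute is treated as "missing".
static_libstdcxx_flag() {
  case $1 in
    /*) printf '%s %s\n' "-L" "$(dirname "$1")" ;;
    *)  return 1 ;;
  esac
}
```

A build could then use `RUSTFLAGS="$(static_libstdcxx_flag "$(gcc -print-file-name=libstdc++.a)")" cargo build --release`, which keeps the invocation the same across build hosts.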

Slight disadvantage for dynamic linking/loading vs static linking:

  • .so libs cannot be pruned for specific CC as a disk-size optimization.
    • This is less relevant if the runtime image providing CUDA libs was shared across multiple CUDA oriented projects, however that's only really the case when you custom build for each (even if each individual project used a common base image for runtime, the SHA digest may differ at build/publish?).
    • CUDA runtime images can be costly in storage requirements, so this is sometimes an important optimization.

I think that covers the choices well and I'm leaning towards dynamic linking.

At least for personal Dockerfile (or anyone that would trust pulling from images I publish), it'd be possible to build/publish an image that packages just the stubs and any other minimal build requirements. I'm personally not a fan of 7GB+ builder images when you can accomplish the same with 10% of that, faster CI too.

@sempervictus

sempervictus commented Jun 19, 2025

Contributor Author

Apologies, tapped wrong button :-\.

The reason i harp on application binary interfaces on NV's side (specifically for Linux anyway) lies in the way they "make their GPL condom" - they hijack kernel functions in their modules hitting a few AT&K items along the way. Exactly the sort of nonsense Rust tries to intrinsically avoid (despite not actually building CFI checks yet when i last checked through LLVM's underlying logic). Even though the higher-level libraries may be well-designed, have proper interfaces, and well-versioned API/ABI - further down the call stack it all calls into "some questionable code" generally resulting in the smallest amount of divergence being between two codebases that were "built together" (575/12.9, 570/12.8, etc). That said, i dont think anyone's going to stub their toe on 575 and 12.8 or 570 and 12.9 but i wouldn't suggest running 3xx series drivers with this :-)

@polarathene

Contributor

The issue you link is for the kernel driver (nvidia.ko?), that's separate from CUDA user-space driver (libcuda.so.1). I don't have time to digest that entire issue referenced, but I assume it's nothing to do with the CUDA runtime driver but is dependent upon the GPU kernel-space driver?

  • Nvidia provides a page in their docs about the kernel + runtime driver compatibility, and notes the associated version of CUDA and where you'll need the CUDA compat package for support of newer CUDA runtime on older kernel drivers.
  • I've demonstrated and documented this with nvidia-smi output at the end of this comment. My host driver is 550 series with CUDA 12.4, yet I ran compat packages in a container to demonstrate CUDA 12.9.

Slight observation regarding runtime libs .so vs .a considerations

As we're discussing compatibility concerns, I assume this is a temporary blunder by nvidia's CI (applies to the lib package on Fedora and Ubuntu, on official images with different versions of Ubuntu/CUDA).

I haven't looked at all CUDA libs, but today I noticed cuFFT has only sm_52.cubin in the .so for dynamic linking (or dynamic loading), no PTX.

This does not affect this project, since cudarc does not have support added for this library, but:

Collapsed for brevity
# Get the cuFFT CUDA library and inspect it:
$ dnf install libcufft-devel-12-9

$ cuobjdump --list-elf --list-ptx /usr/local/cuda/lib64/libcufft_static.a | grep -oE 'sm_[0-9]*[a-z]?\.(cubin|ptx)' | sort -u --version-sort

sm_50.cubin
sm_50.ptx
sm_60.cubin
sm_70.cubin
sm_80.cubin
sm_90.cubin
sm_100.cubin
sm_101.cubin
sm_120.cubin

# The `.so` lib (which is not compatible with nvprune) provided oddly only has `sm_52.cubin`:
$ cuobjdump --list-elf --list-ptx /usr/local/cuda/lib64/libcufft.so | grep -oE 'sm_[0-9]*[a-z]?\.(cubin|ptx)' | sort -u --version-sort

cuobjdump info    : No PTX file found to extract from '/usr/local/cuda/lib64/libcufft.so'. You may try with -all option.
sm_52.cubin

$ du --bytes --si /usr/local/cuda-12.9/targets/x86_64-linux/lib/libcufft.so
290M    /usr/local/cuda-12.9/targets/x86_64-linux/lib/libcufft.so

$ du --bytes --si /usr/local/cuda-12.9/targets/x86_64-linux/lib/libcufft_static.a
298M    /usr/local/cuda-12.9/targets/x86_64-linux/lib/libcufft_static.a

Could be a bug in cuobjdump given the disk size, only way to find out I guess is to write a program that uses that library. It also has the PTX embedded for lowest supported CC 5.0, thus newer GPUs would not be able to use F16 or BF16 optimizations, even when using the static linking method.

Meanwhile libcublas.so is in better shape, same cubins/ptx as the static lib, and the PTX is for the latest CC supported:

# NOTE: This adds 2GB to disk:
$ dnf install libcublas-devel-12-9

# Compatibility of kernels listed is equivalent to the static lib:
$ cuobjdump --list-elf --list-ptx /usr/local/cuda/lib64/libcublas.so | grep -oE 'sm_[0-9]*[a-z]?\.(cubin|ptx)' | sort -u --version-sort

sm_50.cubin
sm_50a.cubin
sm_60.cubin
sm_60a.cubin
sm_61.cubin
sm_61a.cubin
sm_70.cubin
sm_75.cubin
sm_80.cubin
sm_86.cubin
sm_90.cubin
sm_100.cubin
sm_120.cubin
sm_120.ptx

I have seen another Rust project with a user report about worse performance than PyTorch (which builds and bundles their own copy of CUDA libs to distribute), I wonder if it's related to these differences 😅


UPDATE: The following checks each lib:

ls -1 /usr/local/cuda/lib64/*.so | xargs -n1 bash -c 'echo "File: ${0}" && cuobjdump --list-elf --list-ptx ${0} | grep -oE "sm_[0-9]*[a-z]?\.(cubin|ptx)" | sort -u --version-sort && echo "---"'

Doesn't look like any of it should be concerning, slight differences to keep in mind. There are additional libs with only sm_52 cubin, but they do not appear relevant to cudarc.

There is cufile (GPU direct storage API, not supported by cudarc yet), that seems to exclude > CC 8.6 (provides CC 8.0 PTX for forward compatibility) and min CC 6.0 (feature introduced).
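
To skim that listing faster, each lib can be reduced to a one-line summary. This is a sketch with a hypothetical helper name; it expects the already filtered `sm_XX.cubin` / `sm_XX.ptx` lines on stdin, as produced by the grep in the command above. A lib that reports cubin-only has no PTX to JIT for archs newer than its cubins.

```shell
# summarize_device_code
# stdin: one "sm_<cc>.cubin" or "sm_<cc>.ptx" entry per line.
# Prints "ptx:<highest PTX CC>" if any PTX is embedded, "cubin-only" if
# there are only cubins, or "no-device-code" if the listing is empty.
summarize_device_code() {
  awk -F'[_.]' '
    /^sm_[0-9]+[a-z]?\.cubin$/ { cubin++ }
    /^sm_[0-9]+[a-z]?\.ptx$/   { cc = $2 + 0; if (cc > max) max = cc; ptx++ }
    END {
      if (ptx)        print "ptx:" max
      else if (cubin) print "cubin-only"
      else            print "no-device-code"
    }'
}
```

Usage would be `cuobjdump --list-elf --list-ptx /usr/local/cuda/lib64/libcublas.so | grep -oE 'sm_[0-9]*[a-z]?\.(cubin|ptx)' | summarize_device_code`, which for the libcublas.so listing above gives ptx:120, while the libcufft.so listing gives cubin-only.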

Command output for reference (collapsed for brevity)
File: /usr/local/cuda/lib64/libaccinj64.so
cuobjdump info    : No PTX file found to extract from '/usr/local/cuda/lib64/libaccinj64.so'. You may try with -all option.
sm_52.cubin
---
File: /usr/local/cuda/lib64/libcheckpoint.so
cuobjdump info    : File '/usr/local/cuda/lib64/libcheckpoint.so' does not contain device code
---
File: /usr/local/cuda/lib64/libcublas.so
sm_50.cubin
sm_50a.cubin
sm_60.cubin
sm_60a.cubin
sm_61.cubin
sm_61a.cubin
sm_70.cubin
sm_75.cubin
sm_80.cubin
sm_86.cubin
sm_90.cubin
sm_100.cubin
sm_120.cubin
sm_120.ptx
---
File: /usr/local/cuda/lib64/libcublasLt.so
sm_50.cubin
sm_52.ptx
sm_60.cubin
sm_61.cubin
sm_70.cubin
sm_75.cubin
sm_80.cubin
sm_86.cubin
sm_89.cubin
sm_90.cubin
sm_90a.cubin
sm_100.cubin
sm_103.cubin
sm_120.cubin
sm_120.ptx
sm_121.cubin
---
File: /usr/local/cuda/lib64/libcudart.so
cuobjdump info    : File '/usr/local/cuda/lib64/libcudart.so' does not contain device code
---
File: /usr/local/cuda/lib64/libcufft.so
cuobjdump info    : No PTX file found to extract from '/usr/local/cuda/lib64/libcufft.so'. You may try with -all option.
sm_52.cubin
---
File: /usr/local/cuda/lib64/libcufftw.so
cuobjdump info    : No PTX file found to extract from '/usr/local/cuda/lib64/libcufftw.so'. You may try with -all option.
sm_52.cubin
---
File: /usr/local/cuda/lib64/libcufile.so
sm_60.cubin
sm_61.cubin
sm_62.cubin
sm_70.cubin
sm_72.cubin
sm_75.cubin
sm_80.cubin
sm_80.ptx
sm_86.cubin
---
File: /usr/local/cuda/lib64/libcufile_rdma.so
cuobjdump info    : File '/usr/local/cuda/lib64/libcufile_rdma.so' does not contain device code
---
File: /usr/local/cuda/lib64/libcuinj64.so
cuobjdump info    : File '/usr/local/cuda/lib64/libcuinj64.so' does not contain device code
---
File: /usr/local/cuda/lib64/libcupti.so
cuobjdump info    : File '/usr/local/cuda/lib64/libcupti.so' does not contain device code
---
File: /usr/local/cuda/lib64/libcurand.so
sm_50.cubin
sm_60.cubin
sm_70.cubin
sm_75.cubin
sm_80.cubin
sm_86.cubin
sm_89.cubin
sm_90.cubin
sm_100.cubin
sm_101.cubin
sm_103.cubin
sm_120.cubin
sm_121.cubin
sm_121.ptx
---
File: /usr/local/cuda/lib64/libcusolver.so
sm_50.cubin
sm_60.cubin
sm_61.cubin
sm_70.cubin
sm_75.cubin
sm_80.cubin
sm_80.ptx
sm_86.cubin
sm_86.ptx
sm_89.cubin
sm_89.ptx
sm_90.cubin
sm_90.ptx
sm_100.cubin
sm_100.ptx
sm_101.cubin
sm_101.ptx
sm_103.cubin
sm_103.ptx
sm_120.cubin
sm_120.ptx
---
File: /usr/local/cuda/lib64/libcusolverMg.so
sm_50.cubin
sm_60.cubin
sm_61.cubin
sm_70.cubin
sm_75.cubin
sm_80.cubin
sm_80.ptx
sm_86.cubin
sm_86.ptx
sm_89.cubin
sm_89.ptx
sm_90.cubin
sm_90.ptx
sm_100.cubin
sm_100.ptx
sm_101.cubin
sm_101.ptx
sm_103.cubin
sm_103.ptx
sm_120.cubin
sm_120.ptx
---
File: /usr/local/cuda/lib64/libcusparse.so
sm_50.cubin
sm_50.ptx
sm_52.cubin
sm_52.ptx
sm_60.cubin
sm_60.ptx
sm_61.cubin
sm_61.ptx
sm_70.cubin
sm_75.cubin
sm_80.cubin
sm_86.cubin
sm_89.cubin
sm_90.cubin
sm_90.ptx
sm_100.cubin
sm_100.ptx
sm_101.cubin
sm_101.ptx
sm_103.cubin
sm_103.ptx
sm_120.cubin
sm_120.ptx
sm_121.cubin
sm_121.ptx
---
File: /usr/local/cuda/lib64/libnppc.so
cuobjdump info    : No PTX file found to extract from '/usr/local/cuda/lib64/libnppc.so'. You may try with -all option.
sm_52.cubin
---
File: /usr/local/cuda/lib64/libnppial.so
sm_50.cubin
sm_60.cubin
sm_61.cubin
sm_70.cubin
sm_75.cubin
sm_80.cubin
sm_86.cubin
sm_89.cubin
sm_90.cubin
sm_100.cubin
sm_101.cubin
sm_103.cubin
sm_120.cubin
sm_121.cubin
sm_121.ptx
---
File: /usr/local/cuda/lib64/libnppicc.so
sm_50.cubin
sm_60.cubin
sm_61.cubin
sm_70.cubin
sm_75.cubin
sm_80.cubin
sm_86.cubin
sm_89.cubin
sm_90.cubin
sm_100.cubin
sm_101.cubin
sm_103.cubin
sm_120.cubin
sm_121.cubin
sm_121.ptx
---
File: /usr/local/cuda/lib64/libnppidei.so
sm_50.cubin
sm_60.cubin
sm_61.cubin
sm_70.cubin
sm_75.cubin
sm_80.cubin
sm_86.cubin
sm_89.cubin
sm_90.cubin
sm_100.cubin
sm_101.cubin
sm_103.cubin
sm_120.cubin
sm_121.cubin
sm_121.ptx
---
File: /usr/local/cuda/lib64/libnppif.so
sm_50.cubin
sm_60.cubin
sm_61.cubin
sm_70.cubin
sm_75.cubin
sm_80.cubin
sm_86.cubin
sm_89.cubin
sm_90.cubin
sm_100.cubin
sm_101.cubin
sm_103.cubin
sm_120.cubin
sm_121.cubin
sm_121.ptx
---
File: /usr/local/cuda/lib64/libnppig.so
sm_50.cubin
sm_60.cubin
sm_61.cubin
sm_70.cubin
sm_75.cubin
sm_80.cubin
sm_86.cubin
sm_89.cubin
sm_90.cubin
sm_100.cubin
sm_101.cubin
sm_103.cubin
sm_120.cubin
sm_121.cubin
sm_121.ptx
---
File: /usr/local/cuda/lib64/libnppim.so
sm_50.cubin
sm_60.cubin
sm_61.cubin
sm_70.cubin
sm_75.cubin
sm_80.cubin
sm_86.cubin
sm_89.cubin
sm_90.cubin
sm_100.cubin
sm_101.cubin
sm_103.cubin
sm_120.cubin
sm_121.cubin
sm_121.ptx
---
File: /usr/local/cuda/lib64/libnppist.so
sm_50.cubin
sm_60.cubin
sm_61.cubin
sm_70.cubin
sm_75.cubin
sm_80.cubin
sm_86.cubin
sm_89.cubin
sm_90.cubin
sm_100.cubin
sm_101.cubin
sm_103.cubin
sm_120.cubin
sm_121.cubin
sm_121.ptx
---
File: /usr/local/cuda/lib64/libnppisu.so
cuobjdump info    : No PTX file found to extract from '/usr/local/cuda/lib64/libnppisu.so'. You may try with -all option.
sm_52.cubin
---
File: /usr/local/cuda/lib64/libnppitc.so
sm_50.cubin
sm_60.cubin
sm_61.cubin
sm_70.cubin
sm_75.cubin
sm_80.cubin
sm_86.cubin
sm_89.cubin
sm_90.cubin
sm_100.cubin
sm_101.cubin
sm_103.cubin
sm_120.cubin
sm_121.cubin
sm_121.ptx
---
File: /usr/local/cuda/lib64/libnpps.so
sm_50.cubin
sm_60.cubin
sm_61.cubin
sm_70.cubin
sm_75.cubin
sm_80.cubin
sm_86.cubin
sm_89.cubin
sm_90.cubin
sm_100.cubin
sm_101.cubin
sm_103.cubin
sm_120.cubin
sm_121.cubin
sm_121.ptx
---
File: /usr/local/cuda/lib64/libnvJitLink.so
cuobjdump info    : File '/usr/local/cuda/lib64/libnvJitLink.so' does not contain device code
---
File: /usr/local/cuda/lib64/libnvblas.so
cuobjdump info    : File '/usr/local/cuda/lib64/libnvblas.so' does not contain device code
---
File: /usr/local/cuda/lib64/libnvfatbin.so
cuobjdump info    : File '/usr/local/cuda/lib64/libnvfatbin.so' does not contain device code
---
File: /usr/local/cuda/lib64/libnvjpeg.so
sm_50.cubin
sm_52.cubin
sm_60.cubin
sm_61.cubin
sm_70.cubin
sm_75.cubin
sm_80.cubin
sm_86.cubin
sm_89.cubin
sm_90.cubin
sm_100.cubin
sm_101.cubin
sm_103.cubin
sm_120.cubin
sm_121.cubin
sm_121.ptx
---
File: /usr/local/cuda/lib64/libnvperf_host.so
cuobjdump info    : File '/usr/local/cuda/lib64/libnvperf_host.so' does not contain device code
---
File: /usr/local/cuda/lib64/libnvperf_target.so
cuobjdump info    : File '/usr/local/cuda/lib64/libnvperf_target.so' does not contain device code
---
File: /usr/local/cuda/lib64/libnvrtc-builtins.alt.so
cuobjdump info    : File '/usr/local/cuda/lib64/libnvrtc-builtins.alt.so' does not contain device code
---
File: /usr/local/cuda/lib64/libnvrtc-builtins.so
cuobjdump info    : File '/usr/local/cuda/lib64/libnvrtc-builtins.so' does not contain device code
---
File: /usr/local/cuda/lib64/libnvrtc.alt.so
cuobjdump info    : File '/usr/local/cuda/lib64/libnvrtc.alt.so' does not contain device code
---
File: /usr/local/cuda/lib64/libnvrtc.so
cuobjdump info    : File '/usr/local/cuda/lib64/libnvrtc.so' does not contain device code
---
File: /usr/local/cuda/lib64/libnvtx3interop.so
cuobjdump info    : File '/usr/local/cuda/lib64/libnvtx3interop.so' does not contain device code
---
File: /usr/local/cuda/lib64/libpcsamplingutil.so
cuobjdump info    : File '/usr/local/cuda/lib64/libpcsamplingutil.so' does not contain device code
---

Equivalent for static libs (.a):

File: /usr/local/cuda/lib64/libcublasLt_static.a
sm_50.cubin
sm_52.ptx
sm_60.cubin
sm_61.cubin
sm_70.cubin
sm_75.cubin
sm_80.cubin
sm_86.cubin
sm_89.cubin
sm_90.cubin
sm_90a.cubin
sm_100.cubin
sm_103.cubin
sm_120.cubin
sm_120.ptx
sm_121.cubin
---
File: /usr/local/cuda/lib64/libcublas_static.a
sm_50.cubin
sm_50a.cubin
sm_60.cubin
sm_60a.cubin
sm_61.cubin
sm_61a.cubin
sm_70.cubin
sm_75.cubin
sm_80.cubin
sm_86.cubin
sm_90.cubin
sm_100.cubin
sm_120.cubin
sm_120.ptx
---
File: /usr/local/cuda/lib64/libcudadevrt.a
sm_50.cubin
sm_52.cubin
sm_60.cubin
sm_61.cubin
sm_70.cubin
sm_75.cubin
sm_80.cubin
sm_86.cubin
sm_89.cubin
sm_90.cubin
sm_100.cubin
sm_101.cubin
sm_103.cubin
sm_120.cubin
sm_121.cubin
sm_121.ptx
---
File: /usr/local/cuda/lib64/libcudart_static.a
cuobjdump info    : File '/usr/local/cuda/lib64/libcudart_static.a' does not contain device code
---
File: /usr/local/cuda/lib64/libcufft_static.a
sm_50.cubin
sm_50.ptx
sm_60.cubin
sm_70.cubin
sm_80.cubin
sm_90.cubin
sm_100.cubin
sm_101.cubin
sm_120.cubin
---
File: /usr/local/cuda/lib64/libcufft_static_nocallback.a
cuobjdump info    : File '/usr/local/cuda/lib64/libcufft_static_nocallback.a' does not contain device code
---
File: /usr/local/cuda/lib64/libcufftw_static.a
cuobjdump info    : File '/usr/local/cuda/lib64/libcufftw_static.a' does not contain device code
---
File: /usr/local/cuda/lib64/libcufile_rdma_static.a
cuobjdump info    : File '/usr/local/cuda/lib64/libcufile_rdma_static.a' does not contain device code
---
File: /usr/local/cuda/lib64/libcufile_static.a
sm_60.cubin
sm_61.cubin
sm_62.cubin
sm_70.cubin
sm_72.cubin
sm_75.cubin
sm_80.cubin
sm_80.ptx
sm_86.cubin
---
File: /usr/local/cuda/lib64/libcufilt.a
cuobjdump info    : File '/usr/local/cuda/lib64/libcufilt.a' does not contain device code
---
File: /usr/local/cuda/lib64/libculibos.a
cuobjdump info    : File '/usr/local/cuda/lib64/libculibos.a' does not contain device code
---
File: /usr/local/cuda/lib64/libcupti_static.a
cuobjdump info    : File '/usr/local/cuda/lib64/libcupti_static.a' does not contain device code
---
File: /usr/local/cuda/lib64/libcurand_static.a
sm_50.cubin
sm_60.cubin
sm_70.cubin
sm_75.cubin
sm_80.cubin
sm_86.cubin
sm_89.cubin
sm_90.cubin
sm_100.cubin
sm_101.cubin
sm_103.cubin
sm_120.cubin
sm_121.cubin
sm_121.ptx
---
File: /usr/local/cuda/lib64/libcusolver_lapack_static.a
sm_50.cubin
sm_60.cubin
sm_61.cubin
sm_70.cubin
sm_75.cubin
sm_80.cubin
sm_86.cubin
sm_89.cubin
sm_90.cubin
sm_100.cubin
sm_101.cubin
sm_103.cubin
sm_120.cubin
sm_120.ptx
---
File: /usr/local/cuda/lib64/libcusolver_metis_static.a
cuobjdump info    : File '/usr/local/cuda/lib64/libcusolver_metis_static.a' does not contain device code
---
File: /usr/local/cuda/lib64/libcusolver_static.a
sm_50.cubin
sm_60.cubin
sm_61.cubin
sm_70.cubin
sm_75.cubin
sm_80.cubin
sm_80.ptx
sm_86.cubin
sm_86.ptx
sm_89.cubin
sm_89.ptx
sm_90.cubin
sm_90.ptx
sm_100.cubin
sm_100.ptx
sm_101.cubin
sm_101.ptx
sm_103.cubin
sm_103.ptx
sm_120.cubin
sm_120.ptx
---
File: /usr/local/cuda/lib64/libcusparse_static.a
sm_50.cubin
sm_50.ptx
sm_52.cubin
sm_52.ptx
sm_60.cubin
sm_60.ptx
sm_61.cubin
sm_61.ptx
sm_70.cubin
sm_75.cubin
sm_80.cubin
sm_86.cubin
sm_89.cubin
sm_90.cubin
sm_90.ptx
sm_100.cubin
sm_100.ptx
sm_101.cubin
sm_101.ptx
sm_103.cubin
sm_103.ptx
sm_120.cubin
sm_120.ptx
sm_121.cubin
sm_121.ptx
---
File: /usr/local/cuda/lib64/libmetis_static.a
cuobjdump info    : File '/usr/local/cuda/lib64/libmetis_static.a' does not contain device code
---
File: /usr/local/cuda/lib64/libnppc_static.a
cuobjdump info    : File '/usr/local/cuda/lib64/libnppc_static.a' does not contain device code
---
File: /usr/local/cuda/lib64/libnppial_static.a
sm_50.cubin
sm_60.cubin
sm_61.cubin
sm_70.cubin
sm_75.cubin
sm_80.cubin
sm_86.cubin
sm_89.cubin
sm_90.cubin
sm_100.cubin
sm_101.cubin
sm_103.cubin
sm_120.cubin
sm_121.cubin
sm_121.ptx
---
File: /usr/local/cuda/lib64/libnppicc_static.a
sm_50.cubin
sm_60.cubin
sm_61.cubin
sm_70.cubin
sm_75.cubin
sm_80.cubin
sm_86.cubin
sm_89.cubin
sm_90.cubin
sm_100.cubin
sm_101.cubin
sm_103.cubin
sm_120.cubin
sm_121.cubin
sm_121.ptx
---
File: /usr/local/cuda/lib64/libnppidei_static.a
sm_50.cubin
sm_60.cubin
sm_61.cubin
sm_70.cubin
sm_75.cubin
sm_80.cubin
sm_86.cubin
sm_89.cubin
sm_90.cubin
sm_100.cubin
sm_101.cubin
sm_103.cubin
sm_120.cubin
sm_121.cubin
sm_121.ptx
---
File: /usr/local/cuda/lib64/libnppif_static.a
sm_50.cubin
sm_60.cubin
sm_61.cubin
sm_70.cubin
sm_75.cubin
sm_80.cubin
sm_86.cubin
sm_89.cubin
sm_90.cubin
sm_100.cubin
sm_101.cubin
sm_103.cubin
sm_120.cubin
sm_121.cubin
sm_121.ptx
---
File: /usr/local/cuda/lib64/libnppig_static.a
sm_50.cubin
sm_60.cubin
sm_61.cubin
sm_70.cubin
sm_75.cubin
sm_80.cubin
sm_86.cubin
sm_89.cubin
sm_90.cubin
sm_100.cubin
sm_101.cubin
sm_103.cubin
sm_120.cubin
sm_121.cubin
sm_121.ptx
---
File: /usr/local/cuda/lib64/libnppim_static.a
sm_50.cubin
sm_60.cubin
sm_61.cubin
sm_70.cubin
sm_75.cubin
sm_80.cubin
sm_86.cubin
sm_89.cubin
sm_90.cubin
sm_100.cubin
sm_101.cubin
sm_103.cubin
sm_120.cubin
sm_121.cubin
sm_121.ptx
---
File: /usr/local/cuda/lib64/libnppist_static.a
sm_50.cubin
sm_60.cubin
sm_61.cubin
sm_70.cubin
sm_75.cubin
sm_80.cubin
sm_86.cubin
sm_89.cubin
sm_90.cubin
sm_100.cubin
sm_101.cubin
sm_103.cubin
sm_120.cubin
sm_121.cubin
sm_121.ptx
---
File: /usr/local/cuda/lib64/libnppisu_static.a
cuobjdump info    : File '/usr/local/cuda/lib64/libnppisu_static.a' does not contain device code
---
File: /usr/local/cuda/lib64/libnppitc_static.a
sm_50.cubin
sm_60.cubin
sm_61.cubin
sm_70.cubin
sm_75.cubin
sm_80.cubin
sm_86.cubin
sm_89.cubin
sm_90.cubin
sm_100.cubin
sm_101.cubin
sm_103.cubin
sm_120.cubin
sm_121.cubin
sm_121.ptx
---
File: /usr/local/cuda/lib64/libnpps_static.a
sm_50.cubin
sm_60.cubin
sm_61.cubin
sm_70.cubin
sm_75.cubin
sm_80.cubin
sm_86.cubin
sm_89.cubin
sm_90.cubin
sm_100.cubin
sm_101.cubin
sm_103.cubin
sm_120.cubin
sm_121.cubin
sm_121.ptx
---
File: /usr/local/cuda/lib64/libnvJitLink_static.a
cuobjdump info    : File '/usr/local/cuda/lib64/libnvJitLink_static.a' does not contain device code
---
File: /usr/local/cuda/lib64/libnvfatbin_static.a
cuobjdump info    : File '/usr/local/cuda/lib64/libnvfatbin_static.a' does not contain device code
---
File: /usr/local/cuda/lib64/libnvjpeg_static.a
sm_50.cubin
sm_52.cubin
sm_60.cubin
sm_61.cubin
sm_70.cubin
sm_75.cubin
sm_80.cubin
sm_86.cubin
sm_89.cubin
sm_90.cubin
sm_100.cubin
sm_101.cubin
sm_103.cubin
sm_120.cubin
sm_121.cubin
sm_121.ptx
---
File: /usr/local/cuda/lib64/libnvperf_host_static.a
cuobjdump info    : File '/usr/local/cuda/lib64/libnvperf_host_static.a' does not contain device code
---
File: /usr/local/cuda/lib64/libnvptxcompiler_static.a
cuobjdump info    : File '/usr/local/cuda/lib64/libnvptxcompiler_static.a' does not contain device code
---
File: /usr/local/cuda/lib64/libnvrtc-builtins_static.a
cuobjdump info    : File '/usr/local/cuda/lib64/libnvrtc-builtins_static.a' does not contain device code
---
File: /usr/local/cuda/lib64/libnvrtc-builtins_static.alt.a
cuobjdump info    : File '/usr/local/cuda/lib64/libnvrtc-builtins_static.alt.a' does not contain device code
---
File: /usr/local/cuda/lib64/libnvrtc_static.a
cuobjdump info    : File '/usr/local/cuda/lib64/libnvrtc_static.a' does not contain device code
---
File: /usr/local/cuda/lib64/libnvrtc_static.alt.a
cuobjdump info    : File '/usr/local/cuda/lib64/libnvrtc_static.alt.a' does not contain device code
---

@polarathene polarathene left a comment

Contributor

As the Dockerfile has changed a bit, and the discussion here is rather long-winded, you may want to progress with a new PR(s)?

Current feedback:

  • The nccl feature addition:
    • The nccl symlink seems redundant? Remove it.
    • Adding the nccl feature to the build image + openmpi-bin package to the runtime image might be better handled in a separate PR?
  • Bumping the CUDA version for:
    • The build image is lacking evidence of a meaningful change
      • Introduces a regression? (PTX cannot be used by an earlier CUDA runtime, even if CC for that PTX were compatible)
      • EDIT: Relevance would be for Blackwell when building kernels for CC 10.0 + 12.0, instead of relying on PTX from earlier build.
    • The runtime image should be ok. This will provide official CUDA libs linked with Blackwell cubins + PTX for potentially better performance. You may prefer to use 12.9.1 here as Nvidia doesn't seem to publish a 12.9 (major + minor only) tag.
  • Both build and runtime images could bump the base distro from Ubuntu 22.04 to 24.04. That should be fine given the runtime environment would be compatible, it's only the GPU driver that's a concern.

Comment thread Dockerfile.cuda-all
Comment on lines +77 to +78
# Only the `devel` builder image provides symlinks, restore the `libnccl.so` symlink:
RUN ln -s libnccl.so.2 /usr/lib/x86_64-linux-gnu/libnccl.so

Contributor

I don't think this is necessary, could you please verify on your end and report any failure details if it is?


ldd /usr/local/bin/mistralrs-server
        linux-vdso.so.1 (0x00007fff58725000)
        libcudart.so.12 => /usr/local/cuda/lib64/libcudart.so.12 (0x00007e24e2600000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007e24e2382000)
        libssl.so.3 => /lib/x86_64-linux-gnu/libssl.so.3 (0x00007e24e22d8000)
        libcrypto.so.3 => /lib/x86_64-linux-gnu/libcrypto.so.3 (0x00007e24e1dc5000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007e24e1d97000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007e24e1cae000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007e24e1a9c000)
        /lib64/ld-linux-x86-64.so.2 (0x00007e24e9db5000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007e24e28b5000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007e24e1a97000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007e24e1a92000)

The unversioned symlinks should only be necessary for the build stage: the linker searches for them, resolves the symlink to the real file, and reads that file's DT_SONAME entry to record what to link against at runtime.

In fact, the whole earlier RUN for creating unversioned symlinks in the runtime image is unnecessary. I can only assume it was added in the past for dynamic loading, without knowing any better. Typically these unversioned lib symlinks don't exist on production systems.
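To check this on a given system, the dynamic section of a library or binary can be inspected with `readelf -d`. The following is a minimal sketch, assuming binutils is installed; `parse_soname` and `parse_needed` are hypothetical helper names, and the paths in the usage comments are only examples.

```shell
#!/bin/sh
# Extract the SONAME from `readelf -d` output supplied on stdin.
parse_soname() {
    sed -n 's/.*(SONAME).*\[\(.*\)\].*/\1/p'
}

# Extract the NEEDED entries (what the loader resolves at runtime) from
# `readelf -d` output supplied on stdin.
parse_needed() {
    sed -n 's/.*(NEEDED).*\[\(.*\)\].*/\1/p'
}

# Example usage (paths are assumptions, adjust for your image):
#   readelf -d /usr/lib/x86_64-linux-gnu/libnccl.so.2 | parse_soname
#   readelf -d /usr/local/bin/mistralrs-server | parse_needed
```

If the binary's NEEDED entries only name versioned files such as `libnccl.so.2`, the unversioned `libnccl.so` symlink plays no role at runtime.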

Comment thread Dockerfile.cuda-all Outdated
# syntax=docker/dockerfile:1

- FROM nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04 AS builder
+ FROM nvidia/cuda:12.9.0-cudnn-devel-ubuntu22.04 AS builder

@polarathene polarathene Jun 21, 2025

Contributor

TL;DR:

  • Raising this minor version risks regressions for runtime usage when the host CUDA driver is older than the CUDA used for build.
  • I don't think we should do this without evidence that can be reproduced to justify it.

I have since learned that this makes PTX compiled during the build incompatible at runtime with any CUDA version older than the one that nvcc came from.

That is, if you build with CUDA 12.9, but your Blackwell RTX 5xxx GPU at runtime is using CUDA 12.8 libcuda.so.1, then it cannot use the PTX built from CUDA 12.9, even when the CC was the same CC as your GPU or lower.

If your build environment uses a CUDA version older than or equal to your runtime's, then the PTX is compatible. So by raising the build version too high, you introduce a regression for PTX usage that should otherwise be viable, without any evidence of a meaningful benefit. From what I've seen, this type of problem contributes to less-than-obvious bug reports, where troubleshooting is difficult due to a lack of awareness about this type of compatibility concern.

For reference, I tried the runtime image's CUDA 12.9 compat package libcuda.so.1 via LD_LIBRARY_PATH=/usr/local/cuda/compat. While this has a noticeable effect on the CUDA version reported by nvidia-smi --version, on my system (RTX 4060 / sm_89 + CUDA 12.4) the compat lib prevents the CUDA programs I've tried (basic kernels with just nvcc or cudarc) from running, stating that no CUDA device was detected.

This change will thus only allow major version forward compatibility within the same cubin CC built. However since bindgen_cuda presently lacks building for more than one CC, projects like mistral.rs and TEI are building/publishing multiple bins as a workaround, which both currently can leverage the PTX for newer GPU support.
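The compatibility rule described in this comment can be sketched as a simple version comparison. This only illustrates the rule as stated here (PTX is usable when the CUDA version used for the build is not newer than the one provided at runtime); `ptx_usable` is a hypothetical helper, not part of any CUDA tooling, and `sort -V` assumes GNU coreutils.

```shell
#!/bin/sh
# ptx_usable BUILD_CUDA RUNTIME_CUDA
# Succeeds when the build CUDA version is older than or equal to the runtime
# CUDA version, i.e. when PTX from that build could still be used.
ptx_usable() {
    build_ver="$1"
    runtime_ver="$2"
    # The lowest of the two versions must be the build version.
    lowest="$(printf '%s\n%s\n' "$build_ver" "$runtime_ver" | sort -V | head -n 1)"
    [ "$lowest" = "$build_ver" ]
}

# The example from the comment above: built with 12.9, runtime offers 12.8.
if ptx_usable 12.9 12.8; then
    echo "PTX usable"
else
    echo "PTX not usable: build CUDA is newer than runtime CUDA"
fi
```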

Comment thread Dockerfile.cuda-all
# Rayon threads are limited to minimize memory requirements in CI, avoiding OOM
# Rust threads are increased with a nightly feature for faster compilation (single-threaded by default)
- ARG CUDA_COMPUTE_CAP=80
+ ARG CUDA_COMPUTE_CAP=70

Contributor

As discussed earlier, this change seems unnecessary. Resolve it via CI + documentation instead.

Just add 70 to this list to get an image build for CC 7.0:

compute_capability: [75, 80, 86, 89, 90]

@polarathene

Contributor

From the recent commit message for better visibility:

Specify image source repository in Podman-compatible syntax - they do not presume docker.io prefix for images but do support NV GPU toolkit through CDI.

Expand the cuda-all Dockerfile with MKL support:

  1. Intel repo setup for build and runtime containers
  2. Dev and runtime tool deployment for relevant steps of the build.

This may be more appropriate in some sort of Dockerfile.accel-all instead of cuda-all since it's an acceleration library which can use NV cards but seems intended to target Intel iGPU.

  • 👍 Indicating the image registry for Container Engines that lack a default registry to fallback on is fine
  • 👎 Intel/MKL is a separate support case and should probably live in its own image. CUDA tends to be large enough as it is, and those interested in MKL support likely aren't interested in all the weight of the CUDA image either. Keeping it split can be better for maintenance; see the issues over at the TEI repo, where someone introduced MKL x86_64 to the default CPU Dockerfile, and the problems that is causing for environments like Apple aarch64 devices.

@sempervictus

Contributor Author

@EricLBuehler should we include flash-attn in the build? Will it be automatically disabled for V7 or does it become a runtime dependency?

This was referenced Jul 16, 2025
@sempervictus sempervictus force-pushed the feature/cuda_docker_update branch 3 times, most recently from ce7caf3 to 4d45bfd Compare July 24, 2025 03:11
@sempervictus

Contributor Author

With the recent 12.9 enablement, this should now build correctly to the local target using docker build . -f Dockerfile.cuda-all -t mistralrs:$(today)-all --build-arg CUDA_COMPUTE_CAP="$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader,nounits | sed 's|\.||')"
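The conversion done by the `sed` call in that command can be sketched as a small helper. This is a minimal illustration; `cap_to_arg` is a hypothetical name, and taking only the first line is an assumption for hosts where `nvidia-smi` reports more than one GPU.

```shell
#!/bin/sh
# cap_to_arg "8.9" -> "89"
# Turns a compute capability as printed by nvidia-smi into the value
# expected by the CUDA_COMPUTE_CAP build argument.
cap_to_arg() {
    # Keep the first line only (first GPU), then drop the dot.
    printf '%s\n' "$1" | head -n 1 | tr -d '.'
}

cap_to_arg "7.0"

# Possible usage with the build command above (not executed here):
#   docker build . -f Dockerfile.cuda-all \
#     --build-arg CUDA_COMPUTE_CAP="$(cap_to_arg "$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader,nounits)")"
```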

@polarathene - on the CC70/80/100 default value aspect of this: short of doing matrix builds to see which ones can still run which models on which HW, is there a good mechanical way (even if we have to have cited references by some research-prompted model to the NV docs/CUDA sources for posterity) to figure out which default will give us the broadest coverage without forcing data types or associated features to be inlined where they might crash on an older system without that capability?

@polarathene

Contributor

to figure out which default will give us the broadest coverage without forcing data types or associated features to be inlined where they might crash on an older system without that capability?

Most projects in this space that I've come across seem to have CC 7.5 as the lowest or even CC 8.0. It can be as low as you can support, but that introduces regressions in performance AFAIK when the hardware is more capable.

I don't have the hardware to compare personally, but I also don't see why those on older or newer hardware would need support from the default build when all they need to do is adjust the image tag. The CI already has matrix builds to build for different CCs and tag them.

Just like how we have many distros moving forward with building CPU packages for x86_64 targeting the v3 level as a default, but not the newer v4.

@polarathene polarathene left a comment

Contributor

I don't know why you added && \ to the end of the package install commands? Adding && after apt-get install is unnecessary, what actual problem are you trying to solve? Please revert that, there is a better way to resolve your concern (FWIW, if you were to bring back &&, you weren't covering apt-get update potentially failing either, and if you did we lose most benefits from HereDoc usage, vs fixing this properly by adding SHELL).

You shouldn't need to create the nccl symlink either? Please remove that addition after verifying your PR does not actually require it, like I expect it doesn't.

This PR is becoming a bit of a mess, rather than being focused/simple.

Comment thread Dockerfile.cuda-all
curl \
libssl-dev \
- pkg-config
+ pkg-config && \

Contributor

Suggested change
- pkg-config && \
+ pkg-config

Comment thread Dockerfile.cuda-all Outdated
Comment on lines +61 to +62
openmpi-bin \
intel-oneapi-hpc-toolkit && \

Contributor

Why were the contextual comments removed in 444ddc0 ?

Suggested change
- openmpi-bin \
- intel-oneapi-hpc-toolkit && \
+ # Provided for convenience when using the NCCL crate feature:
+ openmpi-bin \
+ # Provided for MKL feature runtime:
+ intel-oneapi-hpc-toolkit

Contributor Author

They break the &&

@polarathene polarathene Aug 4, 2025

Contributor

&& isn't necessary though, you added that (EDIT: It's unrelated, shell scripts don't support comments in this manner when a line is split via \, that only works as a Dockerfile syntax comment, not a shell script comment 🫤).

The comments still provide important context for these specific packages (that are non-essential, only required to support the added features this PR is enabling).

Please restore the comments.
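The behaviour discussed here can be reproduced outside of Docker. The sketch below is only an illustration with made-up names (`count_args`, `two` and `demo` are hypothetical): a `#` comment placed inside a backslash-continued command ends that command, and the following line is then run as a separate command.

```shell
#!/bin/sh
# Prints how many arguments it received.
count_args() { echo "args: $#"; }

# Stand-in so the stray line below is visible instead of "command not found".
two() { echo "'two' ran as a separate command"; }

demo() {
    count_args one \
        # the comment ends the logical line started above
        two
}

demo
```

`count_args` only receives `one`, and `two` is executed on its own. With package names in place of `two`, the shell would try to run the package name as a command, which matches the failure described in this thread.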

Comment thread Dockerfile.cuda-all
HEREDOC

- RUN curl https://sh.rustup.rs -sSf | bash -s -- -y
+ RUN curl https://sh.rustup.rs -sSf | bash -s -- -y && test -f /root/.cargo/bin/rustup

@polarathene polarathene Aug 3, 2025

Contributor

The associated commit for the test addition here is about exit codes? Did you encounter a failure?

If it was due to curl itself failing which I assume was more likely, then it's more about how bash works with the pipe operator by default.

Suggested change
- RUN curl https://sh.rustup.rs -sSf | bash -s -- -y && test -f /root/.cargo/bin/rustup
+ SHELL ["/bin/bash", "-e", "-o", "pipefail", "-c"]
+ RUN curl -fsSL https://sh.rustup.rs | bash -s -- -y

Typically the SHELL instruction would be set earlier in the Dockerfile than this suggestion places it; subsequent RUN instructions then use that command instead of the default /bin/sh -c.

However, since you have brought up Podman in the past: this instruction is not part of the OCI standard spec IIRC, so Podman needs an extra opt-in flag for its build command when building a Dockerfile that uses it (it'll otherwise fail and point to that instruction as the cause). It's a minor concern though, given the improvement to the build process.

Contributor Author

Exactly - turns out the way the dockerfile was previously running, all of these cases could (verified in testing) fail while the next line executes returning 0 and letting the build continue with a broken dependency.

@polarathene polarathene Aug 4, 2025

Contributor

So your approach to fix this was with test + && chains, totally understandable.. but my suggestion with bash -e -o pipefail is far better as it does not need either of those "fixes" and reduces any related maintenance burden when the contents of RUN are changed.

When a command fails with a non-zero exit code, bash -e will fail early. By default a pipeline only reports the exit status of its last command, so a failure earlier in the pipe is lost unless -o pipefail is used.
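The effect of both options can be observed directly. A minimal sketch, assuming `bash` is available on the PATH; the function names are hypothetical and `false | cat` stands in for a failing `curl` piped into a shell.

```shell
#!/bin/sh
# Exit status of a pipeline whose first command fails, with default settings.
status_without_pipefail() {
    if bash -c 'false | cat'; then echo 0; else echo "$?"; fi
}

# Same pipeline with -o pipefail: the failure of `false` is reported.
status_with_pipefail() {
    if bash -o pipefail -c 'false | cat'; then echo 0; else echo "$?"; fi
}

# Without -e the script carries on after a failed command.
output_without_e() { bash -c 'false; echo reached' || true; }

# With -e the script stops at the failed command, so nothing is printed.
output_with_e() { bash -e -c 'false; echo reached' || true; }

echo "pipeline status, default:  $(status_without_pipefail)"
echo "pipeline status, pipefail: $(status_with_pipefail)"
echo "after failure, default:    '$(output_without_e)'"
echo "after failure, -e:         '$(output_with_e)'"
```

This is why a single `SHELL` instruction with `-e -o pipefail` covers the cases that the `&&` chains and the extra `test` were added for.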

NOTE: This suggestion has been raised through a separate PR.

@sempervictus

Contributor Author

I don't know why you added && \ to the end of the package install commands? Revert that.

You shouldn't need to create the nccl symlink either? Please remove after confirmation.

This PR is becoming a bit of a mess, rather than being focused/simple.

Please remember i'm not an LLM, you're making directive statements at another human being immediately after stating that you don't understand what is being done and why. I'm sure you can derive the problems with that context by reversing our roles. When an old 0311 tells you that your interpersonal style is a concern, you might be way into "pissing actual humans off with that approach" territory - take that for whatever you think it's worth.

Every change in that commit is for a fall-through of exit codes observed in testing. If curl doesn't install because apt fails but the lists removal succeeds, then rustup silently fails but the bash invocation succeeds. The comments between the multiline && invocations were breaking the flow, but the same issue of exit codes being swallowed allowed the two lines stating those package names (which are not commands) to fall through.

@polarathene

polarathene commented Aug 4, 2025

Contributor

Please remember i'm not an LLM, you're making directive statements at another human being

I've not inferred you to be an LLM. I am the only participant actively engaging in PR review here for the benefit of the project.

Unfortunately I rarely have the time and energy to adapt my style of communication to better cater to your feelings and preferences (I can try if you are also willing to do the same).

Any negativity you're perceiving is unintentional, I'm just focused on the changes and feedback from a logical POV.

"directive statements" as in review feedback/guidance? The && was not necessary for a build, but the additional caution is fine, just better handled via SHELL.

I raised the symlink concern very early on and it has been consistently ignored. The earlier symlink portion that was present when you opened this PR has since been removed from the Dockerfile, as dynamic linking shouldn't be using unversioned symlinks.

immediately after stating that you don't understand what is being done and why.

I understand the terse intent of the commit message itself, I'd just appreciate more context, something that you've repeatedly lacked and left unanswered multiple times when I've attempted to engage with you.

The reason I asked why is to understand why you are doing it this way. I'm seeking clarification in case there is a misunderstanding on my end, so it is helpful to know what particular problems you were encountering. Regardless, I provided a change suggestion that was more appropriate.


When an old 0311 tells you that your interpersonal style is a concern, you might be way into "pissing actual humans off with that approach" territory - take that for whatever you think it's worth.

I just care about getting things right for the project. If I've communicated in a manner you dislike, I apologize but it goes both ways? 🤷‍♂️

Correct me if I'm mistaken, but I think this is the first time you've communicated to me that you're not happy with how I've engaged in discourse with you? I cannot easily know that prior to being informed if I'm upsetting someone - it's definitely not intentional but I am aware how blunt and direct I can be (in addition to the verbosity that not many have the patience for).

If you have suggestions for how I could improve my communication with you, I'll be happy to accommodate on the basis that you also are willing to make an effort for where I've found your own behaviour to be rude; I've just assumed we're both adults and it's not that big of a deal so long as the PR goes through proper review.


If curl doesn't install because apt fails but the lists removal succeeds, then rustup silently fails but the bash invocation succeeds. The comments between the multiline && invocations were breaking the flow, but the same issue of swallowed exit codes allowed the two lines reporting that those package names are not commands to fall through.
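
To make the swallowed exit code concrete, here is a minimal sketch. It is not taken from the Dockerfile in this PR: `false` and `true` are stand-ins I'm assuming for a failing `apt-get install` and a succeeding cleanup step.

```shell
# Stand-ins only: `false` plays the failed install step and `true` plays
# the list cleanup that succeeds afterwards.

# Without `set -e` the script keeps going after the failure, and its exit
# status is simply that of the last command.
lenient_status=0
sh -c 'false
true' || lenient_status=$?

# With `set -e` the script stops at the first failing command and reports it.
strict_status=0
sh -c 'set -e
false
true' || strict_status=$?

echo "without set -e: $lenient_status"   # prints "without set -e: 0"
echo "with set -e: $strict_status"       # prints "with set -e: 1"
```

A build step that behaves like the first case reports success even though an earlier command failed, which is how the failure can go unnoticed until something later depends on it.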

Thank you.

  1. The use of a bash script via HEREDOC without introducing a SHELL instruction to better handle bailing early is my fault, as I contributed that. I'll open a PR to address that mishap (EDIT: Done).
  2. The contextual comments are also related, but those packages are tied to this PR so that would need to be addressed here. The comments should be restored, but placed prior to the command. In a typical RUN instruction they would have been parsed as Dockerfile comments and thus stripped out before execution, whereas the HereDoc feature sends the whole contained string to SHELL as input.
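
A rough illustration of the second point, under stated assumptions: the script body below is made up for the example (the package names are hypothetical and `echo` stands in for the real install command), but it mimics what the shell sees when a comment sits in the middle of a continued command inside a HereDoc.

```shell
# Hypothetical package names; neither is expected to exist as a command.
script='echo apt-get install hypothetical-pkg-one \
  # comment placed in the middle of a continued command
  hypothetical-pkg-two
echo "still running"'

# The shell joins the first two lines, so the comment ends that command.
# The package name on the third line is then run as a command of its own
# and fails with "not found", yet the script carries on.
output=$(sh -c "$script" 2>/dev/null)
status=$?

printf '%s\n' "$output"
echo "exit status: $status"
```

The second package never reaches the install command, and without an early-bail option the overall status is still that of the final `echo`. Placing the comment before the command avoids the split.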

RageLtMan added 3 commits August 4, 2025 21:08
Update CUDA (base image containers) to 12.9.1.

Drop minimum compute capability requirement to v70 - mistral-rs is
great on older devices which do not support flash attention (in
the same hardware facilities as v8+). The current device CC can
be passed as an arg using:
```
nvidia-smi --query-gpu=compute_cap --format=csv,noheader,nounits | \
sed 's|\.||'
```
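
A small sketch of how that value could be fed into a build. The sample value stands in for real `nvidia-smi` output, and the build arg name `CUDA_COMPUTE_CAP` is an assumption on my part, so check the `ARG` line in the Dockerfile before relying on it:

```shell
# Sample value standing in for the output of
#   nvidia-smi --query-gpu=compute_cap --format=csv,noheader,nounits
sample_cap="7.0"

# Same sed expression as above: drop the dot so "7.0" becomes "70".
cc=$(printf '%s\n' "$sample_cap" | sed 's|\.||')
echo "$cc"   # prints "70"

# Assumed arg name, left commented out so the sketch has no side effects:
# docker build --build-arg CUDA_COMPUTE_CAP="$cc" -f Dockerfile.cuda-all .
```

On hosts with several GPUs, `nvidia-smi` prints one line per device, so the output may need narrowing to a single value first.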

Enable NCCL feature for the CUDA build target. Add OpenMPI and
library symlink for NCCL to runtime.

Enable MKL build feature and deploy dependencies into containers.