Docker: CUDA 12.4->12.9, enable NNCL, support CCv7#1468
Conversation
WalkthroughThe Dockerfile for the CUDA build environment was updated to use CUDA version 12.9.0 instead of 12.4.1, with a lowered compute capability argument from 80 to 70. The build features were expanded to include Changes
Poem
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms (8)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
Code Metrics Report=============================================================================== Language Files Lines Code Comments Blanks =============================================================================== C Header 3 63 54 0 9 CSS 1 473 408 14 51 Dockerfile 1 39 22 8 9 HTML 1 78 64 5 9 JavaScript 7 1397 1068 180 149 JSON 22 410 407 0 3 Makefile 1 6 5 0 1 Python 102 5660 4631 298 731 Shell 1 63 26 18 19 Plain Text 3 3723 0 2413 1310 TOML 23 866 797 11 58 YAML 2 21 19 2 0 ------------------------------------------------------------------------------- Jupyter Notebooks 3 0 0 0 0 |- Markdown 2 77 32 31 14 |- Python 2 205 178 1 26 (Total) 282 210 32 40 ------------------------------------------------------------------------------- Markdown 74 6981 0 5227 1754 |- BASH 19 299 260 24 15 |- JSON 11 523 523 0 0 |- Python 14 521 434 35 52 |- Rust 32 1320 1108 36 176 |- TOML 2 75 63 0 12 (Total) 9719 2388 5322 2009 ------------------------------------------------------------------------------- Rust 421 156096 137867 3983 14246 |- Markdown 200 4347 285 3497 565 (Total) 160443 138152 7480 14811 =============================================================================== Total 665 175876 145368 12159 18349 =============================================================================== |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
Dockerfile.cuda-all (1)
29-29: Lower minimum compute capability to 70
Dropping the CC requirement from 80 to 70 broadens device support. Please update any documentation (e.g., README, CI matrices) to reflect CC ≥ 7.0 as the new minimum.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
Dockerfile.cuda-all(2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (9)
- GitHub Check: Docs
- GitHub Check: Clippy
- GitHub Check: Check (macOS-latest, stable)
- GitHub Check: Check (ubuntu-latest, stable)
- GitHub Check: Test Suite (ubuntu-latest, stable)
- GitHub Check: Check (windows-latest, stable)
- GitHub Check: Test Suite (windows-latest, stable)
- GitHub Check: Test Suite (macOS-latest, stable)
- GitHub Check: comment
🔇 Additional comments (2)
Dockerfile.cuda-all (2)
3-3: Upgrade CUDA builder image to 12.9.0
The base image has been bumped tonvidia/cuda:12.9.0-cudnn-devel-ubuntu22.04. Ensure that the dependent candle repo PR (#83) or an equivalent update has been merged so that CUDA 12.9 is fully supported across your toolchain.
36-36: Upgrade CUDA runtime image to 12.9.0
The runtime image was updated tonvidia/cuda:12.9.0-cudnn-runtime-ubuntu22.04. Confirm it exists and provides the necessary CUDA/CUDNN (and NCCL, if used at runtime) libraries.
55d3b8d to
7315e27
Compare
|
Tested working on V100s w/ NCCL distribution. It seems a bit VRAM hungry |
Could you please clarify? Forward compatibilityBy default PTX of an earlier CC should still be present to provide forward compatibility support of newer GPUs. Depending on the GPU, CUDA 12.8 or 12.9 libs should only matter when linking to libs to use API calls that have either as a minimum requirement to build. Otherwise AFAIK when using older virtual arch via PTX, the CUDA version should only matter on the runtime side? CUDA version should be unrelated to Compute Capability, other than linked libs containing ELF/PTX kernels for the newer GPU archs, but without static linking none of that should matter at build time (the user can use a newer runtime to get their So I'm not sure if updating The other scenario is with That won't be relevant either however with the default images built and published, unless CI is also updated accordingly, users would need to perform custom builds themselves with ARGs. Have you considered adding additional ARGs for the build/runtime base images instead? For custom builds that'd make bumping the CUDA version for either quite easy.
I don't have access to the newer GPU hardware to try, but I'd love to see a performance comparison of the CC version making a difference for a CUDA kernel that doesn't explicitly leverage newer features or data types. Reference
$ docker run --rm -it nvcr.io/nvidia/cuda:12.9.0-cudnn-devel-ubuntu22.04
# CC 12.0 PTX:
$ cuobjdump --list-ptx /usr/local/cuda/lib64/libcublas_static.a | grep -oE '\.sm_.*\.' | sort -u
.sm_120.
# Real arch cubins embedded:
$ cuobjdump --list-elf /usr/local/cuda/lib64/libcublas_static.a | grep -oE '\.sm_.*\.' | sort -u
.sm_100.
.sm_120.
.sm_50.
.sm_50a.
.sm_60.
.sm_60a.
.sm_61.
.sm_61a.
.sm_70.
.sm_75.
.sm_80.
.sm_86.
.sm_90.These are available for CUDA 12.8 (12.9 is important to avoid some issues from 12.8 with newer GPUs $ docker run --rm -it fedora:41
$ dnf config-manager addrepo --from-repofile https://developer.download.nvidia.com/compute/cuda/repos/fedora41/x86_64/cuda-fedora41.repo
# NOTE: If building for CC 100/120+ prefer CUDA 12.9
$ dnf install cuda-nvcc-12-8
$ /usr/local/cuda-12.8/bin/nvcc --list-gpu-arch
compute_50
compute_52
compute_53
compute_60
compute_61
compute_62
compute_70
compute_72
compute_75
compute_80
compute_86
compute_87
compute_89
compute_90
compute_100
compute_101
compute_120 |
polarathene
left a comment
There was a problem hiding this comment.
- Revert changing the default/fallback
CUDA_COMPUTE_CAP - If OpenMPI is complimentary in the runtime image, clarify that via comment. Documenting an example for usage with the NCCL feature may be beneficial to users.
- Add better contextual comment regarding symlink requirement only applying to runtime image. Use a relative symlink like devel.
- Drop
LD_LIBRARY_PATHaddition, unless it's provably required, then provide context as to what requires it.
| # Rayon threads are limited to minimize memory requirements in CI, avoiding OOM | ||
| # Rust threads are increased with a nightly feature for faster compilation (single-threaded by default) | ||
| ARG CUDA_COMPUTE_CAP=80 | ||
| ARG CUDA_COMPUTE_CAP=70 |
There was a problem hiding this comment.
👎
This is just the default ARG, what is the motivation to default to a lower CC by default?
Below CC 8.0 lacks support for BF16 for example. Defaults should strike a balance that suits a wider audience without lowering optimizations to support hardware that is approx a decade old? (just like defaulting to hardware that is too new sacrifices compatibility for performance)
The default should not be lowered here IMO since a custom build can be done (and in this case you can have it built with CI to publish the image), so keep it at 80.
There was a problem hiding this comment.
I recently responded to a prior discussion you were in, where I referenced another project that documents Turing (CC 7.5) as the min supported Compute Capability for building.
CUDA release notes also documents a deprecation notice of Maxwell to Volta (CC 5.0 to 7.2) being dropped from CUDA 13 onwards.
Supporting CC 7.0 at that point will require additional maintenance/ARGs, or holding the image back from migrating to CUDA 13 when it's suitable to upgrade to. Using ARGs for the base image can help, where CI matrix can adjust support for different base images.
There was a problem hiding this comment.
This is still cuda 12, so 7 is supported. When making all of the changes to 13 this should be bumped. Meantime, there are more systems supporting 7 as its a subset of 10 (current). This does not enable any features unsupported by modern hardware to cause incompatibility. Tying to broaden adoption - big world, lots of folks w older fare and academia isn't exactly brimming with cash to buy b300s.
There was a problem hiding this comment.
Sorry? Are you're not understanding my feedback,or how a Dockerfile works?
There is absolutely no need to change the ARG here. It will not magically produce official images with that capability, the only convenience you get is when manually building the image you can skip --build-arg CUDA_COMPUTE_CAP=70.
There is absolutely no need for that minor convenience for users on hardware that's approx a decade old. Not when it opts out of performance benefits from higher CC for the wider audience of users (as a default).
A default is for the broader audience and what suits them best, not the oldest hardware possible to support at the expense of the majority of users.
Leave it at 80 and change it for your own personal builds. If you want to justify 70 as a default show how many references on this project has users expressing frustration from not knowing how to build the image with their CC.
If instead you just want official builds for CC 7.0, then instead adjust the CI workflow that builds the variants and publishes to GHCR. Your change here is the wrong place for such.
If you need an easy example of CC support, data types are a common one. Take FP16, if you build with CC 5.3 or newer you can use it. Anything older and the build would fail unless there's an eexplicit fallback in the kernel.
As a default you are intentionally dismissing the more optimal features for the bulk of cuda devices supported to avoid providing an extra arg when extra compatibility is needed. That is not how defaults should be chosen.
There was a problem hiding this comment.
I think i'm missing the part where the minimum compute capability defined limits use of facilities available in newer hardware... This is intended to broaden the range of hardware on which the container will run correctly if built with defaults. If the parameter prevents the resulting code from leveraging newer hardware correctly, then it makes sense to pull the change and probably wire-in a resolver for the users' environment to detect that at build-time if not provided.
https://developer.nvidia.com/blog/nvidia-blackwell-and-nvidia-cuda-12-9-introduce-family-specific-architecture-features/ for one - nvidia is stabilizing drivers and runtimes around the blackwell gear on the fly. The mezzanine change alone from a PCI complex to an IB interface ( Re |
|
@polarathene thanks for the commits. |
I don't know your experience in this area (Docker builds or CUDA builds), but here's the gist of it:
Your runtime container should provide a CUDA driver (
If you need more clarity on either of those points, please go over the full response. Original reply followsWas the TLDR checklist not clear? That should be fairly clear, I will try clarify the compute cap concern for you better...
Honestly, I'm not wanting to repeat myself because my feedback is too vague, or too verbose. If you want to push for making a change you don't understand, then please find the time to understand it.
I am not familiar with this project's test suite. The bulk of my feedback is discouraging your change to the default CC, defaults should be tailored for the majority.
Regarding the CUDA version bump:
|
As mentioned in previous reply. Those updates should occur on your container host which has the kernel driver and CUDA driver (user-space, During compilation in the builder image, you need to be using newer APIs from the driver/runtime API for any difference there, which should call into the mentioned container host deps during runtime. Only other relevant difference is the CC target, which for Blackwell would be leveraging PTX of a lower CC currently. Bumping to a higher version of CUDA (and thus So as far as stabilization is going, this bump shouldn't be necessary AFAIK. You should get that implicitly (image agnostic). Only when you're running with kernels specialized to Blackwell instructions (CC adjusted for PTX/cubin at build) should you have build / runtime concerns where it's relevant. |
Thanks for the clarification. The driver updates are of course on the container host as they're in ring0 past the boundary of a namespaced runtime; the commensurate CUDA updates however are what i am hoping to expose by presenting the 12.9 build and runtime environments. To the point about bumping I suppose my concern here is "do you see this PR introducing any problems for current use-cases or users who would get these defaults?" |
Sure, but like I said:
CC 8.0 can build
This is why we have the default set at CC 8.0, but CI will build All I'm saying is that leaving the ARG at the default CC 8.0 makes more sense. Supporting lower CC should remain opt-in, and if you'd like official images pre-built for Volta / CC 7.0, then the correct change for that is to update the CI instead to build for that CC as well. You will likely also want to have CI build for Blackwell CC 10.0 / CC 12.0. These CC also have some additional variations to be aware of if you're wanting to tune for performance or compatibility.
As stated, when it comes to the CUDA API, there should be no concerns here requiring CUDA 12.9 unless using newer API calls that require it. CUDA 12 series is forward compatible, the shared lib linked should look like The only other difference is the kernels embedded in libs like CuBLAS. Where newer PTX or Cubins for a GPU arch like Blackwell should result in improved performance. CUDA 12.8 had some bugs when building PTX for Blackwell, but PTX for earlier CC should have been fine as it wasn't using newer instructions. I could be mistaken here, but without CUDA 12.8+ you could not build cubins for Blackwell, so the earlier PTX CC built would be reliant upon runtime having CUDA 12.8+ to JIT compile that PTX to cubin for Blackwell, and potentially with CUDA 12.8 that was buggy too. If that's the case, then yes CUDA 12.9 at runtime would make a difference when no pre-built CUDA 12.9 Sorry for all the repetition, I feel CUDA compatibility is a bit of a messy topic to explain well when it comes to nuances like these 😓 |
No, for most users I think it'll be fine. Anyone doing local builds relying on defaults may experience a perf regression. I see no issues for anyone using the images the project publishes to GHCR. Those are published with the CC as the tag. I just think you should leave the CC as-is, as I don't think anyone is complaining about the default CC for builds (specific to image builds). AFAIK, raising the CUDA version for build and runtime stages of the Dockerfile should also be fine as the major version is the same. Even if using the binary outside of the container, I think CUDA 12 is still fairly backwards compatible, but for the image published, the only thing that really changes there is I could do a build to verify on my 4060 with CUDA 12.4 if that'd help? UPDATE: There may be some risk as described here: chelsea0x3b/cudarc#253 (comment) I will need to make time to try reproduce. |
Thanks for the clarification - performance regressions/improvements are definitely something i'd love to test but i think we would need some sort of opt-in/explicit submission performance registry/scoreboard for the project to make that idea viable (i have access to far more hardware than most but i can't really be running benchmarks on client kit without written authorization/them writing off the the GPU hours involved). In terms of the default published images, that's really what i'm shooting for here - broaden adoption. There is a huge amount of pre-bf16 metal out there and most people spinning up a container to test a runtime will just move on if it doesnt work out the gate. I think we'll get more adoption by having a broadly accepted image with clear instructions for how to rebuild for local kit (some detection along the lines of native architecture optimization/detection in conventional compilers should ease that process) than if users turn away at first glance. In regards to chelsea0x3b/cudarc#253 (comment) - mixing and matching build/runtime should be possible with stable ABI but we've seen an immense amount of issues over the years with just source-built GPGPU software stacks (OFED, NV, NCCL, UCX, GDRCOPY, all that fun stuff) having access to each others source files at build time to say "dynamic linking works 100% correctly across the CPU/GPU ABIs." The move to DOCA and the IB-based interconnects we're seeing on the B series do not give me the warm and fuzzy for this concern in the near term :-(. I see mention of WSL there as well - my win-fu is a bit out of date past figuring out how to bypass EDR and get whatever flag's the objective, but the OS-specific driver code is more or less a framing shim for the actual driver in the device firmware and "unless all data types and structures line up perfectly between windows and linux implementations" (they do not), there will be ABI mismatch like signedness, precision, or casting outcomes which messes things up even more. I dont mean to come off as "debbie downer" on this stuff - i think the dynamic linking approach is great and the way software should work in a perfect world; i've just been dealing with this video card driver binary->code->back stuff since the 90s and "quality improvement has been incremental while complexity growth exponential." |
Yeah it might be a bit more work if mistral-bench isn't sufficient. The last time I was involved in this project it was only focused on LLMs (just prior to LLMs with vision support), there is quite a few permutations to go over for that 😓 It might be easier with Burn instead of Easiest test we can do is the impact of CC version on CUDA kernel compilation from builds, and additionally compare runtime CUDA libs linked (CuBLAS, CuDNN, etc) in the same way ideally. mistral-bench might be sufficient for that and CC 7.0 vs CC 12.0 is perhaps wide enough of a gap that you might get a notable performance difference, but in both cases Blackwell should be able to run the two builds.
Oh, okay well that's quite simple but like I mentioned adjusting the default for mistral.rs/.github/workflows/build_cuda_all.yaml Lines 13 to 17 in 4608202 I'm presently still getting familiar with building/optimizing images for CUDA and related CI publishing workflow for that. This project is one of a few I'll be contributing PRs to once I've become confident enough in that area, but I think some upstream crates might need some PRs raised from what I've noticed thus far in my research on the topic, or at least for more optimal builds at least. I do want to look into this concern about bumping CUDA to 12.9 however, I should have an answer for that in the next few days when time allows 👍
I'm not disregarding that. But like I've cited previously, some CUDA kernels may not have fallback support. AFAIK that will result in a compile time failure, the other compatibility issue can be with CUDA toolkit support (If you go as far back as CC 5.0 / Maxwell, that is CUDA 12.0+). Personally when I publish builds, I try to not be wasteful publishing support for builds that have no (or minimal) demand. That's just a waste of resources/energy (even if it doesn't cost me directly). I tend to defer users to documentation for custom builds until enough actual user demand is expressed to warrant broader official builds. Presently, the I'm not sure how best to approach this for users UX when it comes to selecting a CUDA published image, while I think most can select a CUDA versioned tag, what actually matters is the image has built support for their CC or at least provides PTX with CC that is compatible. If we had a default For the most part though, they just need to check their CC and choose a CC tag that matches their GPU CC, or the next lowest if that's not available. Should be easy to document (verified unaffected by using CUDA compat libs): $ nvidia-smi --query-gpu=compute_cap --format=csv,noheader
8.9
I haven't looked at this for context outside of Docker, but I do know the Docker related docs are lacking and would like to see that improved too. When building outside of a container, there is already some detection (similar to the As you've stated about broader compatibility, containers often are built without native CPU instruction optimizations since that can result in incompatibility on other systems. With CC version though it seems it may have a worthwhile impact to ensure the image is built with an appropriate CC, so that should be documented (could potentially take that
I can't comment much on that. I am aware that compatibility was improved from CUDA 12, and that this will be further addressed with CUDA 13 (there's a blog post detailing it and how some changes might be breaking regarding defaults adjusted with Some issues I've come across are misconfiguration at user/developer ends from inexperience, which is fair enough (just look at the discussion we're having on the topic ha). I've learned quite a bit myself in recent weeks despite the 20 years I've had as a dev 😅
I need reproducible examples for such personally. I'm not running into any WSL2 related issues for linux compatibility or with the CUDA driver thus far. The way a container is provided access to the GPU was something I was more concerned about, and how that played with CUDA version of runtime images, but I think I have a handle on that now. |
I am not sure what you're trying to say here.
Additional pros/cons of each linking methodDynamic Loading:
Dynamic Linking:
Static Linking:
Slight disadvantage for dynamic linking/loading vs static linking:
I think that covers the choices well and I'm leaning towards dynamic linking. At least for personal |
Apologies, tapped wrong button :-\. The reason i harp on application binary interfaces on NV's side (specifically for Linux anyway) lies in the way they "make their GPL condom" - they hijack kernel functions in their modules hitting a few AT&K items along the way. Exactly the sort of nonsense Rust tries to intrinsically avoid (despite not actually building CFI checks yet when i last checked through LLVM's underlying logic). Even though the higher-level libraries may be well-designed, have proper interfaces, and well-versioned API/ABI - further down the call stack it all calls into "some questionable code" generally resulting in the smallest amount of divergence being between two codebases that were "built together" (575/12.9, 570/12.8, etc). That said, i dont think anyone's going to stub their toe on 575 and 12.8 or 570 and 12.9 but i wouldn't suggest running 3xx series drivers with this :-) |
|
The issue you link is for the kernel driver (
Slight observation regarding runtime libs
|
There was a problem hiding this comment.
As the Dockerfile has changed a bit, and the discussion here is rather long-winded, you may want to progress with a new PR(s)?
Current feedback:
- The
ncclfeature addition:- The
ncclsymlink seems redundant? Remove it. - Adding the
ncclfeature to the build image +openmpi-binpackage to the runtime image might be better handled in a separate PR?
- The
- Bumping the CUDA version for:
- The build image is lacking evidence of a meaningful change
- Introduces a regression? (PTX cannot be used by an earlier CUDA runtime, even if CC for that PTX were compatible)
- EDIT: Relevance would be for Blackwell when building kernels for CC 10.0 + 12.0, instead of relying on PTX from earlier build.
- The runtime image should be ok. This will provide official CUDA libs linked with Blackwell cubins + PTX for potentially better performance. You may prefer to use
12.9.1here as Nvidia doesn't seem to publish a12.9(major + minor only) tag.
- The build image is lacking evidence of a meaningful change
- Both build and runtime images could bump the base distro from Ubuntu 22.04 to 24.04. That should be fine given the runtime environment would be compatible, it's only the GPU driver that's a concern.
| # Only the `devel` builder image provides symlinks, restore the `libnccl.so` symlink: | ||
| RUN ln -s libnccl.so.2 /usr/lib/x86_64-linux-gnu/libnccl.so |
There was a problem hiding this comment.
I don't think this is necessary, could you please verify on your end and report any failure details if it is?
candle-corehandlescudarcfeature enablement forcudnn/nccl.candle-corefork presently trackscudarc@0.13(as you're aware), and it opts into thedynamic-linkingfeature. I'm a bit confused here since neithercudnnornccllibs are linked to the binary built?
ldd /usr/local/bin/mistralrs-server
linux-vdso.so.1 (0x00007fff58725000)
libcudart.so.12 => /usr/local/cuda/lib64/libcudart.so.12 (0x00007e24e2600000)
libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007e24e2382000)
libssl.so.3 => /lib/x86_64-linux-gnu/libssl.so.3 (0x00007e24e22d8000)
libcrypto.so.3 => /lib/x86_64-linux-gnu/libcrypto.so.3 (0x00007e24e1dc5000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007e24e1d97000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007e24e1cae000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007e24e1a9c000)
/lib64/ld-linux-x86-64.so.2 (0x00007e24e9db5000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007e24e28b5000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007e24e1a97000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007e24e1a92000)
The unversioned symlinks should only be necessary for the build stage, as the linker searches for these, and then from the symlinked file resolved, retrieves the DT_SONAME entry for what to link against at runtime.
In fact the whole earlier RUN for creating unversioned symlinks in the runtime image is unnecessary. I can only assume this was the case with dynamic-loading in the past not knowing any better. Typically these unversioned lib symlinks don't exist on production systems.
| # syntax=docker/dockerfile:1 | ||
|
|
||
| FROM nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04 AS builder | ||
| FROM nvidia/cuda:12.9.0-cudnn-devel-ubuntu22.04 AS builder |
There was a problem hiding this comment.
TL;DR:
- Raising this minor version risks regressions for runtime usage when the host CUDA driver is older than the CUDA used for build.
- I don't think we should do this without evidence that can be reproduced to justify it.
I have since learned this makes PTX compiled during build incompatible at runtime with any CUDA version older than used with nvcc.
That is, if you build with CUDA 12.9, but your Blackwell RTX 5xxx GPU at runtime is using CUDA 12.8 libcuda.so.1, then it cannot use the PTX built from CUDA 12.9, even when the CC was the same CC as your GPU or lower.
If your build environment was an older or equal version of CUDA as your runtime, then the PTX would be compatible. As such by raising the build version too high, you are introducing a regression for PTX usage that should otherwise be viable, without any evidence to demonstrate a meaningful benefit. From what I've seen this type of problem contributes to less than obvious bug reports to projects where troubleshooting is problematic due to lack of awareness about this type of compatibility concern.
For reference, when using the runtime image CUDA 12.9 compat package libcuda.so.1 via LD_LIBRARY_PATH=/usr/local/cuda/compat, while this has a noticeable effect with nvidia-smi --version on the CUDA version, on my system (RTX 4060 / sm_89 + CUDA 12.4) the CUDA programs I've tried (basic kernels with just nvcc or cudarc), the compat lib prevents the program running stating no CUDA device detected.
This change will thus only allow major version forward compatibility within the same cubin CC built. However since bindgen_cuda presently lacks building for more than one CC, projects like mistral.rs and TEI are building/publishing multiple bins as a workaround, which both currently can leverage the PTX for newer GPU support.
| # Rayon threads are limited to minimize memory requirements in CI, avoiding OOM | ||
| # Rust threads are increased with a nightly feature for faster compilation (single-threaded by default) | ||
| ARG CUDA_COMPUTE_CAP=80 | ||
| ARG CUDA_COMPUTE_CAP=70 |
There was a problem hiding this comment.
As discussed earlier, this change seems unnecessary. Resolve it via CI + documentation instead.
Just add 70 to this list to get an image build for CC 7.0:
|
From the recent commit message for better visibility:
|
|
@EricLBuehler should we include |
ce7caf3 to
4d45bfd
Compare
|
With the recent 12.9 enablement, this should now build correctly to the local target using @polarathene - on the CC70/80/100 default value aspect of this: short of doing matrix builds to see which ones can still run which models on which HW, is there a good mechanical way (even if we have to have cited references by some research-prompted model to the NV docs/CUDA sources for posterity) to figure out which default will give us the broadest coverage without forcing data types or associated features to be inlined where they might crash on an older system without that capability? |
Most projects in this space that I've come across seem to have CC 7.5 as the lowest or even CC 8.0. It can be as low as you can support, but that introduces regressions in performance AFAIK when the hardware is more capable. I don't have the hardware to compare personally, but I also don't see an issue with those on older or newer hardware to need a default build support when all they need to do is adjust the image tag? The CI already has matrix builds to build for different CC and tag them. Just like how we have many distros moving forward with building CPU packages for x86_64 targeting the v3 level as a default, but not the newer v4. |
There was a problem hiding this comment.
I don't know why you added Adding && \ to the end of the package install commands?&& after apt-get install is unnecessary, what actual problem are you trying to solve? Please revert that, there is a better way to resolve your concern (FWIW, if you were to bring back &&, you weren't covering apt-get update potentially failing either, and if you did we lose most benefits from HereDoc usage, vs fixing this properly by adding SHELL).
You shouldn't need to create the nccl symlink either? Please remove that addition after verifying your PR does not actually require it, like I expect it doesn't.
This PR is becoming a bit of a mess, rather than being focused/simple.
| curl \ | ||
| libssl-dev \ | ||
| pkg-config | ||
| pkg-config && \ |
There was a problem hiding this comment.
| pkg-config && \ | |
| pkg-config |
| openmpi-bin \ | ||
| intel-oneapi-hpc-toolkit && \ |
There was a problem hiding this comment.
Why were the contextual comments removed in 444ddc0 ?
| openmpi-bin \ | |
| intel-oneapi-hpc-toolkit && \ | |
| # Provided for convenience when using the NCCL crate feature: | |
| openmpi-bin \ | |
| # Provided for MKL feature runtime: | |
| intel-oneapi-hpc-toolkit |
There was a problem hiding this comment.
They break the &&
There was a problem hiding this comment.
&& isn't necessary though, you added that (EDIT: It's unrelated, shell scripts don't support comments in this manner when a line is split via \, that only works as a Dockerfile syntax comment, not a shell script comment 🫤).
The comments still provide important context for these specific packages (that are non-essential, only required to support the added features this PR is enabling).
Please restore the comments.
| HEREDOC | ||
|
|
||
| RUN curl https://sh.rustup.rs -sSf | bash -s -- -y | ||
| RUN curl https://sh.rustup.rs -sSf | bash -s -- -y && test -f /root/.cargo/bin/rustup |
There was a problem hiding this comment.
The associated commit for the test addition here is about exit codes? Did you encounter a failure?
If it was due to curl itself failing which I assume was more likely, then it's more about how bash works with the pipe operator by default.
| RUN curl https://sh.rustup.rs -sSf | bash -s -- -y && test -f /root/.cargo/bin/rustup | |
| SHELL ["/bin/bash", "-e", "-o", "pipefail", "-c"] | |
| RUN curl -fsSL https://sh.rustup.rs | bash -s -- -y |
Typically the SHELL instruction would be set earlier than this suggestion places it, and it will instead default RUN instructions to use that command instead of the default /bin/sh -c command.
However since you have brought up podman in the past, for those that would build a Dockerfile with this instruction it is not part of OCI standard spec IIRC, so Podman needs an extra opt-in flag for it's build command (it'll otherwise fail and point that problem as the failure cause). Minor concern though given the improvement to the build process.
There was a problem hiding this comment.
Exactly - turns out the way the dockerfile was previously running, all of these cases could (verified in testing) fail while the next line executes returning 0 and letting the build continue with a broken dependency.
There was a problem hiding this comment.
So your approach to fix this was with test + && chains, totally understandable.. but my suggestion with bash -e -o pipefail is far better as it does not need either of those "fixes" and reduces any related maintenance burden when the contents of RUN are changed.
When a command fails with a non-zero exit code bash -e will fail early, while usage of the pipe operator by default doesn't carry the piped commands exit status unless using -o pipefail.
NOTE: This suggestion has been raised through a separate PR.
Please remember i'm not an LLM, you're making directive statements at another human being immediately after stating that you don't understand what is being done and why. I'm sure you can derive the problems with that context by reversing our roles. When an old 0311 tells you that your interpersonal style is a concern, you might be way into "pissing actual humans off with that approach" territory - take that for whatever you think it's worth. Every change in that commit is for a fall-through of exit codes observed in testing. If curl doesnt install because apt fails but the lists removal succeeds, then rustup silently fails but bash invocation succeeds. The comments between the multiline |
I've not inferred you to be an LLM. I am the only participant actively engaging in PR review here for the benefit of the project. Unfortunately I rarely have the time and energy to adapt my style of communication to better cater to your feelings and preferences (I can try if you are also willing to do the same). Any negativity you're perceiving is unintentional, I'm just focused on the changes and feedback from a logical POV. "directive statements" as in review feedback/guidance? The The symlink concern I raised very early on and has been consistently ignored. The symlink portion prior to it when you opened this PR has already been removed from the
I understand the terse intent of the commit message itself, I'd just appreciate more context, something that you've repeatedly lacked and left unanswered multiple times when I've attempted to engage with you. The reason I asked why, is for why are you doing it this way? In which I seek more clarification in case there is a misunderstanding on my end, so it is helpful to know what particular problems you were encountering. Regardless I provided a change suggestion that was more appropriate.
I just care about getting things right for the project. If I've communicated in a manner you dislike, I apologize but it goes both ways? 🤷♂️ Correct me if I'm mistaken, but I think this is the first time you've communicated to me that you're not happy with how I've engaged in discourse with you? I cannot easily know that prior to being informed if I'm upsetting someone - it's definitely not intentional but I am aware how blunt and direct I can be (in addition to the verbosity that not many have the patience for). If you have suggestions for how I could improve my communication with you, I'll be happy to accommodate on the basis that you also are willing to make an effort for where I've found your own behaviour to be rude; I've just assumed we're both adults and it's not that big of a deal so long as the PR goes through proper review.
Thank you.
|
Update CUDA (base image containers) to 12.9.1. Drop minimum compute capability requirement to v70 - mistral-rs is great on older devices which do not support flash attention (in the same hardware facilities as v8+). The current device CC can be passed as an arg using: ``` nvidia-smi --query-gpu=compute_cap --format=csv,noheader,nounits | \ sed 's|\.||' ``` Enable NCCL feature for the CUDA build target. Add OpenMPI and library symlink for NCCL to runtime. Enable MKL build feature and deploy dependencies into containers.
444ddc0 to
7a6cc80
Compare
Update CUDA (base image containers) to 12.9.
Drop minimum compute capability requirement to v7 - mistral-rs is great on older devices which do not support flash attention (in the same hardware facilities as v8+).
Enable NCCL feature for the CUDA build target.
Notes:
Depends on EricLBuehler/candle#83 or
equivalent change to support 12.9 (max supported right now is 12.8)
This should help to get better mileage for H/B series users
especially on Open drivers
Summary by CodeRabbit