Feature/multi gpu #84
Open
theodufort wants to merge 2 commits into
Allow simulating multiple GPUs with comma-separated names ("RTX 5080,RTX 5060 Ti")
or count shorthand ("2x RTX 4090", "4x H100"). VRAM is pooled across all GPUs for
fit determination. Speed estimation uses a tensor-parallel model where the slowest
GPU is the bottleneck, with inter-GPU communication overhead (PCIe/NVLink) factored in.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
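For readers following along, the two input forms described in this commit message could be expanded roughly as sketched below. This is an illustration only, not the code in the PR: `parse_gpu_spec` and its return shape are hypothetical, and the sketch accepts only the shorthand exactly as shown above (a count, `x`, a space, then the name).

```python
import re

# Hypothetical pattern for the count shorthand, e.g. "2x RTX 4090".
# Assumption: a space is required after the "x", as in the examples above.
_COUNT_PREFIX = re.compile(r"^(\d+)\s*[xX]\s+(.+)$")


def parse_gpu_spec(spec: str) -> list[str]:
    """Expand a GPU spec string into one GPU name per simulated device."""
    gpus: list[str] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "RTX 4090,"
        match = _COUNT_PREFIX.match(part)
        if match:
            count = int(match.group(1))
            name = match.group(2).strip()
            if count < 1:
                raise ValueError(f"GPU count must be at least 1: {part!r}")
            gpus.extend([name] * count)
        else:
            gpus.append(part)
    if not gpus:
        raise ValueError("no GPU names given")
    return gpus
```

With this sketch, `parse_gpu_spec("2x RTX 4090")` yields two entries named `RTX 4090`, and a comma-separated string yields one entry per name.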
The initial multi-GPU commit naively pooled all GPU VRAM as one device and assumed any NVIDIA GPU with compute capability >= 7.0 has NVLink (wrong for consumer GPUs like RTX 4090). This commit:
- Applies per-GPU framework overhead (~300MB) and a utilization factor (95% homogeneous, 90% heterogeneous) to VRAM fit checks
- Replaces the NVLink/PCIe sync model with a flat 30% overhead factor, avoiding false precision about interconnect topology
- Adds warnings for multi-GPU and heterogeneous configurations
- Fixes broken --vram error message grammar

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
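One way to read the flat 30% overhead factor is sketched below. The commit message does not say how the factor is applied, so everything beyond the 30% figure is an assumption: the sketch takes the slowest GPU as the bottleneck (as in the first commit's tensor-parallel model), scales it by the GPU count, then discounts by 30%. `SpeedEstimate`, `estimate_multi_gpu_speed`, and the `"normal"` confidence label are hypothetical names.

```python
from dataclasses import dataclass

# Flat overhead from the commit message; how it is applied is assumed here.
MULTI_GPU_OVERHEAD = 0.30


@dataclass
class SpeedEstimate:
    tokens_per_second: float
    confidence: str  # hypothetical labels: "normal" or "low"


def estimate_multi_gpu_speed(per_gpu_tok_s: list[float]) -> SpeedEstimate:
    """Estimate combined throughput from per-GPU single-device estimates."""
    if not per_gpu_tok_s:
        raise ValueError("need at least one GPU speed estimate")
    if len(per_gpu_tok_s) == 1:
        # Single GPU: nothing to discount in this sketch.
        return SpeedEstimate(per_gpu_tok_s[0], "normal")
    # Assumption: the slowest GPU bounds every GPU's contribution.
    ideal = min(per_gpu_tok_s) * len(per_gpu_tok_s)
    # Multi-GPU throughput is always reported as low-confidence.
    return SpeedEstimate(ideal * (1.0 - MULTI_GPU_OVERHEAD), "low")
```

Under these assumptions, adding a second GPU never reports more than 1.4x the slower card's single-GPU estimate, which keeps the figure on the conservative side.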
Owner
Thanks @theodufort, this is a lot of careful work, and the conservative approach to VRAM pooling and speed looks right to me. It conflicts with main now, and since it touches the ranker, compatibility, and performance paths, I'd like it rebased on the latest main before I review it properly. Could you update it? I want to give the fit logic a careful read once it's clean.
Allow simulating multiple GPUs with comma-separated names ("RTX 5080,RTX 5060 Ti") or count shorthand ("2x RTX 4090", "4x H100"). VRAM fit uses conservative pooling: per-GPU framework overhead (~300MB each) and a utilization factor (95% homogeneous, 90% heterogeneous) are applied rather than naively summing all VRAM as one device. Speed estimation applies a flat 30% overhead factor and makes no claim about interconnect topology (PCIe vs NVLink); multi-GPU throughput is always marked low-confidence.
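As a rough illustration of the conservative pooling described above (not the PR's actual code), the fit check might look like the sketch below. The function names are hypothetical, and two details are assumptions: overhead is subtracted per GPU before the utilization factor is applied, and a configuration counts as heterogeneous when the GPU names differ.

```python
# Figures taken from the description; their order of application is assumed.
FRAMEWORK_OVERHEAD_GB = 0.3        # ~300MB per GPU
UTILIZATION_HOMOGENEOUS = 0.95
UTILIZATION_HETEROGENEOUS = 0.90


def usable_pooled_vram_gb(gpus: list[tuple[str, float]]) -> float:
    """Return conservative usable VRAM for a list of (name, vram_gb) pairs."""
    if not gpus:
        raise ValueError("need at least one GPU")
    # Subtract the framework overhead on every GPU, never going below zero.
    after_overhead = sum(max(vram - FRAMEWORK_OVERHEAD_GB, 0.0) for _, vram in gpus)
    # Assumption: identical names mean a homogeneous configuration.
    homogeneous = len({name for name, _ in gpus}) == 1
    factor = UTILIZATION_HOMOGENEOUS if homogeneous else UTILIZATION_HETEROGENEOUS
    return after_overhead * factor


def fits(model_vram_gb: float, gpus: list[tuple[str, float]]) -> bool:
    """True if the model's VRAM requirement fits in the conservative pool."""
    return model_vram_gb <= usable_pooled_vram_gb(gpus)
```

For two 24 GB cards of the same model this gives (24 - 0.3) x 2 x 0.95, about 45.0 GB usable rather than the naive 48 GB; a mixed pair of 24 GB cards gives about 42.7 GB.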
What
Add multi-GPU support to implement #65
Why
Users with multiple GPUs (e.g. 2x RTX 3090, mixed RTX 4090 + 3090) need to see which models fit across their combined VRAM. The fit simulation layer comes first; speed modeling is deliberately conservative until interconnect and tensor-split assumptions can be properly validated.
Testing
pytest)
Notes