Skip to content

Feature/multi gpu#84

Open
theodufort wants to merge 2 commits into
Andyyyy64:mainfrom
theodufort:feature/multi-gpu
Open

Feature/multi gpu#84
theodufort wants to merge 2 commits into
Andyyyy64:mainfrom
theodufort:feature/multi-gpu

Conversation

@theodufort

Copy link
Copy Markdown

Allow simulating multiple GPUs with comma-separated names (RTX 5080,RTX 5060 Ti) or count shorthand (2x RTX 4090, 4x H100). VRAM fit uses conservative pooling — per-GPU framework overhead (~300MB each) and a utilization factor (95% homogeneous, 90% heterogeneous) are applied rather than naively summing all VRAM as one device. Speed estimation uses a conservative flat 30% overhead factor without claiming precision about the interconnect topology (PCIe vs NVLink); multi-GPU throughput is always marked low-confidence.

What

Add multi-GPU support to implement #65

  • Parse --gpu values: comma-separated heterogeneous GPUs and Nx count shorthand
  • Conservative VRAM fit that accounts for per-GPU overhead and heterogeneous split inefficiency
  • Low-confidence speed estimates that avoid faking PCIe/NVLink precision
  • Warnings surfaced for multi-GPU splits and heterogeneous configurations

Why

Users with multiple GPUs (e.g. 2x RTX 3090, mixed RTX 4090 + 3090) need to see which models fit across their combined VRAM. The fit simulation layer comes first; speed modeling is deliberately conservative until interconnect and tensor-split assumptions can be properly validated.

Testing

  • Tests pass (pytest)
  • New tests added (if applicable)
  • [] Tested on real hardware (if hardware-related)

Notes

  • NVLink detection from the initial commit was removed — it incorrectly assumed any NVIDIA GPU with compute capability >= 7.0 has NVLink (consumer GPUs like RTX 4090 do not)
  • vram_available_bytes still reports the raw physical total; the conservative budget is only used internally for fit decisions
  • Per-GPU --vram override is not yet supported for multi-GPU; the error message now says so clearly
  • Speed estimates for multi-GPU are always low-confidence with an explicit note about PCIe/NVLink dependence — this is intentional, not a gap

theodufort and others added 2 commits May 31, 2026 13:40
Allow simulating multiple GPUs with comma-separated names ("RTX 5080,RTX 5060 Ti")
or count shorthand ("2x RTX 4090", "4x H100"). VRAM is pooled across all GPUs for
fit determination. Speed estimation uses a tensor-parallel model where the slowest
GPU is the bottleneck, with inter-GPU communication overhead (PCIe/NVLink) factored in.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The initial multi-GPU commit naively pooled all GPU VRAM as one device
and assumed any NVIDIA GPU with compute capability >= 7.0 has NVLink
(wrong for consumer GPUs like RTX 4090). This commit:

- Applies per-GPU framework overhead (~300MB) and a utilization factor
  (95% homogeneous, 90% heterogeneous) to VRAM fit checks
- Replaces the NVLink/PCIe sync model with a flat 30% overhead factor,
  avoiding false precision about interconnect topology
- Adds warnings for multi-GPU and heterogeneous configurations
- Fixes broken --vram error message grammar

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Andyyyy64

Copy link
Copy Markdown
Owner

Thanks @theodufort, this is a lot of careful work, and the conservative approach to VRAM pooling and speed looks right to me. It conflicts with main now, and since it touches the ranker, compatibility, and performance paths, I'd like it rebased on the latest main before I review it properly. Could you update it? I want to give the fit logic a careful read once it's clean.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants