Skip to content

fix: SC CPU-offload support + single-GPU max_memory headroom#2

Open
heroarmor wants to merge 1 commit into
feat/per-row-halve-bitrev-defaultsfrom
fix/sc-cpu-offload-and-single-gpu-maxmem
Open

fix: SC CPU-offload support + single-GPU max_memory headroom#2
heroarmor wants to merge 1 commit into
feat/per-row-halve-bitrev-defaultsfrom
fix/sc-cpu-offload-and-single-gpu-maxmem

Conversation

@heroarmor
Copy link
Copy Markdown
Collaborator

Problem

Running 70B+8B spec-decode without two full GPUs hit two walls:

  1. CPU offload crashed — when replace_linears_with_sc (and the LM-head swap) replace an nn.Linear with SCLinear, accelerate's offload hook (_hf_hook) was dropped. The offloaded weight stayed a meta placeholder and the SC forward crashed: RuntimeError: Tensor on device meta.
  2. Single-GPU OOM — both models packed the card, leaving no room for the SC kernels' scratch buffers.

Fix

  1. sc_model.py — transfer the existing _hf_hook to the replacement module (decoder-layer linears and the LM head), so offloaded meta weights are materialized at forward.
  2. loader.py + run_spec_decode.py — optional per-model max_memory via TARGET_MAX_GPU_GIB / DRAFT_MAX_GPU_GIB / OFFLOAD_CPU_GIB, so each model leaves GPU headroom for SC scratch. Unset → unchanged auto-dispatch.

Validation

70B target + 8B draft spec-decode now completes on a single GPU with CPU offload (70B→50 GiB, 8B→18 GiB), producing bit-identical output to the 2-GPU sharded run (SC accept 0.964, 3.86 tok/step).

Depends on the device-guard fix in scmp_kernels (CrucibleComputingGroup/scmp_kernels#20) for the multi-GPU path.

🤖 Generated with Claude Code

Two changes so 70B+8B spec-decode runs without 2 full GPUs:

1. CPU-offload fix (sc_model.py): when replacing nn.Linear/LM-head with
   SCLinear, transfer accelerate's _hf_hook to the new module. Without it the
   offloaded weight stays a 'meta' placeholder and the SC forward crashes with
   'Tensor on device meta'. Applied to both the decoder-layer linears and the
   LM head.

2. Single-GPU headroom (loader.py, run_spec_decode.py): optional per-model
   max_memory (via TARGET_MAX_GPU_GIB / DRAFT_MAX_GPU_GIB / OFFLOAD_CPU_GIB)
   so both models leave GPU room for SC scratch buffers and don't OOM. Unset
   -> unchanged auto-dispatch.

Verified: 70B target + 8B draft spec-decode completes on a single GPU with
offload, bit-identical output to the 2-GPU sharded run (SC accept 0.964).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants