fix: SC CPU-offload support + single-GPU max_memory headroom by heroarmor · Pull Request #2 · CrucibleComputingGroup/scmp_speculative_decoding

heroarmor · 2026-05-27T07:33:40Z

Problem

Running 70B+8B spec-decode without two full GPUs hit two walls:

CPU offload crashed — when replace_linears_with_sc (and the LM-head swap) replace an nn.Linear with SCLinear, accelerate's offload hook (_hf_hook) was dropped. The offloaded weight stayed a meta placeholder and the SC forward crashed: RuntimeError: Tensor on device meta.
Single-GPU OOM — both models packed the card, leaving no room for the SC kernels' scratch buffers.

Fix

sc_model.py — transfer the existing _hf_hook to the replacement module (decoder-layer linears and the LM head), so offloaded meta weights are materialized at forward.
loader.py + run_spec_decode.py — optional per-model max_memory via TARGET_MAX_GPU_GIB / DRAFT_MAX_GPU_GIB / OFFLOAD_CPU_GIB, so each model leaves GPU headroom for SC scratch. Unset → unchanged auto-dispatch.
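
To make the sc_model.py change concrete, here is a dependency-free toy model of the failure and the fix. It is a sketch, not the PR's code: `ToyOffloadHook`, `ToyLinear`, `ToySCLinear`, and `replace_linear` are hypothetical stand-ins, and `None` plays the role of a meta placeholder. In real code the attach step would presumably go through accelerate (for example `accelerate.hooks.add_hook_to_module`) rather than a plain attribute copy.

```python
class ToyOffloadHook:
    """Toy stand-in for an offload hook: loads the weight right before forward."""

    def __init__(self, cpu_store):
        self.cpu_store = cpu_store  # weights kept off the GPU

    def pre_forward(self, module):
        module.weight = self.cpu_store["weight"]  # materialize the weight

    def post_forward(self, module):
        module.weight = None  # drop back to the placeholder


class ToyLinear:
    """Toy layer whose weight is only a placeholder until a hook loads it."""

    def __init__(self):
        self.weight = None  # None plays the role of a meta tensor
        self._hf_hook = None

    def compute(self, x):
        if self.weight is None:
            raise RuntimeError("Tensor on device meta")
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.weight]

    def __call__(self, x):
        hook = self._hf_hook
        if hook is not None:
            hook.pre_forward(self)
        try:
            return self.compute(x)
        finally:
            if hook is not None:
                hook.post_forward(self)


class ToySCLinear(ToyLinear):
    """Hypothetical replacement layer with the same interface."""


def replace_linear(old, transfer_hook=True):
    """Swap in the replacement layer, optionally carrying the old hook over."""
    new = ToySCLinear()
    if transfer_hook and old._hf_hook is not None:
        new._hf_hook = old._hf_hook  # the step the fix adds
    return new


old = ToyLinear()
old._hf_hook = ToyOffloadHook({"weight": [[1, 2], [3, 4]]})

fixed = replace_linear(old)                        # hook carried over
broken = replace_linear(old, transfer_hook=False)  # hook dropped
```

Calling `fixed([1, 1])` materializes the weight and computes normally, while `broken([1, 1])` raises the same kind of "Tensor on device meta" error described under Problem, because nothing ever loads the placeholder.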

Validation

70B target + 8B draft spec-decode now completes on a single GPU with CPU offload (70B→50 GiB, 8B→18 GiB), producing bit-identical output to the 2-GPU sharded run (SC accept 0.964, 3.86 tok/step).
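
As an illustration of how the caps above might be expressed, the sketch below turns the three environment variables into per-model `max_memory` maps. The helper name `build_max_memory`, the single-GPU index 0, and the 200 GiB CPU budget are assumptions for the example; how loader.py actually parses these variables is not shown in this PR description.

```python
import os


def build_max_memory(gpu_var, cpu_var="OFFLOAD_CPU_GIB", env=None, gpu_index=0):
    """Return a max_memory dict for one model, or None if the GPU cap is unset.

    Hypothetical helper: assumes each variable holds a whole number of GiB.
    """
    env = os.environ if env is None else env
    gpu_gib = env.get(gpu_var)
    if not gpu_gib:
        return None  # unset -> caller keeps the default auto-dispatch
    max_memory = {gpu_index: f"{int(gpu_gib)}GiB"}
    cpu_gib = env.get(cpu_var)
    if cpu_gib:
        max_memory["cpu"] = f"{int(cpu_gib)}GiB"
    return max_memory


# GPU caps taken from the validation run; the CPU budget is an assumed value.
example_env = {
    "TARGET_MAX_GPU_GIB": "50",
    "DRAFT_MAX_GPU_GIB": "18",
    "OFFLOAD_CPU_GIB": "200",
}
target_mem = build_max_memory("TARGET_MAX_GPU_GIB", env=example_env)
draft_mem = build_max_memory("DRAFT_MAX_GPU_GIB", env=example_env)
```

A dict of this shape can be passed as the `max_memory` argument alongside `device_map="auto"` when loading a model with transformers; a `None` result means the argument is simply omitted.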

Depends on the device-guard fix in scmp_kernels (CrucibleComputingGroup/scmp_kernels#20) for the multi-GPU path.

🤖 Generated with Claude Code

Two changes so 70B+8B spec-decode runs without 2 full GPUs:

1. CPU-offload fix (sc_model.py): when replacing nn.Linear/LM-head with SCLinear, transfer accelerate's _hf_hook to the new module. Without it the offloaded weight stays a 'meta' placeholder and the SC forward crashes with 'Tensor on device meta'. Applied to both the decoder-layer linears and the LM head.

2. Single-GPU headroom (loader.py, run_spec_decode.py): optional per-model max_memory (via TARGET_MAX_GPU_GIB / DRAFT_MAX_GPU_GIB / OFFLOAD_CPU_GIB) so both models leave GPU room for SC scratch buffers and don't OOM. Unset -> unchanged auto-dispatch.

Verified: 70B target + 8B draft spec-decode completes on a single GPU with offload, bit-identical output to the 2-GPU sharded run (SC accept 0.964).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

heroarmor requested review from Allenjin123 and Copilot May 27, 2026 21:16

Copilot started reviewing on behalf of heroarmor May 27, 2026 21:16

Copilot AI reviewed May 27, 2026

heroarmor wants to merge 1 commit into feat/per-row-halve-bitrev-defaults from fix/sc-cpu-offload-and-single-gpu-maxmem
