fix: SC CPU-offload support + single-GPU max_memory headroom #2
Open
heroarmor wants to merge 1 commit into
Two changes so 70B+8B spec-decode runs without 2 full GPUs:

1. CPU-offload fix (sc_model.py): when replacing nn.Linear/LM-head with SCLinear, transfer accelerate's _hf_hook to the new module. Without it the offloaded weight stays a 'meta' placeholder and the SC forward crashes with 'Tensor on device meta'. Applied to both the decoder-layer linears and the LM head.
2. Single-GPU headroom (loader.py, run_spec_decode.py): optional per-model max_memory (via TARGET_MAX_GPU_GIB / DRAFT_MAX_GPU_GIB / OFFLOAD_CPU_GIB) so both models leave GPU room for SC scratch buffers and don't OOM. Unset -> unchanged auto-dispatch.

Verified: 70B target + 8B draft spec-decode completes on a single GPU with offload, bit-identical output to the 2-GPU sharded run (SC accept 0.964).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
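The hook transfer in change 1 can be sketched roughly as below. This is an illustration, not the code in sc_model.py: the helper name `transfer_offload_hook` and its `attach` parameter are hypothetical, and the sketch assumes the replacement module keeps the same parameter names (`weight`, `bias`) that the original hook expects. The only library call used is accelerate's `add_hook_to_module`, imported lazily so the helper can be exercised without accelerate installed.

```python
def transfer_offload_hook(old_module, new_module, attach=None):
    """Re-attach accelerate's offload hook from old_module to new_module.

    Returns True if a hook was found and attached, False otherwise.
    `attach` is a hypothetical injection point for testing; by default
    accelerate's own add_hook_to_module is used.
    """
    # accelerate stores the active hook on the module as `_hf_hook`.
    hook = getattr(old_module, "_hf_hook", None)
    if hook is None:
        # Module was not dispatched with a hook: nothing to carry over.
        return False

    if attach is None:
        # Lazy import so the sketch loads even without accelerate.
        from accelerate.hooks import add_hook_to_module
        attach = add_hook_to_module

    # Attaching the hook lets it wrap the new module's forward, so
    # offloaded weights can be materialized before the SC forward runs.
    attach(new_module, hook)
    return True
```

A replacement routine would call this once per swapped module, right after constructing the replacement and before assigning it back into the parent, for both the decoder-layer linears and the LM head.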
Problem
Running 70B+8B spec-decode without two full GPUs hit two walls:
1. When `replace_linears_with_sc` (and the LM-head swap) replaced an `nn.Linear` with `SCLinear`, accelerate's offload hook (`_hf_hook`) was dropped. The offloaded weight stayed a `meta` placeholder and the SC forward crashed: `RuntimeError: Tensor on device meta`.
2. With the default auto-dispatch, the two models left no GPU headroom for SC scratch buffers, so the run hit OOM.
Fix
1. `sc_model.py`: transfer the existing `_hf_hook` to the replacement module (decoder-layer linears and the LM head), so offloaded `meta` weights are materialized at forward.
2. `loader.py` + `run_spec_decode.py`: optional per-model `max_memory` via `TARGET_MAX_GPU_GIB` / `DRAFT_MAX_GPU_GIB` / `OFFLOAD_CPU_GIB`, so each model leaves GPU headroom for SC scratch. When these are unset, auto-dispatch is unchanged.
Validation
70B target + 8B draft spec-decode now completes on a single GPU with CPU offload (70B→50 GiB, 8B→18 GiB), producing bit-identical output to the 2-GPU sharded run (SC accept 0.964, 3.86 tok/step).
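For reference, the per-model caps could be turned into a `max_memory` mapping along these lines. This is a sketch under assumptions, not the code in loader.py: the helper name `build_max_memory`, its parameters, and the idea that each variable holds a plain GiB number are illustrative guesses. The dict shape (device index or `"cpu"` mapped to a string such as `"50GiB"`) is the form accepted by `max_memory` when loading with `device_map="auto"`.

```python
import os


def build_max_memory(gpu_env, cpu_env="OFFLOAD_CPU_GIB", gpu_index=0, env=None):
    """Build a max_memory dict from environment variables, or None if unset.

    gpu_env names the per-model GPU cap variable (for example
    TARGET_MAX_GPU_GIB or DRAFT_MAX_GPU_GIB). `env` is a hypothetical
    override for testing; it defaults to os.environ.
    """
    env = os.environ if env is None else env

    gpu_gib = env.get(gpu_env)
    if not gpu_gib:
        # Unset: return None so the caller keeps plain auto-dispatch.
        return None

    # Cap this model's share of the GPU, leaving the rest as headroom.
    max_memory = {gpu_index: f"{gpu_gib}GiB"}

    cpu_gib = env.get(cpu_env)
    if cpu_gib:
        # Optional cap on how much CPU RAM offloaded weights may use.
        max_memory["cpu"] = f"{cpu_gib}GiB"
    return max_memory


# Caps matching the validation run above (50 GiB target, 18 GiB draft).
example_env = {"TARGET_MAX_GPU_GIB": "50", "DRAFT_MAX_GPU_GIB": "18"}
target_caps = build_max_memory("TARGET_MAX_GPU_GIB", env=example_env)
draft_caps = build_max_memory("DRAFT_MAX_GPU_GIB", env=example_env)
```

The resulting dict (or `None`) would then be passed as the `max_memory` argument alongside `device_map="auto"` when each model is loaded.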
Depends on the device-guard fix in scmp_kernels (CrucibleComputingGroup/scmp_kernels#20) for the multi-GPU path.
🤖 Generated with Claude Code