fix(utils): call empty_cache() after fp16→fp32 casts in prepare_model_for_kbit_training#3293

Open
umutonuryasar wants to merge 1 commit into huggingface:main from umutonuryasar:fix/kbit-training-cuda-memory-overhead

@umutonuryasar

The bulk `param.data = param.data.to(torch.float32)` loop creates temporary
tensors whose memory PyTorch's CUDA caching allocator keeps reserved even
after they are no longer referenced. As a result, ~1 GB of
reserved-but-unused CUDA memory is still held when the function returns,
which breaks training on 8 GB unified-memory devices.
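
To make the "reserved-but-unused" figure concrete, here is a minimal sketch of how the gap could be measured. The helper name `cached_unused_bytes` is hypothetical and not part of PEFT; it relies only on `torch.cuda.memory_reserved()` and `torch.cuda.memory_allocated()`, and returns 0 on machines without CUDA.

```python
import torch


def cached_unused_bytes() -> int:
    """Bytes held by the CUDA caching allocator that back no live tensor.

    Hypothetical helper for illustration only (not part of PEFT).
    """
    if not torch.cuda.is_available():
        # No CUDA device: there is no caching allocator to inspect.
        return 0
    # memory_reserved(): everything the allocator has obtained from the driver.
    # memory_allocated(): the part currently occupied by live tensors.
    return torch.cuda.memory_reserved() - torch.cuda.memory_allocated()


print(f"reserved but unused: {cached_unused_bytes() / 2**20:.1f} MiB")
```

Calling this right before and right after `prepare_model_for_kbit_training` should show whether the gap grows across the call; the exact numbers will depend on the model and device.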

Fix: add a single `torch.cuda.empty_cache()` call (guarded by
`torch.cuda.is_available()`) after the cast loop so the allocator releases
those cached blocks back to the driver immediately.
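
The pattern described above can be sketched as follows. This is an illustration under stated assumptions, not the actual diff: `upcast_half_params` is a hypothetical stand-in for the cast loop inside `prepare_model_for_kbit_training`, and treating bf16 the same as fp16 is an assumption on my part.

```python
import torch
from torch import nn


def upcast_half_params(model: nn.Module) -> nn.Module:
    """Cast half-precision parameters to fp32, then drop cached CUDA blocks.

    Hypothetical stand-in for the cast loop discussed in this PR.
    """
    for param in model.parameters():
        # Assumption: bf16 is upcast alongside fp16.
        if param.dtype in (torch.float16, torch.bfloat16):
            param.data = param.data.to(torch.float32)

    # Guarded so CPU-only environments skip the CUDA call entirely.
    if torch.cuda.is_available():
        # Returns unused cached blocks to the driver; live tensors are untouched.
        torch.cuda.empty_cache()
    return model


# Small CPU-friendly usage example.
model = upcast_half_params(nn.Linear(4, 2).half())
print({name: p.dtype for name, p in model.named_parameters()})
```

A single call after the loop, rather than one per parameter, keeps the overhead low: `empty_cache()` only needs to run once all the temporaries have been released.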

Fixes #3265
