Fail loud on save_pretrained() for unsharded LoRA tensors by akshansh47 · Pull Request #3251 · huggingface/peft

akshansh47 · 2026-05-21T17:00:30Z

What

Refuse to write LoRA adapters whose lora_A / lora_B tensors look like ungathered shards (1-D or zero-sized). This is the canonical signature of an export that ran without gathering DeepSpeed ZeRO-3 / FSDP shards: the on-disk artifact looks structurally valid (weight files and adapter_config.json are present), but downstream loaders fail with confusing index errors at first use, e.g. IndexError: too many indices for tensor of dimension 1 in vLLM's slice_lora_b during hot-swap.

The new _validate_lora_adapter_state_dict helper surfaces the failure at save_pretrained() time with an actionable hint pointing at deepspeed.zero.GatheredParameters / FullyShardedDataParallel.summon_full_params, instead of corrupting the artifact and deferring the crash.
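To make the check concrete, here is a minimal sketch of the kind of shape validation described above. It is not the PR's implementation: the function name check_lora_shapes, the key-matching rule, and the error wording are assumptions made for illustration, and the stand-in objects only mimic the .shape attribute that torch tensors expose.

```python
from types import SimpleNamespace


def check_lora_shapes(state_dict, adapter_name="default"):
    """Raise ValueError if any lora_A / lora_B entry is not a real matrix.

    Hypothetical helper for illustration; not the function added by the PR.
    """
    problems = []
    for key, tensor in state_dict.items():
        # Only the low-rank factors are inspected. Keys such as
        # "...lora_magnitude_vector..." or "...lora_E..." never match,
        # because the comparison is on whole dot-separated components.
        parts = key.split(".")
        if "lora_A" not in parts and "lora_B" not in parts:
            continue
        shape = tuple(tensor.shape)
        # A gathered LoRA factor has at least two dimensions, all non-zero.
        if len(shape) < 2 or 0 in shape:
            problems.append(f"{key} has shape {shape}")
    if problems:
        raise ValueError(
            f"Adapter '{adapter_name}' has LoRA tensors that look like "
            f"ungathered shards: {'; '.join(problems)}. Gather the parameters "
            "(deepspeed.zero.GatheredParameters or "
            "FullyShardedDataParallel.summon_full_params) before saving."
        )


# Stand-ins with a .shape attribute keep the sketch dependency-free.
gathered = {
    "model.q_proj.lora_A.weight": SimpleNamespace(shape=(8, 64)),
    "model.q_proj.lora_B.weight": SimpleNamespace(shape=(64, 8)),
    "model.q_proj.lora_magnitude_vector.weight": SimpleNamespace(shape=(64,)),
    "model.q_proj.bias": SimpleNamespace(shape=(64,)),
}
ungathered = {
    "model.q_proj.lora_A.weight": SimpleNamespace(shape=(0,)),
    "model.q_proj.lora_B.weight": SimpleNamespace(shape=(8,)),
}

check_lora_shapes(gathered)  # well-formed input passes silently
```

Calling check_lora_shapes(ungathered) raises a ValueError that names both offending keys, while the 1-D magnitude vector and bias in the first dictionary are ignored.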

Why

This failure mode is a recurring cause of "my LoRA loads fine in HF/Transformers but breaks in vLLM" reports, and it is currently silent. Examples:

vllm-project/vllm#28640: [Bug]: LoRA/Adapter Loading Error with Qwen3-VL-8B-Instruct Multimodal Model in vLLM Deployment (AssertionError in lora_shrink_op)
huggingface/transformers#45313: Qwen3.5: DeepSpeed ZeRO-3 fails to load weights for language_model (ZeRO-3 weight-gather failure mode)

The cost of the current behavior is hours of debugging aimed at the wrong layer (usually vLLM, sometimes the model). The validator turns it into a 30-second fix.
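For context, the fix on the training side usually amounts to gathering the LoRA parameters before saving. The following is a rough sketch for the DeepSpeed ZeRO-3 case only, under several assumptions: the function name save_lora_under_zero3 is hypothetical, the "lora_" substring filter is an assumed way of selecting adapter parameters, and torch.distributed is assumed to be initialized. It has not been validated against every PEFT or DeepSpeed version. Under FSDP, FullyShardedDataParallel.summon_full_params plays the analogous role.

```python
def save_lora_under_zero3(peft_model, output_dir):
    """Gather ZeRO-3 shards of the LoRA weights, then save from rank 0.

    Hypothetical helper for illustration; adapt to your training loop.
    """
    # Imported lazily so the sketch can be read without a GPU stack installed.
    import deepspeed
    import torch.distributed as dist

    # Assumption: adapter parameters are the ones with "lora_" in their name.
    lora_params = [
        (name, param)
        for name, param in peft_model.named_parameters()
        if "lora_" in name
    ]

    # Every rank must enter the context: the gather is a collective operation.
    # modifier_rank=None means the gathered values are only read, not modified.
    with deepspeed.zero.GatheredParameters(
        [param for _, param in lora_params], modifier_rank=None
    ):
        # Copy while the full tensors are materialized; outside the context
        # the parameters shrink back to their partitioned placeholders.
        full_state = {
            name: param.detach().cpu().clone() for name, param in lora_params
        }

    # Only one process writes the adapter files.
    if dist.get_rank() == 0:
        peft_model.save_pretrained(output_dir, state_dict=full_state)
```

Passing the gathered copy through save_pretrained(state_dict=...) means the tensors written to disk are the full matrices rather than whatever the live, partitioned parameters happen to hold.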

Scope

The validator only inspects .lora_A / .lora_B tensors. Legitimately 1-D parameters such as DoRA's lora_magnitude_vector and AdaLoRA's lora_E are not affected. Non-LoRA adapter types (BOFT, OFT, P-tuning, prefix tuning, prompt tuning, etc.) are not touched.

Tests

tests/test_initialization.py::TestSaveValidatesLoraShapes — 7 cases:

happy path (well-formed state dict + DoRA magnitude + non-LoRA bias)
1-D lora_A raises (parameterized: shape (0,) and shape (rank,))
0-sized lora_B raises
error message includes adapter name
end-to-end via save_pretrained(state_dict=...) raises
end-to-end happy path still writes adapter_config.json

Adjacent test classes (TestLoraInitialization, TestNoInfiniteRecursionDeepspeed) pass locally: 121/121 existing + 7/7 new = 128/128. make quality is clean.

Backwards compatibility

Pure addition. No public API changes; the helper is _-prefixed and only invoked from inside save_pretrained. The only behavior change for existing code is that what was previously silent corruption now raises with an actionable message — which is the intent of the patch.

Made with Cursor

Refuse to write LoRA adapters whose lora_A / lora_B tensors look like ungathered shards (1-D or zero-sized). This is the canonical signature of an export that ran without gathering DeepSpeed ZeRO-3 / FSDP shards: the on-disk artifact looks structurally valid (filenames + adapter_config.json present), but downstream loaders fail with confusing index errors at first use (e.g. vLLM hot-swap's slice_lora_b in vllm-project/vllm#28640, ZeRO-3 load failures reported in huggingface/transformers#45313).

The new _validate_lora_adapter_state_dict helper surfaces the failure at write time with an actionable hint pointing at deepspeed.zero.GatheredParameters / FullyShardedDataParallel.summon_full_params, instead of corrupting the artifact and deferring the crash.

Scope: only .lora_A / .lora_B keys are inspected. Legitimately 1-D parameters (DoRA's lora_magnitude_vector, AdaLoRA's lora_E) and non-LoRA adapter types are unaffected.

Tests in tests/test_initialization.py::TestSaveValidatesLoraShapes cover the happy path, 1-D shards, 0-sized tensors, error message content, and the end-to-end save_pretrained() integration. Adjacent test classes (TestLoraInitialization, TestNoInfiniteRecursionDeepspeed) verified passing locally, with no regressions.

Co-authored-by: Cursor <cursoragent@cursor.com>

akshansh47 wants to merge 1 commit into huggingface:main from akshansh47:fix/validate-lora-shapes-on-save
