Bug: prepare_model_for_kbit_training() never called, layer norms stay in wrong dtype #57

newtscammander · 2026-06-04T08:27:55Z


Found another issue in hf_deploy/trainer/qlora.py.

The docstring says step 2 is "Prepare for k-bit training (cast layer norms to fp32)" but the actual prepare_model_for_kbit_training() call from peft is missing entirely:

def prepare_qlora_model(model_cfg: ModelConfig, lora_cfg: LoraConfig):
    """
    Full QLoRA setup:
      1. Load model with 4-bit NF4 quantization
      2. Prepare for k-bit training (cast layer norms to fp32)  # <-- documented but never done
      3. Inject LoRA adapters
    """
    model_cfg.use_4bit = True
    model, tokenizer = load_model_and_tokenizer(model_cfg)
    model = inject_lora(model, lora_cfg)  # goes straight to LoRA injection
    return model, tokenizer

Without prepare_model_for_kbit_training, the layer norms and other non-quantized components stay in float16 instead of being upcast to float32. This causes training instability: gradients through layer norms can underflow in fp16, which is especially noticeable with longer training runs or larger models like Llama 3 8B.
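
To make the underflow point concrete, here is a minimal stdlib-only sketch (no torch needed). It round-trips one small value through half and single precision using the struct module. The magnitude 1e-8 is an illustrative assumption, not a gradient measured from a real run, and round_trip is a hypothetical helper name.

```python
import struct


def round_trip(value: float, fmt: str) -> float:
    """Store `value` in the given IEEE 754 format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


# Illustrative magnitude only; real gradient values depend on the model and data.
tiny_grad = 1e-8

as_fp16 = round_trip(tiny_grad, "e")  # "e" = half precision (fp16)
as_fp32 = round_trip(tiny_grad, "f")  # "f" = single precision (fp32)

print(f"fp16: {as_fp16!r}")  # flushed to zero: below the smallest fp16 subnormal (~6e-8)
print(f"fp32: {as_fp32!r}")  # still a nonzero value close to 1e-8
```

The same value that vanishes in fp16 survives in fp32, which is the motivation for upcasting the layer norms before training.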

The fix:

from peft import prepare_model_for_kbit_training

def prepare_qlora_model(model_cfg: ModelConfig, lora_cfg: LoraConfig):
    model_cfg.use_4bit = True
    model, tokenizer = load_model_and_tokenizer(model_cfg)
    # Step 2 from the docstring: upcast layer norms to fp32 and set up
    # gradient checkpointing before any adapters are attached.
    model = prepare_model_for_kbit_training(
        model,
        use_gradient_checkpointing=True,
    )
    model = inject_lora(model, lora_cfg)
    return model, tokenizer

Also worth noting: finetune.py already sets gradient_checkpointing=True in TrainingArguments, but without this call the gradient checkpointing hooks aren't properly set up on the quantized model. Passing use_gradient_checkpointing=True here ensures they're wired correctly before LoRA injection.

Saw noticeably more NaN losses without this fix on Mistral 7B after ~500 steps. Happy to open a PR.

Answered by SahilKumar75


SahilKumar75 (Maintainer) · 2026-06-04T08:29:38Z

Good catch, confirmed on my end too.

The call was in the docstring but never actually made it into the code: a classic case of writing the plan and forgetting to execute it.

The fix is straightforward: add the import and the call between loading and LoRA injection:

from peft import prepare_model_for_kbit_training

def prepare_qlora_model(model_cfg: ModelConfig, lora_cfg: LoraConfig):
    model_cfg.use_4bit = True
    model, tokenizer = load_model_and_tokenizer(model_cfg)
    model = prepare_model_for_kbit_training(
        model,
        use_gradient_checkpointing=True
    )
    model = inject_lora(model, lora_cfg)
    return model, tokenizer

And yeah, the gradient checkpointing thing you flagged is real: setting it in TrainingArguments doesn't do the same job. prepare_model_for_kbit_training actually upcasts the layer norms to fp32 and wires the checkpointing hooks onto the quantized model before LoRA goes in. Skipping it means you're training with layer norms in fp16 the whole time, which is where the NaN losses come from.
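
One way to check that the upcast actually happened is to tally the dtypes of the norm parameters before and after the call. The sketch below is a hypothetical helper (norm_dtype_report is not part of peft or this repo). It assumes only that the model exposes named_parameters() and that layer norm parameter names contain "norm", which holds for Llama-style naming such as input_layernorm but is a heuristic. The demo uses stand-in objects so it runs without torch.

```python
from collections import Counter
from types import SimpleNamespace


def norm_dtype_report(model) -> Counter:
    """Count dtypes of parameters whose name contains 'norm' (name-based heuristic)."""
    counts = Counter()
    for name, param in model.named_parameters():
        if "norm" in name.lower():
            counts[str(param.dtype)] += 1
    return counts


class FakeModel:
    """Stand-in exposing named_parameters() like a torch module (demo only)."""

    def __init__(self, params):
        self._params = params

    def named_parameters(self):
        return iter(self._params.items())


# Hypothetical parameter names and dtypes, chosen only to exercise the helper.
fake = FakeModel({
    "layers.0.input_layernorm.weight": SimpleNamespace(dtype="torch.float16"),
    "layers.0.self_attn.q_proj.weight": SimpleNamespace(dtype="torch.uint8"),
    "norm.weight": SimpleNamespace(dtype="torch.float32"),
})

report = norm_dtype_report(fake)
print(dict(report))
```

On a real model, any norm parameters still reported as torch.float16 after prepare_qlora_model would suggest the preparation step was skipped.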

PR welcome, small change.
