Error when saving FSDP weights with cpu_offload=True [rank1]: AttributeError: 'Params4bit' object has no attribute 'absmax' #1405

@PanagiotisFytas


System Info

  • Python 3.10
  • torch==2.4.1 and torch==2.5.1+cu121
  • bitsandbytes==0.44.1
  • llama-recipes 0.4.0.post1 and 0.4.0

Reproduction

While running:

torchrun --nnodes 1 --nproc_per_node 2 recipes/quickstart/finetuning/finetuning.py \
    --use_peft \
    --peft_method lora \
    --model_name 'meta-llama/Llama-3.1-70B-Instruct' \
    --output_dir './my_lora_weights/70B' \
    --batch_size_training 1 \
    --batching_strategy "padding" \
    --weight_decay 0.2 \
    --num_epochs 10 \
    --dataset custom_dataset \
    --quantization '4bit' \
    --enable_fsdp True \
    --use_fast_kernels True

The code that leads to the error is from llama-recipes (https://github.com/meta-llama/llama-recipes/blob/98707b72fda091b2b20e3ab2ffaf9a86e4fccd84/src/llama_recipes/model_checkpointing/checkpoint_handler.py#L273):

def save_peft_checkpoint(model, model_path):
    """save_pretrained peft model"""

    options = StateDictOptions(full_state_dict=True, cpu_offload=True)
    
    if isinstance(model, FSDP):
        state_dict = get_model_state_dict(model, options=options)
        model.save_pretrained(model_path, state_dict=state_dict)
    else:
        model.save_pretrained(model_path)

Expected behavior

The PEFT checkpoint should be saved without errors. Instead, the run fails with:

...
[rank1]: File "/home/Documents/llama-recipes/src/llama_recipes/utils/train_utils.py", line 259, in train
[rank1]: save_peft_checkpoint(model, train_config.output_dir)
[rank1]: File "/home/Documents/llama-recipes/src/llama_recipes/model_checkpointing/checkpoint_handler.py", line 276, in save_peft_checkpoint
[rank1]: state_dict = get_model_state_dict(model, options=options)
[rank1]: File "/home/miniconda3/envs/llama_recipes_new/lib/python3.10/site-packages/torch/distributed/checkpoint/state_dict.py", line 995, in get_model_state_dict
[rank1]: model_state_dict = _get_model_state_dict(model, info)
[rank1]: File "/home/miniconda3/envs/llama_recipes_new/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank1]: return func(*args, **kwargs)
[rank1]: File "/home/miniconda3/envs/llama_recipes_new/lib/python3.10/site-packages/torch/distributed/checkpoint/state_dict.py", line 475, in _get_model_state_dict
[rank1]: fqns = _get_fqns(model, key)
[rank1]: File "/home/miniconda3/envs/llama_recipes_new/lib/python3.10/site-packages/torch/distributed/checkpoint/state_dict.py", line 224, in _get_fqns
[rank1]: curr_obj = getattr(curr_obj, curr_obj_name)
[rank1]: AttributeError: 'Params4bit' object has no attribute 'absmax'
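The last two frames suggest that `_get_fqns` resolves each dotted state-dict key by walking the module tree with `getattr`, one path component at a time. The plain-Python sketch below imitates that walk to show how a key such as `weight.absmax` could fail if the statistic is stored under a nested attribute rather than directly on the parameter object. It is an illustration only, not the torch or bitsandbytes code: `QuantState`, `FakeParams4bit`, `FakeLinear4bit`, and `resolve` are hypothetical stand-ins, and the assumption that the key names `absmax` directly under `weight` is inferred from the error message.

```python
class QuantState:
    """Hypothetical holder for quantization statistics."""

    def __init__(self):
        self.absmax = [0.5, 0.25]


class FakeParams4bit:
    """Hypothetical quantized parameter: absmax lives under quant_state only."""

    def __init__(self):
        self.quant_state = QuantState()


class FakeLinear4bit:
    """Hypothetical layer exposing the quantized parameter as `weight`."""

    def __init__(self):
        self.weight = FakeParams4bit()


def resolve(root, key):
    """Walk a dotted key with getattr, one component at a time."""
    obj = root
    for name in key.split("."):
        obj = getattr(obj, name)
    return obj


layer = FakeLinear4bit()

# The nested path exists, so the walk succeeds.
nested = resolve(layer, "weight.quant_state.absmax")

# The flattened key assumes `absmax` sits directly on the parameter object.
try:
    resolve(layer, "weight.absmax")
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(nested)
print(error_message)
```

If this reading is right, the failure comes from the key-to-attribute mapping rather than from the offloading step itself, although the traceback alone does not confirm that.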

According to meta-llama/llama-cookbook#674, a temporary workaround is to set cpu_offload=False, but this is only a band-aid: it avoids the error by disabling CPU offloading altogether.
