Mica #3260

13 changes: 13 additions & 0 deletions docs/source/developer_guides/lora.md
Original file line number Diff line number Diff line change
@@ -54,6 +54,19 @@ lora_config = LoraConfig(init_lora_weights="pissa_niter_[number of iters]", ...)
```
For detailed instructions on using PiSSA, please follow [these instructions](https://github.com/huggingface/peft/tree/main/examples/pissa_finetuning).

### MiCA

[MiCA](https://arxiv.org/abs/2604.01694) (Minor Component Adaptation) is a complement to PiSSA: instead of initializing from the *principal* singular components, MiCA uses the *minor* ones. Concretely, with `W = U Σ V^T`, MiCA sets `B = U[:, -r:]` (the `r` left singular vectors associated with the smallest singular values) and `A = 0`. During training, only `A` is updated; `B` is frozen. The intuition is that the minor singular directions are largely unused by the pretrained task and therefore offer a more "plastic" subspace for injecting new knowledge while preserving pretrained capabilities.

Because `A == 0` at init, the adapter contribution `B · A == 0` and the model output is preserved exactly at step 0, so no residual subtraction on the base weight is needed (unlike PiSSA). Since only `A` (shape `r × in_features`) is trainable, MiCA trains `r · in_features` parameters per adapted layer versus `r · (in_features + out_features)` for LoRA at the same `r`, which is roughly half when `in_features ≈ out_features`.
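
The parameter-count claim can be checked with simple arithmetic. The sketch below is illustrative only: the helper names and the layer shapes are assumptions made for this example and are not part of PEFT.

```python
def lora_trainable_params(in_features: int, out_features: int, r: int) -> int:
    """LoRA trains both A (r x in_features) and B (out_features x r)."""
    return r * in_features + out_features * r


def mica_trainable_params(in_features: int, out_features: int, r: int) -> int:
    """MiCA trains only A (r x in_features); B (out_features x r) is frozen."""
    return r * in_features


# Square weight (illustrative shape): the ratio is exactly one half.
lora_square = lora_trainable_params(4096, 4096, 16)  # 131072
mica_square = mica_trainable_params(4096, 4096, 16)  # 65536

# Non-square weight (illustrative shape): out_features > in_features,
# so the frozen B holds most of the adapter parameters.
lora_rect = lora_trainable_params(4096, 11008, 16)
mica_rect = mica_trainable_params(4096, 11008, 16)

print(mica_square / lora_square)
print(mica_rect / lora_rect)
```

For non-square weights the saving depends on which dimension is larger, so "roughly half" is best read as a statement about square projections.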

```python
from peft import LoraConfig
config = LoraConfig(init_lora_weights="mica", r=16, target_modules=["q_proj", "v_proj"], ...)
```

MiCA currently supports `nn.Linear` and `nn.Embedding` target modules. The chosen rank must satisfy `r <= min(in_features, out_features)` for linear layers and `r <= min(num_embeddings, embedding_dim)` for embedding layers. For detailed usage, see [these instructions](https://github.com/huggingface/peft/tree/main/examples/mica_finetuning).

### CorDA

[CorDA](https://huggingface.co/papers/2406.05223) builds task-aware LoRA adapters from a weight decomposition oriented by the context of the downstream task to learn (instruction-previewed mode, IPM) or of the world knowledge to maintain (knowledge-preserved mode, KPM).
80 changes: 80 additions & 0 deletions examples/mica_finetuning/README.md
@@ -0,0 +1,80 @@
# MiCA: Minor Component Adaptation

## Introduction ([Paper](https://arxiv.org/abs/2604.01694))

Minor Component Adaptation (MiCA) is a parameter-efficient fine-tuning method closely related to LoRA. Like LoRA, MiCA inserts a low-rank update `ΔW = (α/r) · B · A` into a pretrained weight `W ∈ R^{out×in}`. Unlike LoRA, MiCA initializes the matrices from the singular value decomposition of `W` and trains only one of them:

- Compute the SVD `W = U Σ V^T`.
- Initialize `B = U[:, -r:]` — the `r` left singular vectors associated with the **smallest** singular values.
- Initialize `A = 0`.
- During training, optimize only `A`; `W` and `B` remain frozen.
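
The initialization steps listed above can be sketched in plain PyTorch. This is a minimal illustration on a random matrix standing in for a pretrained weight, not the PEFT implementation; the shapes and variable names are assumptions chosen for the example.

```python
import torch

torch.manual_seed(0)

out_features, in_features, r = 8, 6, 2
W = torch.randn(out_features, in_features)  # stand-in for a pretrained weight

# Reduced SVD: U is (out, k), S is (k,), Vh is (k, in) with k = min(out, in).
# torch.linalg.svd returns the singular values in descending order.
U, S, Vh = torch.linalg.svd(W, full_matrices=False)

# The last r columns of U belong to the r smallest singular values.
B = U[:, -r:].contiguous()       # shape (out, r), frozen during training
A = torch.zeros(r, in_features)  # shape (r, in), the only trainable matrix

delta_W = B @ A                  # all zeros at initialization
print(B.shape, A.shape, delta_W.abs().max())
```

Since the columns of `B` are orthonormal, any later update `B @ A` stays inside the span of the minor left singular vectors.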

The motivation is that the *minor* singular directions of a pretrained weight encode subspaces that are largely unused by the original task. Restricting adaptation to these directions provides a more "plastic" subspace for knowledge injection, with less risk of overwriting capabilities encoded in the dominant subspace. Empirically, MiCA improves knowledge acquisition while reducing the trainable parameter footprint compared with LoRA at the same rank: because only `A` is trained, each adapted layer has `r · in` trainable parameters instead of `r · (in + out)`, which is roughly half for square weights.

Because `A == 0` at initialization, the adapter contribution `B · A == 0` and the model's forward output is preserved exactly at step 0 — no residual subtraction is needed on the base weight.
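
To make the no-op-at-init and frozen-`B` behaviour concrete, here is a small self-contained sketch. `TinyMiCALinear` is a hypothetical toy module written for this example; it is not the PEFT layer or variant class, and the sizes and the dummy loss are arbitrary.

```python
import torch
from torch import nn

torch.manual_seed(0)


class TinyMiCALinear(nn.Module):
    """Hypothetical toy layer for illustration; not the PEFT implementation."""

    def __init__(self, base: nn.Linear, r: int, alpha: float):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # the pretrained weight W stays frozen
        U, _, _ = torch.linalg.svd(base.weight.detach().float(), full_matrices=False)
        # B: minor left singular vectors, frozen. A: zeros, trainable.
        self.B = nn.Parameter(U[:, -r:].contiguous(), requires_grad=False)
        self.A = nn.Parameter(torch.zeros(r, base.in_features))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)


base = nn.Linear(6, 8)
layer = TinyMiCALinear(base, r=2, alpha=2.0)
x = torch.randn(4, 6)

# At step 0 the adapter contributes nothing, so outputs match the base layer.
same_at_init = torch.allclose(layer(x), base(x))

# One optimizer step on a dummy loss: only A receives an update.
B_before = layer.B.detach().clone()
trainable = [p for p in layer.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1)
layer(x).pow(2).mean().backward()
optimizer.step()

A_changed = bool(layer.A.detach().abs().sum() > 0)
B_unchanged = torch.equal(layer.B, B_before)
print(same_at_init, A_changed, B_unchanged)
```

Keeping `B` as a parameter with `requires_grad=False` means it is still saved with the module's state but is never handed to the optimizer.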

## Quick Start

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token_id = tokenizer.eos_token_id

lora_config = LoraConfig(
    init_lora_weights="mica",
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()

dataset = load_dataset("imdb", split="train[:1%]")
training_args = SFTConfig(dataset_text_field="text", max_length=128)
trainer = SFTTrainer(
    model=peft_model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
peft_model.save_pretrained("mica-llama-2-7b")
```

To reload the trained adapter:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", dtype=torch.bfloat16, device_map="auto"
)
peft_model = PeftModel.from_pretrained(model, "mica-llama-2-7b")
```

## Notes and limitations

- MiCA currently supports `nn.Linear` and `nn.Embedding` target modules.
- The chosen rank must satisfy `r <= min(in_features, out_features)` for linear layers and `r <= min(num_embeddings, embedding_dim)` for embedding layers; otherwise initialization raises `ValueError`.
- MiCA performs a full SVD of every target weight at initialization. This is a one-time cost: on the order of seconds for 7B-scale models, and more for substantially larger weight matrices (e.g. 70B-scale).
- Combining MiCA with `use_dora=True` or other LoRA variants is not supported in this initial integration.
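
The rank constraint in the notes above can be checked before building a config. The helpers below are hypothetical and written only for this example (they are not PEFT functions); they mirror the documented rule `r <= min(weight.shape)` for `nn.Linear` and `nn.Embedding`.

```python
from torch import nn


def max_mica_rank(module: nn.Module) -> int:
    """Hypothetical helper: largest r for which r minor singular vectors exist."""
    if isinstance(module, (nn.Linear, nn.Embedding)):
        return min(module.weight.shape)
    raise TypeError(f"Unsupported module type: {type(module).__name__}")


def check_mica_rank(module: nn.Module, r: int) -> None:
    """Raise ValueError when r exceeds the usable rank, as the notes describe."""
    max_r = max_mica_rank(module)
    if r > max_r:
        raise ValueError(f"r={r} exceeds the max usable rank {max_r}")


linear = nn.Linear(in_features=64, out_features=32)
embedding = nn.Embedding(num_embeddings=100, embedding_dim=16)

check_mica_rank(linear, 32)  # accepted: 32 <= min(64, 32)

try:
    check_mica_rank(embedding, 17)  # rejected: 17 > min(100, 16)
    rejected = False
except ValueError:
    rejected = True

print(max_mica_rank(linear), max_mica_rank(embedding), rejected)
```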

## Citation

```
@article{rudiger2026mica,
  title={MiCA Learns More Knowledge Than LoRA and Full Fine-Tuning},
  author={R{\"u}diger, Sten and Raschka, Sebastian},
  journal={arXiv preprint arXiv:2604.01694},
  year={2026}
}
```
80 changes: 80 additions & 0 deletions examples/mica_finetuning/mica_finetuning.py
@@ -0,0 +1,80 @@
# Copyright 2023-present the HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Minimal MiCA fine-tuning example.

Mirrors `examples/pissa_finetuning/pissa_finetuning.py` in spirit but with the MiCA-specific knobs only. MiCA
initializes `B` from the bottom-r left singular vectors of the base weight and freezes it during training; only
`A` is updated. Because `A == 0` at init, the adapter is a no-op on initialization and no residual subtraction
on the base weight is needed.
"""

from dataclasses import dataclass, field
from typing import Optional

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, HfArgumentParser
from trl import SFTConfig, SFTTrainer

from peft import LoraConfig, get_peft_model


@dataclass
class ScriptArguments(SFTConfig):
    base_model_name_or_path: Optional[str] = field(default=None, metadata={"help": "Name or path of the base model."})
    lora_r: int = field(default=16)
    lora_alpha: int = field(default=16)
    lora_dropout: float = field(default=0.0)
    target_modules: Optional[str] = field(
        default="q_proj,v_proj",
        metadata={"help": "Comma-separated module names to adapt with MiCA."},
    )
    data_path: str = field(default="imdb", metadata={"help": "HF dataset path."})
    dataset_split: str = field(default="train[:1%]")
    dataset_text_field: str = field(default="text")


def train():
    parser = HfArgumentParser(ScriptArguments)
    args = parser.parse_args_into_dataclasses()[0]

    model = AutoModelForCausalLM.from_pretrained(args.base_model_name_or_path, dtype=torch.bfloat16, device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(args.base_model_name_or_path)
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token_id = tokenizer.eos_token_id

    lora_config = LoraConfig(
        init_lora_weights="mica",
        r=args.lora_r,
        lora_alpha=args.lora_alpha,
        lora_dropout=args.lora_dropout,
        target_modules=[m.strip() for m in args.target_modules.split(",")],
        task_type="CAUSAL_LM",
    )
    peft_model = get_peft_model(model, lora_config)
    peft_model.print_trainable_parameters()

    dataset = load_dataset(args.data_path, split=args.dataset_split)
    trainer = SFTTrainer(
        model=peft_model,
        args=args,
        train_dataset=dataset,
        processing_class=tokenizer,
    )
    trainer.train()
    peft_model.save_pretrained(args.output_dir)


if __name__ == "__main__":
    train()
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
{
  "alpha_pattern": {},
  "auto_mapping": null,
  "base_model_name_or_path": null,
  "bias": "none",
  "corda_config": null,
  "eva_config": null,
  "exclude_modules": null,
  "fan_in_fan_out": false,
  "inference_mode": false,
  "init_lora_weights": "mica",
  "layer_replication": null,
  "layers_pattern": null,
  "layers_to_transform": null,
  "loftq_config": {},
  "lora_alpha": 64,
  "lora_bias": false,
  "lora_dropout": 0.0,
  "megatron_config": null,
  "megatron_core": "megatron.core",
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 32,
  "rank_pattern": {},
  "revision": null,
  "target_modules": null,
  "task_type": "CAUSAL_LM",
  "use_dora": false,
  "use_rslora": false
}
24 changes: 20 additions & 4 deletions src/peft/tuners/lora/config.py
@@ -408,7 +408,7 @@ class LoraConfig(PeftConfig):
use the original default value of `lora_alpha/r`.
modules_to_save (`List[str]`):
List of modules apart from adapter layers to be set as trainable and saved in the final checkpoint.
init_lora_weights (`bool` | `Literal["gaussian", "eva", "olora", "pissa", "pissa_niter_[number of iters]", "corda", "loftq", "orthogonal"]`):
init_lora_weights (`bool` | `Literal["gaussian", "eva", "olora", "pissa", "pissa_niter_[number of iters]", "corda", "loftq", "orthogonal", "mica"]`):
How to initialize the weights of the adapter layers. Passing True (default) results in the default
initialization from the reference implementation from Microsoft, with the LoRA B weight being set to 0.
This means that without further training, the LoRA adapter will be a no-op. Setting the initialization to
@@ -430,7 +430,10 @@ class LoraConfig(PeftConfig):
converges even more rapidly than PiSSA in Instruction-Previewed Mode, and preserves world knowledge better
than LoRA in Knowledge-Preserved Mode. Passing `"orthogonal"` results in LoRA A and B being initialized
orthogonally; in this, it resembles `"olora"`, but the base weights are left untouched (requires `r` to be
even, only supported for linear layers for now).
even, only supported for linear layers for now). Passing `"mica"` results in the initialization of <a
href='https://arxiv.org/abs/2604.01694' >Minor Component Adaptation (MiCA)</a>, which initializes B from
the r left singular vectors of the base weight associated with the smallest singular values, sets A to
zero, and freezes B during training; only A is updated. Currently supported for linear and embedding layers.
layers_to_transform (`Union[List[int], int]`):
The layer indices to transform. If a list of ints is passed, it will apply the adapter to the layer indices
that are specified in this list. If a single integer is passed, it will apply the transformations on the
@@ -566,7 +569,17 @@ class LoraConfig(PeftConfig):
    )
    init_lora_weights: (
        bool
        | Literal["gaussian", "eva", "olora", "pissa", "pissa_niter_[number of iters]", "corda", "loftq", "orthogonal"]
        | Literal[
            "gaussian",
            "eva",
            "olora",
            "pissa",
            "pissa_niter_[number of iters]",
            "corda",
            "loftq",
            "orthogonal",
            "mica",
        ]
    ) = field(
        default=True,
        metadata={
@@ -586,7 +599,10 @@ class LoraConfig(PeftConfig):
                "nonnegative integer. "
                "Passing `'corda'` results in CorDA initialization. "
                "Pass `'loftq'` to use LoftQ initialization. "
                "Pass `'orthogonal'` for orthogonal initialization of LoRA A and B."
                "Pass `'orthogonal'` for orthogonal initialization of LoRA A and B. "
                "Pass `'mica'` to use MiCA initialization, where B is set to the r left singular vectors of the "
                "base weight associated with the smallest singular values, A is set to zero, and B is frozen during "
                "training (only A is updated)."
            ),
        },
    )
86 changes: 85 additions & 1 deletion src/peft/tuners/lora/layer.py
@@ -136,6 +136,10 @@ def __init__(self, base_layer: nn.Module, ephemeral_gpu_offload: bool = False, *
        self.in_features = in_features
        self.out_features = out_features

    def delete_adapter(self, adapter_name: str) -> None:
        super().delete_adapter(adapter_name)
        self.lora_variant.pop(adapter_name, None)

    def _get_in_out_features(self, module: nn.Module) -> tuple[int, int] | tuple[None, None]:
        return _get_in_out_features(module)

@@ -231,6 +235,9 @@ def update_layer(
        elif isinstance(init_lora_weights, str) and init_lora_weights.lower() == "olora":
            with gather_params_ctx(self.get_base_layer().weight):
                self.olora_init(adapter_name)
        elif isinstance(init_lora_weights, str) and init_lora_weights.lower() == "mica":
            with gather_params_ctx(self.get_base_layer().weight):
                self.mica_init(adapter_name)
        elif init_lora_weights == "loftq":
            with gather_params_ctx(self.get_base_layer().weight):
                self.loftq_init(adapter_name, config)
@@ -395,6 +402,41 @@ def pissa_init(self, adapter_name, init_lora_weights):
        weight = transpose(weight.to(dtype), self.fan_in_fan_out)
        self.get_base_layer().weight.data = weight

    def mica_init(self, adapter_name):
        """Minor Component Adaptation (MiCA) initialization (https://arxiv.org/abs/2604.01694).

        Initializes `lora_B` from the `r` left singular vectors of the base weight associated with the smallest
        singular values, and sets `lora_A` to zero. The `lora_B` matrix is frozen during training (see
        `MiCALinearVariant.init`); only `lora_A` is updated. Because `lora_A == 0` at init, the adapter
        contribution `B @ A == 0` and the base weight does not need to be modified to preserve the forward output.
        """
        # When the adapter is being created under `init_empty_weights` (e.g. low_cpu_mem_usage=True), its parameters
        # live on the meta device and will be filled in from a checkpoint after creation. Skip the SVD in that case.
        if self.lora_B[adapter_name].weight.device.type == "meta":
            return

        weight = self.get_base_layer().weight
        dtype = weight.dtype
        if dtype not in [torch.float32, torch.float16, torch.bfloat16]:
            raise TypeError("Please initialize MiCA under float32, float16, or bfloat16.")

        weight = transpose(weight.to(torch.float32), self.fan_in_fan_out)
        # weight has shape (out_features, in_features) once transposed for fan_in_fan_out, matching nn.Linear.weight.
        # SVD: weight = U @ diag(S) @ Vh, with U: (out, k), Vh: (k, in), S sorted descending.
        # MiCA selects the LAST r left singular vectors (smallest singular values) for B and zeroes A.
        r = self.r[adapter_name]
        max_r = min(weight.shape)
        if r > max_r:
            raise ValueError(
                f"MiCA requires `r` <= min(in_features, out_features) but got r={r} for a layer with "
                f"weight shape {tuple(weight.shape)} (max usable r is {max_r})."
            )
        U, _, _ = torch.linalg.svd(weight.data, full_matrices=False)
        lora_B = U[:, -r:].contiguous()
        lora_A = torch.zeros(r, weight.shape[1], device=weight.device)
        self.lora_B[adapter_name].weight.data = lora_B.to(dtype)
        self.lora_A[adapter_name].weight.data = lora_A.to(dtype)

    def corda_init(self, adapter_name, init_lora_weights):
        linear = self.get_base_layer()
        weight = linear.weight
@@ -815,6 +857,11 @@ def resolve_lora_variant(self, config: LoraConfig, **kwargs) -> Optional[LoraVar

            return BdLoraLinearVariant()

        if isinstance(config.init_lora_weights, str) and config.init_lora_weights.lower() == "mica":
            from .variants import MiCALinearVariant

            return MiCALinearVariant()

        use_alora = config.alora_invocation_tokens is not None
        if not config.use_dora and not use_alora:
            return None
@@ -1064,6 +1111,10 @@ def __init__(
    def resolve_lora_variant(self, *, config: LoraConfig, **kwargs) -> Optional[LoraVariant]:
        if config.velora_config is not None:
            raise ValueError("VeLoRA does not support adapting embedding layers.")
        if isinstance(config.init_lora_weights, str) and config.init_lora_weights.lower() == "mica":
            from .variants import MiCAEmbeddingVariant

            return MiCAEmbeddingVariant()
        if not config.use_dora:
            return None

@@ -1116,7 +1167,10 @@ def update_layer(

        self.use_dora[adapter_name] = config.use_dora

        if init_lora_weights == "loftq":
        if isinstance(init_lora_weights, str) and init_lora_weights.lower() == "mica":
            with gather_params_ctx(self.get_base_layer().weight):
                self.mica_init(adapter_name)
        elif init_lora_weights == "loftq":
            self.loftq_init(adapter_name)
        elif init_lora_weights == "lora_ga":
            # Embedding layers don't support LoRA-GA, fall back to standard initialization
@@ -1145,6 +1199,36 @@ def output_fn(outputs):
        self.input_fns[adapter_name] = input_fn
        self.output_fns[adapter_name] = output_fn

    def mica_init(self, adapter_name):
        """Minor Component Adaptation (MiCA) initialization for embedding layers.

        The effective embedding projection has shape `(embedding_dim, num_embeddings)`, so MiCA initializes
        `lora_embedding_B` from the minor left singular vectors of `base_layer.weight.T` and sets
        `lora_embedding_A` to zero.
        """
        if self.lora_embedding_B[adapter_name].device.type == "meta":
            return

        weight = self.get_base_layer().weight
        dtype = weight.dtype
        if dtype not in [torch.float32, torch.float16, torch.bfloat16]:
            raise TypeError("Please initialize MiCA under float32, float16, or bfloat16.")

        weight = weight.to(torch.float32).T
        r = self.r[adapter_name]
        max_r = min(weight.shape)
        if r > max_r:
            raise ValueError(
                f"MiCA requires `r` <= min(num_embeddings, embedding_dim) but got r={r} for an embedding layer with "
                f"weight shape {tuple(self.get_base_layer().weight.shape)} (max usable r is {max_r})."
            )

        U, _, _ = torch.linalg.svd(weight.data, full_matrices=False)
        lora_embedding_B = U[:, -r:].contiguous()
        lora_embedding_A = torch.zeros(r, weight.shape[1], device=weight.device)
        self.lora_embedding_B[adapter_name].data = lora_embedding_B.to(dtype)
        self.lora_embedding_A[adapter_name].data = lora_embedding_A.to(dtype)

    def merge(self, safe_merge: bool = False, adapter_names: Optional[list[str]] = None) -> None:
        """
        Merge the active adapter weights into the base weights