Mica #3260
Adds Minor Component Adaptation (https://arxiv.org/abs/2604.01694) as a new init scheme for LoraConfig, triggered by `init_lora_weights="mica"`. Resolves huggingface#3142.

MiCA initializes `B = U[:, -r:]` (the r left singular vectors of the base weight associated with the smallest singular values) and `A = 0`. During training only `A` is updated; `B` is frozen. Because `A == 0` at init, the adapter contribution `B @ A` is zero and the forward output is preserved exactly, with no need to mutate the base weight.

Implementation:

* `LoraConfig.init_lora_weights` accepts `"mica"`.
* `LoraLayer.mica_init` performs the SVD-based init for Linear targets and validates `r <= min(in_features, out_features)`. The init is skipped when the adapter parameters are on the meta device (low_cpu_mem_usage path).
* `MiCALinearVariant` is a `LoraVariant` that resolves for the MiCA init scheme. Forward and merge semantics are vanilla LoRA; the only override of substance is the new `update_requires_grad` hook.
* `LoraVariant.update_requires_grad(module, adapter_name)` is a new entry point on the variant base class. Default is a no-op so existing variants are unaffected. `LoraModel._mark_only_adapters_as_trainable` invokes it for every adapter after the base trainability marking, which is where MiCA freezes `lora_B`.

MiCA is currently restricted to `nn.Linear`. Passing `init_lora_weights="mica"` on a non-Linear target raises `ValueError: Unknown initialization` via the existing `reset_lora_parameters` fallback.

Tests:

* `tests/test_initialization.py` adds 6 MiCA-specific tests covering init correctness, that B is the minor (not major) subspace, B-freeze, train step behavior, save/load round-trip, and the unsupported-layer error.
* `tests/test_custom_models.py` adds two parametrized MiCA entries to `TEST_CASES` for broader coverage (save/load, merge/unmerge, autocast).
* `tests/testing_common.py` and `tests/test_custom_models.py` relax two assertions that previously required *every* `lora_*` parameter to be trainable / receive gradients, to accommodate variants like MiCA that intentionally freeze a subset.

Docs and example:

* `docs/source/developer_guides/lora.md` adds a MiCA section.
* `examples/mica_finetuning/` provides a runnable example and README.
* `method_comparison/MetaMathQA/experiments/lora/llama-3.2-3B-rank32-mica/` registers a benchmark config.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
BenjaminBossan
left a comment
Thanks for this PR to add MiCA. For this PR, I have focused on the implementation; I haven't checked the example and documentation yet.
I have a couple of smaller comments, but there is also a larger issue, which is that the way that requires_grad is set is not sufficient yet. Right now, it only covers the get_peft_model path, but that's not the only one that can modify requires_grad. Take this example:
```python
from pprint import pprint

import torch
from torch import nn

from peft import LoraConfig, get_peft_model


class SimpleMlp(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 10)
        self.fc2 = nn.Linear(10, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x


model = SimpleMlp()
config0 = LoraConfig(target_modules=["fc1"], init_lora_weights="mica")
model = get_peft_model(model, config0)
layers_with_requires_grad = [name for name, param in model.named_parameters() if param.requires_grad]
print("Layers with requires_grad=True after first LoRA")
pprint(layers_with_requires_grad)
# correct, should be ['base_model.model.fc1.lora_A.default.weight']

config1 = LoraConfig(target_modules=["fc1", "fc2"], init_lora_weights="mica")
model.add_adapter("other", config1)
model.set_adapter("other")
layers_with_requires_grad = [name for name, param in model.named_parameters() if param.requires_grad]
print("\nLayers with requires_grad=True after switching to other adapter")
pprint(layers_with_requires_grad)
# incorrect, should be ['base_model.model.fc1.lora_A.other.weight', 'base_model.model.fc2.lora_A.other.weight']

model.set_adapter("default")
layers_with_requires_grad = [name for name, param in model.named_parameters() if param.requires_grad]
print("\nLayers with requires_grad=True after switching back to default adapter")
pprint(layers_with_requires_grad)
# incorrect, should be ['base_model.model.fc1.lora_A.default.weight']
```

If you run this, you'll see that the add_adapter and set_adapter paths are not covered.
Therefore, I have a different suggestion which implements this feature in a more declarative way, LMK what you think about that:
First, let's remove update_requires_grad completely. Next, on peft.tuners.tuners_utils.BaseTunerLayer, let's add a class attribute frozen_peft_weight_names: dict[str, tuple[str, ...]] = {}. This will contain a mapping from adapter name to the keys of the PEFT weights that should be frozen.
Second, in MiCALinearVariant.init, for the MiCA adapter, we add an entry to frozen_peft_weight_names for LoRA B. Let's ensure not to simply mutate frozen_peft_weight_names, as it's a class attribute. Instead, re-assign a copy of the mutated dict.
Finally, in _mark_only_adapters_as_trainable and in peft.tuners.tuners_utils.set_adapter, we can check if, for a given PEFT layer, there is an entry in frozen_peft_weight_names, and if we find it, set requires_grad = False.
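To make the proposal concrete, here is a minimal sketch of the declarative approach. It does not use PEFT at all: `ToyTunerLayer`, its `add_adapter` signature, and the `freeze_B` flag are hypothetical stand-ins for `BaseTunerLayer` and the MiCA variant, and only the `frozen_peft_weight_names` name is taken from the suggestion above. The real implementation may look quite different.

```python
from __future__ import annotations

from torch import nn


class ToyTunerLayer(nn.Module):
    # Hypothetical stand-in for BaseTunerLayer. The class-level default is
    # shared by all instances, so it must never be mutated in place.
    frozen_peft_weight_names: dict[str, tuple[str, ...]] = {}

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.base_layer = nn.Linear(in_features, out_features)
        self.base_layer.requires_grad_(False)
        self.lora_A = nn.ModuleDict()
        self.lora_B = nn.ModuleDict()

    def add_adapter(self, name: str, r: int, freeze_B: bool = False) -> None:
        self.lora_A[name] = nn.Linear(self.base_layer.in_features, r, bias=False)
        self.lora_B[name] = nn.Linear(r, self.base_layer.out_features, bias=False)
        if freeze_B:
            # Re-assign a copy instead of mutating the shared class attribute.
            self.frozen_peft_weight_names = {
                **self.frozen_peft_weight_names,
                name: ("lora_B",),
            }

    def set_adapter(self, active: str) -> None:
        # Only the active adapter is trainable, minus its declared frozen weights.
        for attr in ("lora_A", "lora_B"):
            for name, module in getattr(self, attr).items():
                frozen = attr in self.frozen_peft_weight_names.get(name, ())
                module.requires_grad_(name == active and not frozen)


def trainable(module: nn.Module) -> list[str]:
    return [n for n, p in module.named_parameters() if p.requires_grad]


layer = ToyTunerLayer(10, 10)
layer.add_adapter("default", r=4, freeze_B=True)  # MiCA-like adapter
layer.add_adapter("other", r=4, freeze_B=True)    # MiCA-like adapter
layer.add_adapter("plain", r=4)                   # vanilla LoRA-like adapter

layer.set_adapter("other")
after_other = trainable(layer)
layer.set_adapter("default")
after_default = trainable(layer)
layer.set_adapter("plain")
after_plain = trainable(layer)
```

Because the freeze is looked up from the mapping every time trainability is recomputed, switching adapters back and forth cannot accidentally re-enable gradients on a frozen weight.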
Let's also add unit tests to ensure this works as expected, my example could serve as a template.
Moreover, MiCA currently wouldn't work for LoRA applied to embedding layers, right? I think it should be easy enough to add support for those.
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
Thanks for the review. I pushed an update addressing the MiCA trainability issue with a declarative frozen_peft_weight_names mapping on BaseTunerLayer, removing update_requires_grad. The freeze is enforced through initial setup, adapter switching, and set_requires_grad. I also added nn.Embedding support, removed redundant tests, added the r > max_r error test, and added a regression test for switching/reusing adapters.
Added support for MiCA as proposed in #3142.
What does this PR do?
This PR adds MiCA (Minor Component Adaptation) as a LoRA initialization variant, exposed via `init_lora_weights="mica"` on `LoraConfig`.
MiCA initializes LoRA from the minor singular subspace of the base weight matrix. For a target linear layer with weight matrix $W = U \Sigma V^T$, MiCA initializes:
B = U[:, -r:]
A = 0
where B contains the r left singular vectors corresponding to the smallest singular values.
Since A is initialized to zero, the adapter contribution is zero at initialization and the base model output is preserved.
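The initialization can be sketched in plain PyTorch as follows. This is an illustration of the formulas above rather than the code added by this PR; the shapes, the seed, and the variable names are arbitrary choices made for the example.

```python
import torch

torch.manual_seed(0)

out_features, in_features, r = 8, 6, 2
# Base weight with shape (out_features, in_features), as in nn.Linear.
W = torch.randn(out_features, in_features)

# Thin SVD: W = U @ diag(S) @ Vh, with singular values in S sorted descending.
U, S, Vh = torch.linalg.svd(W, full_matrices=False)

# B holds the left singular vectors of the r smallest singular values.
B = U[:, -r:].clone()             # shape (out_features, r)
# A starts at zero, so the adapter contribution B @ A is zero at init.
A = torch.zeros(r, in_features)   # shape (r, in_features)

x = torch.randn(4, in_features)
base_out = x @ W.T
adapted_out = base_out + x @ (B @ A).T
```

Note that `r` must not exceed `min(in_features, out_features)`, since the thin SVD yields only that many singular vectors.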
Why freeze lora_B?
MiCA treats lora_B as the fixed minor-component subspace and trains only lora_A.
Freezing lora_B preserves the intended MiCA constraint during training. Without this, the adapter would no longer remain constrained to the selected minor subspace and would behave more like an unconstrained LoRA update.
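The effect of the freeze can be checked with a small sketch, again in plain PyTorch and not taken from this PR: with `B` frozen, a few optimizer steps change only `A`, and the weight update `B @ A` stays inside the column space of `B`. The optimizer, learning rate, and random data are assumptions made purely for illustration.

```python
import torch

torch.manual_seed(0)

out_features, in_features, r = 8, 6, 2
W = torch.randn(out_features, in_features)
U, _, _ = torch.linalg.svd(W, full_matrices=False)

# B is the fixed minor-component subspace; only A is trainable.
B = torch.nn.Parameter(U[:, -r:].clone(), requires_grad=False)
A = torch.nn.Parameter(torch.zeros(r, in_features))

optimizer = torch.optim.SGD([A], lr=0.1)
x = torch.randn(16, in_features)
target = torch.randn(16, out_features)

B_before = B.detach().clone()
for _ in range(3):
    optimizer.zero_grad()
    out = x @ (W + B @ A).T
    loss = torch.nn.functional.mse_loss(out, target)
    loss.backward()
    optimizer.step()

delta_W = (B @ A).detach()
# B has orthonormal columns, so B @ B.T projects onto span(B).
projector = B @ B.T
residual = delta_W - projector @ delta_W
```

If `B` were trainable as well, `delta_W` would no longer be guaranteed to stay in the subspace selected at initialization.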
This is why this PR also updates the LoRA variant interface so that variants can customize adapter trainability.
Recommended usage
MiCA is primarily intended for continued pretraining / domain-adaptive pretraining, not for instruction fine-tuning.
The recommended workflow is:
1. Start from the base model, not the instruct/chat model.
2. Train the MiCA adapter on continued-pretraining data.
3. Merge the trained adapter into the model weights.
4. Use the resulting merged model as the adapted base for subsequent instruction/chat tuning, or merge/apply it before using the corresponding instruct/chat model setup.
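The merge step can be illustrated with a short sketch in plain PyTorch. It assumes the usual LoRA merge rule, where the merged weight is the base weight plus `scaling * B @ A`; the `scaling` value and the random `A` (standing in for a trained `lora_A`) are placeholders for this example, not values from the PR.

```python
import torch
from torch import nn

torch.manual_seed(0)

in_features, out_features, r = 6, 8, 2
scaling = 1.0  # placeholder; LoRA typically uses lora_alpha / r

base = nn.Linear(in_features, out_features, bias=False)
U, _, _ = torch.linalg.svd(base.weight.detach(), full_matrices=False)
B = U[:, -r:].clone()
A = torch.randn(r, in_features) * 0.01  # stands in for a trained lora_A

x = torch.randn(4, in_features)
with torch.no_grad():
    # Adapter applied on the side of the frozen base layer.
    unmerged_out = base(x) + scaling * (x @ A.T @ B.T)

    # Adapter folded into a single plain linear layer.
    merged = nn.Linear(in_features, out_features, bias=False)
    merged.weight.copy_(base.weight + scaling * (B @ A))
    merged_out = merged(x)
```

After merging, the model carries no adapter modules, which is what allows it to serve as the base for later instruction or chat tuning.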
This recommendation follows the intended use of MiCA as a method for injecting domain knowledge into pretrained representations while constraining the update to the selected minor-component subspace.
Main changes
Tests
This PR adds tests covering:
Limitations