Documentation re-structure #3300

Open
githubnemo wants to merge 24 commits into huggingface:main from githubnemo:feature/doc-restructuring

@githubnemo
Collaborator

@githubnemo githubnemo commented Jun 3, 2026

The current state of the PEFT docs is not one of structure and I was constantly annoyed that whenever I wanted to change something there were several places that needed touching and they all felt disconnected. So this is my attempt at structuring the docs. Some of these ideas are quite old (discussed in 01/2025) but are still valid.

I've removed most of the code guides without replacement. That's not ideal; I think we should have code examples, but I think they should be method-focused. Maybe one general example of a training workflow is sufficient because most methods follow the same scheme. I'd appreciate some feedback on this.

All details from the method guides (prompting, lora, oft/boft, etc.) are now integrated into the respective method pages instead. I would have hesitated to do this if these guides had integrated information about the adapters, but they didn't. I think it makes a lot more sense to have one place for each method to gather examples/tips/recommendations, and that is now the package_reference/<method> page. This page now also hosts a small space that shows the MetaMathQA (and potentially other) benchmark results highlighted for that method.

I've moved the LoRA initializations to package_reference/lora#Initialization and converted the init methods to <hfoption> tags. This collapses them to a list but may reduce searchability within the document: Firefox, at least, is not able to search 'through' the option tabs. It also means they don't appear in the ToC, so people specifically searching for, say, PiSSA won't find it directly. I think that's OK though, since the docs search is able to locate it.

The quicktour is a bit more detailed about what happens under the hood (quick doesn't have to mean simplistic) and includes some new visualizations. I hope that we can integrate more visualizations in the future where it makes sense.

@BenjaminBossan
Member

Thanks a lot for revamping the PEFT docs, which I agree are not very user friendly at the moment. Could you please resolve the two merge conflicts so that preview docs could be rendered? I think it makes more sense to review the docs as a whole than going through the diff (which is probably showing a lot of text that has just moved places).

One concern that I have is that links to the PEFT docs could break with the new structure. Thus I have two questions:

  1. Did you update doc links we may have in PEFT to ensure that they'll be up to date?
  2. How do we deal with external links? It could be e.g. other repos (say, Axolotl, Hermes skill, etc.) but could also concern HF repos (e.g. links from PEFT or Transformers issues).

nemo added 2 commits June 4, 2026 13:10
The space was not that useful anymore since most methods are compatible
with most models.

The front page buttons are, at least temporarily, removed, with the exception
of the quicktour and method overview buttons. I like the visuals
but there should only be elements that are useful.
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@githubnemo
Collaborator Author

> One concern that I have is that links to the PEFT docs could break with the new structure. Thus I have two questions:
>
> 1. Did you update doc links we may have in PEFT to ensure that they'll be up to date?

I didn't at the time, but now I have. There were 14 occurrences of now-broken links, all of which are fixed.

> 2. How do we deal with external links? It could be e.g. other repos (say, Axolotl, Hermes skill, etc.) but could also concern HF repos (e.g. links from PEFT or Transformers issues).

I've added a _redirects.yml with the most common redirects I found (mostly from transformers). I also checked axolotl, diffusers and unsloth - the latter was not easy to analyze systematically as I couldn't find the docs as plain text, so I resorted to delegating to an agent which didn't find references to the PEFT docs.

The Hermes PEFT skill (https://github.com/NousResearch/hermes-agent/tree/main/optional-skills/mlops/peft) doesn't seem to link to changed pages in the docs.
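
The kind of link audit described above can be sketched as a small check: given the set of pages that exist after the restructure and a redirect mapping, report every linked path that resolves to neither. The function name, the page names, and the flat old-path to new-path mapping are illustrative assumptions; this is not the actual content of `_redirects.yml` nor the tooling used in this PR.

```python
def find_broken_links(links, pages, redirects):
    """Return the links that resolve neither directly nor via one redirect."""
    broken = []
    for link in links:
        # Anchors (#section) do not affect which page is served.
        path = link.split("#", 1)[0]
        target = redirects.get(path, path)
        if target not in pages:
            broken.append(link)
    return broken


# Hypothetical page layout after the restructure.
pages = {"quicktour", "package_reference/lora"}
# Hypothetical redirect from an old guide page to a method page.
redirects = {"developer_guides/lora": "package_reference/lora"}
links = [
    "developer_guides/lora#initialization",  # old path, covered by a redirect
    "package_reference/lora",                # already valid
    "task_guides/some_old_guide",            # old path without a redirect
]

broken = find_broken_links(links, pages, redirects)
print(broken)
```

Running such a check over links collected from downstream repositories would flag old paths that still need a redirect entry.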

githubnemo added a commit that referenced this pull request Jun 8, 2026
PR #3300 drafts the idea of embedding the method comparison results into the
respective method pages. This calls for a lighter version of the existing space
to limit the needed space. This is what `app_embed.py` is.

Most of the common processing has moved to the existing and aptly named
`processing.py`.

I think that this is better than having a layout switch in `app.py` as
these apps are meant to be as flat as can be to be readable and maintainable.
@githubnemo githubnemo marked this pull request as ready for review June 8, 2026 12:54
@githubnemo
Collaborator Author

I think this is now ready for review. Sorry about the huge PR but dissolving the guides into the individual method pages made a relatively big splash in terms of changes, even though the individual changes are quite small.

@stevhliu it would be super cool if you could take a look as well :)

When reviewing the rendered doc on moon-ci-docs I noticed that the new images are rendered with borders (esp. visible in the quicktour) and the ToC indentation for LoRA variants is broken but I have no clue how to fix this. @stevhliu do you have an idea?

Member

@BenjaminBossan BenjaminBossan left a comment

Thanks a LOT for working on overhauling the PEFT docs. They always felt lacking and suboptimally structured to me, so I'm very happy to see improvements there.

For this review, I focused on the general sections but haven't reviewed the entries for the individual PEFT methods. This was in order to break down the review into smaller parts, as I'm not going to finish it today. It may also help avoid duplicate effort between me and Steven.

As a more general comment, I saw that some added parts contain manual line breaks, e.g. in overview.md. I would suggest removing those completely.

I like the idea of including a benchmark overview for each PEFT method. Now that we have image generation too, it would be great to add an option to toggle the benchmark, but let's leave that to a future PR. I noticed, however, that not every PEFT method includes the benchmark, e.g. HRA is missing it. Also, some methods like HiRA have the graph but no corresponding data points, but maybe its result was added after the space was deployed?

I also wonder if we should not fully remove the legend, as the resulting graph can become quite cramped:

Image

There is also a bit of an inconsistency about the legend, e.g. for Lily it only labels the line but not the points. I think it should be removed for simplicity.

Comment thread docs/source/index.md
<div class="flex flex-col basis-1/4">
There are numerous methods to "adapt" existing models, often extensively integrating into the model. PEFT can be thought of as a framework for arbitrary methods of model adaption (modifying weights, wrapping layers, manipulating KV-caches, ...) while also serving as a reference implementation for many fine-tuning methods.
</div>
<div class="flex flex-col basis-3/4 pl-10 pr-10"><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/adapter_installation.png" width="100%"></div>
Member

This looks a bit odd (middle part) with dark theme:

Image

Comment thread docs/source/quicktour.md

## Multiple adapters

PEFT supports installing multiple adapters (of the same kind, in this document this would be LoRA) on top of a base model. When you call `get_peft_model` there is only one adapter named `"default"` but you can add as many additional adapters by calling `peft_model.add_adapter(adapter_name=...)`.
Member

Suggested change
PEFT supports installing multiple adapters (of the same kind, in this document this would be LoRA) on top of a base model. When you call `get_peft_model` there is only one adapter named `"default"` but you can add as many additional adapters by calling `peft_model.add_adapter(adapter_name=...)`.
PEFT supports installing multiple adapters (of the same kind, in this document this would be LoRA) on top of a base model. When you call `get_peft_model` there is only one adapter named `"default"` but you can add as many additional adapters as you want by calling `peft_model.add_adapter(adapter_name=...)`.

Comment thread docs/source/quicktour.md
```python
model = AutoPeftModel.from_pretrained("smangrul/openai-whisper-large-v2-LORA-colab")
```

## Multiple adapters
Member

In the section above, the docs describe the AutoPeftModel API for loading trained adapters. I'm just wondering if we should not at the very least mention the PeftModel.from_pretrained(base_model, adapter_id) API as well.


## Choosing the right method

Not every PEFT method is built equally and some formulations are easier to build in a memory efficient manner. If you are on a memory budget it makes sense to check out the [PEFT method comparison suite](https://huggingface.co/spaces/peft-internal-testing/PEFT-method-comparison) and filter for **maximum** accelerator memory usage. Average accelerator memory usage can be fairly equal across methods but not every method scales equally with activations and sequence length and is more prone to memory spikes than others.
Member

I think as is, the last sentence doesn't quite make sense, even though it's clear what is meant. Here is a suggestion for a different wording.

Suggested change
Not every PEFT method is built equally and some formulations are easier to build in a memory efficient manner. If you are on a memory budget it makes sense to check out the [PEFT method comparison suite](https://huggingface.co/spaces/peft-internal-testing/PEFT-method-comparison) and filter for **maximum** accelerator memory usage. Average accelerator memory usage can be fairly equal across methods but not every method scales equally with activations and sequence length and is more prone to memory spikes than others.
Not every PEFT method is built equally and some formulations are easier to build in a memory efficient manner. If you are on a memory budget it makes sense to check out the [PEFT method comparison suite](https://huggingface.co/spaces/peft-internal-testing/PEFT-method-comparison) and filter for **maximum** accelerator memory usage. Average accelerator memory usage can be fairly equal across methods but not every method scales equally with activations and sequence length; some methods are more prone to memory spikes than others.


Especially when targeting large layers like language modeling heads or embedding layers to fine-tune specific tokens it might make sense to look into [using trainable tokens](troubleshooting#using-trainable-tokens).

## Chunked NLL loss
Member

I'd put this section last, I think the other ones below are more generally applicable.


## Quantization

Quantization is one of the best ways to reduce memory consumption *of the base model* and will, depending on the employed quantization, also reduce activation memory. Since the PEFT methods will only take up a small portion of the total number of parameters, PEFT defaults to use a higher precision than the base model. This can also have the effect that adapters can mitigate some of the quality loss incured by quantization methods. Read the [PEFT quantization guide](quantization).
Member

Suggested change
Quantization is one of the best ways to reduce memory consumption *of the base model* and will, depending on the employed quantization, also reduce activation memory. Since the PEFT methods will only take up a small portion of the total number of parameters, PEFT defaults to use a higher precision than the base model. This can also have the effect that adapters can mitigate some of the quality loss incured by quantization methods. Read the [PEFT quantization guide](quantization).
Quantization is one of the best ways to reduce memory consumption *of the base model* and will, depending on the employed quantization, also reduce activation memory. Since the PEFT methods will only take up a small portion of the total number of parameters, PEFT defaults to use a higher precision than the base model. This can also have the effect that adapters can mitigate some of the quality loss incurred by quantization methods. Read the [PEFT quantization guide](quantization).


## Gradient Checkpointing

You can trade memory with computation by only saving every nth gradient between layers and computing the rest on the fly. Check out the [gradient checkpointing](https://huggingface.co/docs/transformers/grad_checkpointing) documentation of Transformers to learn more.
Member

Maybe worth mentioning that if not using Transformers or Diffusers, users may have to implement their own GC logic.
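
As a rough illustration of what "own GC logic" could look like for a custom model, here is a sketch built on `torch.utils.checkpoint.checkpoint`. The module, its sizes, and the per-block granularity are assumptions; this is not how Transformers or Diffusers implement it internally.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class CheckpointedMLP(nn.Module):
    """Hypothetical custom model that checkpoints each block."""

    def __init__(self, dim=16, depth=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(depth)]
        )

    def forward(self, x):
        for block in self.blocks:
            # Intermediate activations inside the block are not stored;
            # they are recomputed during the backward pass.
            x = checkpoint(block, x, use_reentrant=False)
        return x


torch.manual_seed(0)
model = CheckpointedMLP()
loss = model(torch.randn(4, 16)).sum()
loss.backward()

has_grads = all(p.grad is not None for p in model.parameters())
print(has_grads)
```

The trade-off is one extra forward computation per checkpointed block in exchange for not holding its activations in memory.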

Giving general advice for training large models is hard but for generative
models, especially language models, you can follow these steps:

1. use prompting (few-shot examples in the prompt) to see if the model is
Member

Suggested change
1. use prompting (few-shot examples in the prompt) to see if the model is
1. use prompting (e.g. few-shot examples in the prompt) to see if the model is

fine-tuning step is potentially unlearning past knowledge.

The [PEFT method comparison suite](https://huggingface.co/spaces/peft-internal-testing/PEFT-method-comparison) aims to give a rough overview of (most) implemented methods on selected benchmarks and models.

Member

It could also be useful to mention some criteria here that may guide you in choosing the appropriate PEFT method:

  • quantization: not all methods support quantized base models
  • feature set: not all features are supported for all methods (e.g. multiple adapters, mixed adapter inference)
  • layer types: linear layers are generally always supported, but not all methods support embedding (important for expanding vocab) or conv (important for some image models)
  • inference runtime: PEFT methods generally add runtime overhead but some of that can be mitigated (e.g. some methods allow merging, removing the overhead)


## Layer Tuning

Layer Tuning categorizes methods that target specific layers of a model such as [LayerNorm Tuning](../package_reference/layernorm_tuning)
Member

"target specific layers" doesn't make it quite clear that it means that existing parameters of the base model are made trainable, since you could say that LoRA also targets specific layers. I would state that explicitly.
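
To make the distinction concrete, here is a plain PyTorch sketch of the idea: existing base-model parameters are unfrozen, whereas LoRA adds new parameters next to frozen ones. This is an illustration under assumed layer choices, not PEFT's LayerNorm Tuning implementation.

```python
import torch.nn as nn

# Hypothetical base model with one LayerNorm layer.
model = nn.Sequential(
    nn.Linear(8, 8),
    nn.LayerNorm(8),
    nn.Linear(8, 2),
)

# Freeze everything first.
for param in model.parameters():
    param.requires_grad = False

# Then make the existing LayerNorm parameters trainable again.
for module in model.modules():
    if isinstance(module, nn.LayerNorm):
        for param in module.parameters():
            param.requires_grad = True

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)
```

No parameters are added here; the trainable set is a subset of the parameters the base model already had.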
