Question about orthogonalization granularity for multi-head attention weights #42

@Faded-Nebula

Hi, thanks for releasing the Muon implementation and examples.

I noticed that in the provided attention example, the optimizer applies Muon to the concatenated Q/K/V projection matrices (e.g., Wq of shape [d_model, n_heads * d_head]) rather than performing orthogonalization per attention head (i.e., per [d_model, d_head] block).

I would like to ask about the rationale for this design choice.

From my understanding, Muon is defined as a matrix-level optimizer, so treating each projection matrix as a single 2-D parameter is consistent with the theory. However, orthogonalizing the concatenated matrix also couples the heads (each head's update then depends on the gradients of the other heads), which implicitly assumes that cross-head correlations should be taken into account during the orthogonalization step.
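
To make the two options concrete, here is a minimal NumPy sketch of what I mean. It is not taken from this repository: the helper names `orthogonalize` and `orthogonalize_per_head`, the toy shapes, and the random stand-in gradient are all hypothetical, and an exact SVD-based polar factor is used in place of whatever approximate orthogonalization the optimizer actually runs.

```python
import numpy as np


def orthogonalize(g):
    """Exact polar factor U @ Vt of g (an assumed stand-in for the
    optimizer's approximate orthogonalization step)."""
    u, _, vt = np.linalg.svd(g, full_matrices=False)
    return u @ vt


def orthogonalize_per_head(g, n_heads):
    """Orthogonalize each [d_model, d_head] column block independently."""
    blocks = np.split(g, n_heads, axis=1)
    return np.concatenate([orthogonalize(b) for b in blocks], axis=1)


# Toy shapes, chosen only for illustration.
d_model, n_heads, d_head = 32, 4, 4
rng = np.random.default_rng(0)
grad_wq = rng.standard_normal((d_model, n_heads * d_head))  # stand-in gradient

whole = orthogonalize(grad_wq)                      # concatenated treatment
per_head = orthogonalize_per_head(grad_wq, n_heads)  # head-wise treatment

# Cross-Gram block between the columns of head 0 and head 1.
cross_whole = whole[:, :d_head].T @ whole[:, d_head:2 * d_head]
cross_head = per_head[:, :d_head].T @ per_head[:, d_head:2 * d_head]

print("max |cross-head entry|, concatenated:", np.abs(cross_whole).max())
print("max |cross-head entry|, per-head:    ", np.abs(cross_head).max())
```

In this sketch the concatenated update has mutually orthonormal columns across all heads, so the update for one head depends on the gradients of the others. The per-head update is orthonormal only within each block, and the cross-head blocks are left unconstrained.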

Could you clarify the reasoning behind optimizing the concatenated matrices instead of head-wise blocks? Is this primarily a mathematical consideration, a stability constraint, or an engineering/performance decision?

Thank you!
