Modded GPT pseudoinverse improvement #265

Description

@rohan-anil

https://x.com/_arohan_/status/2064216022232826080

The main change that improved things was replacing:

  inv_power_L = L.pow(-1.0 / root)

with a truncated inverse-power version:

  positive_eigenvalue_mask = L > 1e-15
  inv_power_L = torch.zeros_like(L)
  inv_power_L[positive_eigenvalue_mask] = L[positive_eigenvalue_mask].pow(-1.0 / root)

My read is that the inverse root was amplifying very small eigenvalues too aggressively: for an eigenvalue near zero, `L.pow(-1.0 / root)` becomes enormous (or infinite at exactly zero). Zeroing out those entries instead makes the operation behave more like a pseudoinverse / rank-truncated inverse, and that seems to work better for this problem. This problem has been dominating the Twitter conversations around optimizers.
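To make the difference concrete, here is a minimal sketch comparing the plain inverse power with the truncated one on a toy spectrum. The helper name `truncated_inverse_power` and the example eigenvalues are made up for illustration and are not taken from the codebase; the `1e-15` threshold is the one from the snippet above. It uses `torch.where` with a clamp rather than mask indexing, which should give the same result while avoiding `inf` in the intermediate values.

```python
import torch


def truncated_inverse_power(L: torch.Tensor, root: int, eps: float = 1e-15) -> torch.Tensor:
    """Return L**(-1/root) where L > eps, and 0 elsewhere (hypothetical helper)."""
    mask = L > eps
    # Clamp before pow so entries that will be masked out never produce inf.
    safe_L = torch.clamp(L, min=eps)
    return torch.where(mask, safe_L.pow(-1.0 / root), torch.zeros_like(L))


# Toy spectrum (assumed values): two well-conditioned eigenvalues,
# one tiny eigenvalue below the threshold, and one exactly zero.
L = torch.tensor([4.0, 1.0, 1e-20, 0.0], dtype=torch.float64)
root = 2

plain = L.pow(-1.0 / root)                     # tiny entry blows up, zero entry gives inf
truncated = truncated_inverse_power(L, root)   # tiny and zero entries are dropped to 0

print("plain:    ", plain)
print("truncated:", truncated)
```

In this toy case the plain version scales the `1e-20` direction by roughly `1e10` and returns `inf` for the zero eigenvalue, while the truncated version leaves the two well-conditioned directions unchanged and ignores the rest, which is the pseudoinverse-like behavior described above.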

I saw a few related stability options in the codebase. Is there another configuration you would recommend trying, or should we add explicit support for this pseudoinverse-style truncated inverse?
