https://x.com/_arohan_/status/2064216022232826080
The main change that improved things was replacing:
inv_power_L = L.pow(-1.0 / root)
with a truncated inverse-power version:
positive_eigenvalue_mask = L > 1e-15
inv_power_L = torch.zeros_like(L)
inv_power_L[positive_eigenvalue_mask] = L[positive_eigenvalue_mask].pow(-1.0 / root)
My read is that the inverse root was amplifying very small eigenvalues too aggressively, and that zeroing them out instead makes the result behave more like a pseudoinverse (a rank-truncated inverse). That seems to work better for this problem, which has been dominating the Twitter conversations around optimizers.
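To make the comparison concrete, here is a minimal, self-contained sketch of the two variants applied to a rank-deficient matrix. The helper name `inverse_root_from_eigh`, its `eps` and `truncate` parameters, and the toy diagonal matrix are all made up for illustration; this is not the codebase's actual implementation, only the snippet above wrapped in an eigendecomposition.

```python
import torch


def inverse_root_from_eigh(mat, root, eps=1e-15, truncate=True):
    """Hypothetical helper: mat^(-1/root) for a symmetric PSD matrix via eigh."""
    # L holds the eigenvalues (ascending), Q the eigenvectors as columns.
    L, Q = torch.linalg.eigh(mat)
    if truncate:
        # Truncated variant: eigenvalues at or below eps map to 0,
        # as in a pseudoinverse, instead of being amplified.
        positive_eigenvalue_mask = L > eps
        inv_power_L = torch.zeros_like(L)
        inv_power_L[positive_eigenvalue_mask] = (
            L[positive_eigenvalue_mask].pow(-1.0 / root)
        )
    else:
        # Original variant: every eigenvalue is raised to -1/root.
        inv_power_L = L.pow(-1.0 / root)
    # Reassemble Q @ diag(inv_power_L) @ Q^T by scaling the columns of Q.
    return (Q * inv_power_L.unsqueeze(0)) @ Q.T


# Toy matrix with two well-conditioned eigenvalues, one tiny one, and one zero.
mat = torch.diag(torch.tensor([4.0, 1.0, 1e-20, 0.0], dtype=torch.float64))

plain = inverse_root_from_eigh(mat, root=2, truncate=False)
truncated = inverse_root_from_eigh(mat, root=2, truncate=True)

print("plain is finite:    ", bool(torch.isfinite(plain).all()))
print("truncated is finite:", bool(torch.isfinite(truncated).all()))
print("truncated diagonal: ", torch.diagonal(truncated))
```

In the truncated variant the near-zero directions are dropped, so the result only acts on the subspace spanned by the eigenvectors whose eigenvalues exceed `eps`. In the plain variant those same directions are either scaled by a very large factor or become non-finite.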
I saw a few related stability options in the codebase. Is there another configuration you would recommend trying, or should we add explicit support for this pseudoinverse-style truncated inverse?