Two questions about the learning rate adjustment #40

@AlexDeMoor

Hello,

I have two questions about the adjusted learning rate that I couldn't find answered in your tech report or in the other discussions.

  1. I understand that you flatten the gradient to 2D when its dim > 2. However, when adjusting the lr, you define A and B as the first two dimensions of the parameter's shape. Doesn't this lead to a mismatched value of B when p.dim > 2? For completeness, wouldn't it be necessary to flatten the parameter shape as well (e.g. p.view(p.size(0), -1).shape) before computing A and B?
  2. It seems the adjusted learning rate is not applied during weight decay: p.data.mul_(1 - lr * wd). Is this intentional, and if so, could you elaborate? Is this how Adam(W) applies weight decay?
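
To make both questions concrete, here is a minimal sketch in plain Python (no tensors, only shapes). It assumes the adjustment scales lr by 0.2 * sqrt(max(A, B)); that rule, the helper names, the example shape, and the lr/wd values are illustrative assumptions on my side, not quoted from the repository.

```python
import math

def first_two_dims(shape):
    """A, B read directly from the first two dimensions of the parameter."""
    return shape[0], shape[1]

def flattened_dims(shape):
    """A, B after flattening trailing dims, like p.view(p.size(0), -1).shape."""
    return shape[0], math.prod(shape[1:])

def adjusted_lr(lr, a, b, scale=0.2):
    # ASSUMPTION: the adjustment is lr * scale * sqrt(max(A, B)).
    return lr * scale * math.sqrt(max(a, b))

# Hypothetical 4D parameter, e.g. a conv kernel (out, in, kH, kW).
shape = (64, 32, 3, 3)
lr, wd = 0.02, 0.1  # hypothetical hyperparameters

# Question 1: B differs depending on whether the shape is flattened first.
a1, b1 = first_two_dims(shape)   # (64, 32)
a2, b2 = flattened_dims(shape)   # (64, 288)
lr_first_two = adjusted_lr(lr, a1, b1)
lr_flattened = adjusted_lr(lr, a2, b2)

# Question 2: the decay factor with the base lr vs. with the adjusted lr.
decay_with_base_lr = 1 - lr * wd
decay_with_adjusted_lr = 1 - lr_flattened * wd

print("A, B from first two dims:", (a1, b1), "-> lr", lr_first_two)
print("A, B after flattening:   ", (a2, b2), "-> lr", lr_flattened)
print("decay factor (base lr):    ", decay_with_base_lr)
print("decay factor (adjusted lr):", decay_with_adjusted_lr)
```

Under these assumptions the two ways of reading B give different adjusted learning rates for the same parameter, and the weight decay factor also depends on which lr is used.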

Kind regards,

Alexandre De Moor
