Add: GradAccumulator wrapper struct for optimizer#28

Open
DrJesseGlass wants to merge 3 commits into main from grad-optimizer-opt-accum

Conversation

@DrJesseGlass

Owner

Adds GradAccumulator<O: Optimizer>, a wrapper for optimizers that enables gradient accumulation during training. Call accumulate(&loss) once per micro-batch, then call step() once to apply a single averaged update. With K micro-batches this gives an effective batch size of K × micro_batch without holding K computation graphs in memory simultaneously.
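The averaging claim can be checked with a small, dependency-free sketch. It does not use candle: `mse_grad` and `accumulated_grad` are hypothetical helper names, the model is a single scalar weight, and gradients are plain `f64` values. Under those assumptions, and with equal-sized micro-batches, the mean of the per-micro-batch gradients matches the gradient of the full batch:

```rust
/// Gradient of the batch-mean loss 0.5 * (w * x - y)^2 with respect to the
/// scalar weight `w`. Toy stand-in for a backward pass; not candle code.
fn mse_grad(w: f64, xs: &[f64], ys: &[f64]) -> f64 {
    let n = xs.len() as f64;
    xs.iter().zip(ys).map(|(x, y)| (w * x - y) * x).sum::<f64>() / n
}

/// Walks the data one micro-batch at a time, keeping only a running sum and
/// a count, then returns the averaged gradient (the value a single step()
/// would apply). Assumes non-empty input and equal-sized micro-batches.
fn accumulated_grad(w: f64, xs: &[f64], ys: &[f64], micro: usize) -> f64 {
    assert!(micro > 0, "micro-batch size must be positive");
    let mut sum = 0.0;
    let mut count = 0usize;
    for (cx, cy) in xs.chunks(micro).zip(ys.chunks(micro)) {
        sum += mse_grad(w, cx, cy); // one "accumulate" per micro-batch
        count += 1;
    }
    sum / count as f64
}
```

Calling `accumulated_grad(w, &xs, &ys, 2)` on eight samples should agree with `mse_grad(w, &xs, &ys)` up to floating-point rounding. If the last micro-batch is shorter than the others, the two values differ, because each micro-batch mean is then weighted equally regardless of its size.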

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 00eba4a267

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread candle-nn/src/optim.rs Outdated
Comment on lines +305 to +307
let Some(mut store) = self.accum.take() else {
    return Ok(());
};

P2: Preserve accumulated gradients across failed step calls

If self.opt.step(&store) returns an error (for example due to a backend/device failure), the gradients are already removed from self.accum via take(), so they are dropped and cannot be retried, while self.count remains non-zero because it is reset only after a successful step. This leaves the accumulator in an inconsistent state (pending() > 0 but no stored grads) and can silently lose an update after transient failures.
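One possible way to address this, sketched with toy types rather than candle's real `Optimizer` trait and `GradStore`: borrow the accumulated sum instead of calling take(), and clear the state only after the inner step succeeds. Everything below is a hypothetical stand-in (`FlakySgd`, `inner()`, the `Vec<f64>` gradient store, `accumulate` taking gradients directly instead of a loss), and it pays for an extra averaged copy so that the stored sum stays unscaled for a retry.

```rust
// Hypothetical toy types for illustration; not candle's API.
pub trait Optimizer {
    fn step(&mut self, grads: &[f64]) -> Result<(), String>;
}

/// SGD that fails once when `fail_next` is set, to mimic a transient error.
pub struct FlakySgd {
    pub params: Vec<f64>,
    pub lr: f64,
    pub fail_next: bool,
}

impl Optimizer for FlakySgd {
    fn step(&mut self, grads: &[f64]) -> Result<(), String> {
        if self.fail_next {
            self.fail_next = false;
            return Err("transient device failure".to_string());
        }
        for (p, g) in self.params.iter_mut().zip(grads) {
            *p -= self.lr * g;
        }
        Ok(())
    }
}

pub struct GradAccumulator<O: Optimizer> {
    opt: O,
    accum: Option<Vec<f64>>, // running sum of micro-batch gradients
    count: usize,            // micro-batches accumulated since the last step
}

impl<O: Optimizer> GradAccumulator<O> {
    pub fn new(opt: O) -> Self {
        Self { opt, accum: None, count: 0 }
    }

    /// Adds one micro-batch's gradients to the running sum.
    pub fn accumulate(&mut self, grads: &[f64]) {
        match self.accum.as_mut() {
            Some(sum) => {
                for (s, g) in sum.iter_mut().zip(grads) {
                    *s += g;
                }
            }
            None => self.accum = Some(grads.to_vec()),
        }
        self.count += 1;
    }

    pub fn pending(&self) -> usize {
        self.count
    }

    pub fn inner(&self) -> &O {
        &self.opt
    }

    /// Applies the averaged update. The sum and count are cleared only after
    /// the inner optimizer succeeds, so a failed call can be retried.
    pub fn step(&mut self) -> Result<(), String> {
        let Some(sum) = self.accum.as_ref() else {
            return Ok(());
        };
        let k = self.count as f64;
        let mean: Vec<f64> = sum.iter().map(|g| g / k).collect();
        self.opt.step(&mean)?; // on error, accum and count are untouched
        self.accum = None;
        self.count = 0;
        Ok(())
    }
}
```

With this shape, a failed step() leaves pending() and the stored gradients in agreement, and a second step() applies the same averaged update. An alternative that avoids the copy would be to keep take() and put the store back in the error branch, provided the gradients have not already been scaled in place.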
