Enable DINOv3 PP microbatching + gradient accumulation with loss scaling and validation #5
Merged
Motivation
- `pp>1` was unusable due to an impossible microbatch equality guard and the lack of gradient-accumulation (GA) support, which together prevented the multi-stage microbatching required by PyTorch PP.

Description
- `pipeline_dinov3` now uses TorchTitan-compatible validation that enforces `pipeline_parallel_microbatch_size > 0`, `training.local_batch_size % pipeline_parallel_microbatch_size == 0`, and `n_microbatches = local_batch_size // pipeline_parallel_microbatch_size >= pipeline_parallel_degree` (file: `dinov3/infra/pipeline.py`).
- Added `loss_scale: float = 1.0` plumbing into the model/training path so the backward pass is scaled correctly under PP microbatching and trainer GA:
  - `SSLMetaArch.forward_backward(..., loss_scale)` (file: `dinov3/train/ssl_meta_arch.py`)
  - `DinoV3PipelineStageModel.forward(..., loss_scale)` (file: `dinov3/infra/pipeline.py`)
  - `DinoV3SSLModel.forward_backward(..., loss_scale)` (file: `dinov3/model/model.py`)
- Enabled gradient accumulation in `DinoV3Trainer` by removing the previous `NotImplemented` guard, introducing `_ga_steps`, iterating GA microbatches per optimizer step, and applying these loss scaling formulas (file: `dinov3/trainer.py`):
  - non-PP: `loss_scale = 1/ga_steps`
  - PP: `loss_scale = 1/(ga_steps * pp_microbatches_per_step)`, where `pp_microbatches_per_step = local_batch_size / pipeline_parallel_microbatch_size`
- Throughput accounting uses `effective_local_batch = local_batch_size * gradient_accumulation_steps` and `effective_global_batch = effective_local_batch * data_parallel_degree` (file: `dinov3/trainer.py`).
- Updated the docs (`docs/02-how-training-works.md`, `docs/10-debugging-troubleshooting.md`).
- New/updated tests:
  - `tests/test_dinov3_pipeline_dispatch.py` (microbatch divisibility and `n_microbatches >= pp_degree`)
  - `tests/test_dinov3_trainer_pp.py` (PP+GA loss scaling and non-PP GA behavior)
  - `tests/test_dinov3_trainer_metrics_accounting.py` (GA throughput accounting)
- Modified source files: `dinov3/infra/pipeline.py`, `dinov3/train/ssl_meta_arch.py`, `dinov3/model/model.py`, `dinov3/trainer.py`.

Testing
- Ran `python -m compileall` on the modified modules and tests, which succeeded for the updated files.
- Ran `pytest` in the current environment, which failed to collect tests because the runtime packages `torch` and `torchtitan` are missing, so full unit test execution could not be completed here. The new/updated unit tests target the PP validation, PP+GA trainer behavior, non-PP GA behavior, and throughput accounting, and should pass in an environment with the project dependencies installed.
- The `loss_scale` plumbing and related changes follow the documented formulas and are exercised by the added unit tests (to be executed in CI or a full dev environment).
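
For reviewers, the validation rules and scaling formulas listed in the Description can be sketched in plain Python. This is only an illustration of the arithmetic, not the code in `dinov3/infra/pipeline.py` or `dinov3/trainer.py`: the function names `validate_pp_microbatching`, `compute_loss_scale`, and `effective_batch_sizes` are hypothetical, and the error messages are assumptions.

```python
from typing import Optional, Tuple


def validate_pp_microbatching(local_batch_size: int,
                              microbatch_size: int,
                              pp_degree: int) -> int:
    """Hypothetical helper: apply the three PP rules and return n_microbatches."""
    if microbatch_size <= 0:
        raise ValueError("pipeline_parallel_microbatch_size must be > 0")
    if local_batch_size % microbatch_size != 0:
        raise ValueError("local_batch_size must be divisible by "
                         "pipeline_parallel_microbatch_size")
    n_microbatches = local_batch_size // microbatch_size
    if n_microbatches < pp_degree:
        raise ValueError("n_microbatches must be >= pipeline_parallel_degree")
    return n_microbatches


def compute_loss_scale(ga_steps: int,
                       pp_microbatches_per_step: Optional[int] = None) -> float:
    """Hypothetical helper: loss scale for non-PP (None) or PP runs."""
    if pp_microbatches_per_step is None:
        # non-PP: loss_scale = 1 / ga_steps
        return 1.0 / ga_steps
    # PP: loss_scale = 1 / (ga_steps * pp_microbatches_per_step)
    return 1.0 / (ga_steps * pp_microbatches_per_step)


def effective_batch_sizes(local_batch_size: int,
                          ga_steps: int,
                          dp_degree: int) -> Tuple[int, int]:
    """Hypothetical helper: (effective_local_batch, effective_global_batch)."""
    effective_local = local_batch_size * ga_steps
    return effective_local, effective_local * dp_degree


# Example values chosen for illustration only.
n_mb = validate_pp_microbatching(local_batch_size=16, microbatch_size=4, pp_degree=2)
print(n_mb)                                  # 4
print(compute_loss_scale(ga_steps=2))        # 0.5
print(compute_loss_scale(2, n_mb))           # 0.125
print(effective_batch_sizes(16, 2, 4))       # (32, 128)
```

With these example values, each of the 2 x 4 = 8 backward passes per optimizer step contributes 1/8 of its loss, so the accumulated gradient matches an average over the effective local batch of 32 samples.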